
Stratified Splitting for Efficient Monte Carlo Integration

Radislav Vaisman, Robert Salomone, and Dirk P. Kroese
School of Mathematics and Physics
The University of Queensland, Brisbane, Australia

E-mail: [email protected], [email protected], [email protected]

Summary. The efficient evaluation of high-dimensional integrals is of importance in both theoretical and practical fields of science, such as Bayesian inference, statistical physics, and machine learning. However, due to the curse of dimensionality, deterministic numerical methods are inefficient in high-dimensional settings. Consequentially, for many practical problems one must resort to Monte Carlo estimation. In this paper, we introduce a novel Sequential Monte Carlo technique called Stratified Splitting which enjoys a number of desirable properties not found in existing methods. Specifically, the method provides unbiased estimates and can handle various integrand types including indicator functions, which are used in rare-event probability estimation problems. Moreover, this algorithm achieves a rigorous efficiency guarantee in terms of the required sample size. The results of our numerical experiments suggest that the Stratified Splitting method is capable of delivering accurate results for a wide variety of integration problems.

Keywords: Monte Carlo integration; Multilevel splitting; Markov chain Monte Carlo; Algorithmic efficiency; Sequential Monte Carlo; Resample-move; Nested sampling; Power posteriors


1. Introduction

We consider the evaluation of expectations and integrals of the form

\[
\mathbb{E}_f[\varphi(\mathbf{X})] = \sum_{\mathbf{x} \in \mathcal{X}} \varphi(\mathbf{x})\, f(\mathbf{x}) \qquad \text{or} \qquad \mathbb{E}_f[\varphi(\mathbf{X})] = \int_{\mathcal{X}} \varphi(\mathbf{x})\, f(\mathbf{x}) \,\mathrm{d}\mathbf{x},
\]

where X ∼ f is a random variable taking values in a set X ⊆ R^d, f is a probability density function (pdf) with respect to the Lebesgue or counting measure, and ϕ : X → R is a real-valued function.
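As a point of reference for the methods discussed below, the following is a minimal crude Monte Carlo sketch of this estimator in Python. The Gaussian density and quadratic ϕ are illustrative choices of ours and are not taken from the paper.

```python
import numpy as np

def crude_monte_carlo(phi, sample_f, n_samples=100_000, seed=0):
    """Estimate E_f[phi(X)] by averaging phi over i.i.d. draws from f."""
    rng = np.random.default_rng(seed)
    x = sample_f(rng, n_samples)                      # n_samples draws from f
    values = phi(x)
    return values.mean(), values.std(ddof=1) / np.sqrt(n_samples)

# Illustrative example: f = N(0, I_d) and phi(x) = ||x||^2, so E_f[phi(X)] = d.
d = 10
estimate, std_err = crude_monte_carlo(
    phi=lambda x: (x ** 2).sum(axis=1),
    sample_f=lambda rng, n: rng.standard_normal((n, d)),
)
print(estimate, std_err)
```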

The evaluation of such high-dimensional integrals is of critical importance in many scientific areas, including statistical inference (Gelman et al., 2003; Lee, 2004), rare-event estimation (Asmussen and Glynn, 2007), machine learning (Russell and Norvig, 2009; Koller and Friedman, 2009), and cryptography (McGrayne, 2011). An important application is the calculation of the normalizing constant of a probability distribution, such as the marginal likelihood (model evidence) in Bayesian statistics (Hooper, 2013). However, often obtaining even a reasonably accurate estimate of Ef[ϕ(X)] can be hard (Robert and Casella, 2004).

In this paper, we propose a novel Sequential Monte Carlo (SMC) approach for reliable and fast estimation of high-dimensional integrals. Our method extends the Generalized Splitting (GS) algorithm of Botev and Kroese (2012) to allow the estimation of quite general integrals. In addition, our algorithm is specifically designed to perform efficient sampling in regions of X where f takes small values and ϕ takes large values. In particular, we present a way of implementing stratification for variance reduction in the absence of knowing the strata probabilities. A major benefit of the proposed Stratified Splitting algorithm (SSA) is that it provides an unbiased estimator of Ef[ϕ(X)], and that it can be analyzed in a non-asymptotic setting. In particular, the SSA provides a bound on the sample size required to achieve a predefined error.

Due to its importance, the high-dimensional integration problem has been considered extensively in the past. Computation methods that use Fubini's theorem (Friedman, 1980) and quadrature rules or extrapolations (Forsythe et al., 1977) suffer from the curse of dimensionality, with the number of required function evaluations growing exponentially with the dimension. In order to address this problem, many methods have been proposed. Examples include Bayesian quadrature, sparse grids, and various Monte Carlo, quasi-Monte Carlo, and Markov Chain Monte Carlo (MCMC) algorithms (O'Hagan, 1991; Morokoff and Caflisch, 1995; Newman and Barkema, 1999; Heiss and Winschel, 2008; Kroese et al., 2011). Many of these procedures are based on the SMC approach (Gilks and Berzuini, 2001; Chen et al., 2005; Del Moral et al., 2006; Friel and Pettitt, 2008; Andrieu et al., 2010), and provide consistent estimators that possess asymptotic normality. However, one might be interested in the actual number of required samples to achieve a predefined error bound. Our method, which also belongs to the SMC framework, is capable of addressing this issue.

Among alternative methods, we distinguish the Nested Sampling (NS) algorithm of Skilling (2006), the Annealed Importance Sampling (AIS) method of Neal (2001), and the Power posterior approach of Friel and Pettitt (2008), for their practical performance and high popularity (Murray et al., 2005; Feroz and Skilling, 2013; Andrieu et al., 2010). As always, due to the varied approaches of different methods and nuances of different problems, no individual method can be deemed universally better. For example, despite good practical performance and convergence in probability to the true integral value, the NS algorithm is not unbiased and, in fact, to ensure its consistency, both the sample size and the ratio of sampling iterations to sample population size must grow to infinity for certain classes of integrands (Evans, 2007). Moreover, the consistency of estimates obtained with Nested Sampling when Markov chain Monte Carlo (MCMC) is used for sampling remains an open problem (Chopin and Robert, 2010).

Similar to other well-known SMC methods, the SSA falls into a multilevel estimation framework, which will be detailed in Section 2. As with classical stratified sampling (see, e.g., Rubinstein and Kroese (2017), Chapter 5), the SSA defines a partition of the state space into strata, and uses the law of total probability to deliver an estimator of the value of the integral. To do so, one needs to obtain a sample population from each stratum and to know the exact probability of each such stratum. Under the classical stratified sampling framework, it is assumed that the former is easy to achieve and the latter is known in advance. However, such favorable scenarios are rarely seen in practice. In particular, obtaining samples from within a stratum and estimating the associated probability that a sample will be within this stratum is hard in general (Jerrum et al., 1986). To resolve this issue, the SSA incorporates a multilevel splitting mechanism (Kahn and Harris, 1951; Botev and Kroese, 2012; Rubinstein et al., 2013) and uses an appropriate MCMC method to sample from conditional densities associated with a particular stratum.

The rest of the paper is organized as follows. In Section 2 we introduce the SSA, explain its correspondence to a generic multilevel sampling framework, and prove that the SSA delivers an unbiased estimator of the expectation of interest. In Section 3, we provide a rigorous analysis of the approximation error of the proposed method. In Section 4, we introduce a difficult estimation problem called the weighted component model, for which the SSA provides the best possible efficiency result one can hope to achieve. Namely, we show that the SSA can obtain an arbitrary level of precision by using a sample size (and computation time) that is polynomial in the corresponding problem size. In Section 5, we report our numerical findings on various test cases that typify classes of problems for which the SSA is of practical interest. Finally, in Section 6 we summarize the results and discuss possible directions for future research.

2. Stratified splitting algorithm

2.1. Generic multilevel splitting framework

We begin by considering a very generic multilevel splitting framework, similar to (Gilks and Berzuini, 2001). Let X ∼ f be a random variable taking values in a set X, and consider a decreasing sequence of sets X = X0 ⊇ · · · ⊇ Xn = ∅. Define Zt = Xt−1 \ Xt for t = 1, . . . , n, and note that Xt−1 = ⋃_{i=t}^{n} Zi, and that {Zt} yields a partition of X; that is,
\[
\mathcal{X} = \bigcup_{t=1}^{n} \mathcal{Z}_t, \qquad \mathcal{Z}_{t_1} \cap \mathcal{Z}_{t_2} = \emptyset \quad \text{for } 1 \le t_1 < t_2 \le n. \tag{1}
\]

Then, we can define a sequence of conditional pdfs

\[
f_t(\mathbf{x}) = f(\mathbf{x} \mid \mathbf{x} \in \mathcal{X}_{t-1}) = \frac{f(\mathbf{x})\,\mathbf{1}\{\mathbf{x} \in \mathcal{X}_{t-1}\}}{\mathbb{P}_f(\mathbf{X} \in \mathcal{X}_{t-1})} \qquad \text{for } t = 1, \ldots, n, \tag{2}
\]

where 1 denotes the indicator function. Also, define

\[
g_t(\mathbf{x}) = f(\mathbf{x} \mid \mathbf{x} \in \mathcal{Z}_t) = \frac{f(\mathbf{x})\,\mathbf{1}\{\mathbf{x} \in \mathcal{Z}_t\}}{\mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t)} \qquad \text{for } t = 1, \ldots, n. \tag{3}
\]


Our main objective is to sample from the pdfs ft and gt in (2) and (3), respectively. To do so, we first formulate a generic multilevel splitting framework, given in Algorithm 1.

Algorithm 1: Generic multilevel splitting framework

input : X0, . . . , Xn and {ft, gt}, 1 ≤ t ≤ n.
output: Samples from ft and gt for 1 ≤ t ≤ n.
Create a multi-set X1 of samples from f1.
for t = 1 to n do
    Set Zt ← Xt ∩ Zt.
    Set Yt ← Xt \ Zt.
    if t < n then
        Create a multi-set Xt+1 of samples (particles) from ft+1, (possibly) using elements of the Yt set. This step is called the splitting or the rejuvenation step.
return the multi-sets {Xt}1≤t≤n and {Zt}1≤t≤n.

Note that the samples in {Xt}1≤t≤n and {Zt}1≤t≤n are distributed according to ft and gt, respectively, and these samples can be used to handle several tasks. In particular, the {Xt}1≤t≤n sets allow one to handle the general non-linear Bayesian filtering problem (Gilks and Berzuini, 2001; Gordon et al., 1993; Del Moral et al., 2006). Moreover, by tracking the cardinalities of the sets {Xt}1≤t≤n and {Zt}1≤t≤n, one is able to tackle hard rare-event probability estimation problems, such as delivering estimates of Pf(X ∈ Zn) (Botev and Kroese, 2012; Kroese et al., 2011; Rubinstein et al., 2013). Finally, it was recently shown by Vaisman et al. (2016) that Algorithm 1 can be used as a powerful variance minimization technique for any general SMC procedure. In light of the above, we propose taking further advantage of the sets {Xt}1≤t≤n and {Zt}1≤t≤n, to obtain an estimation method suitable for general integration problems.

2.2. The SSA set-up

Following the above multilevel splitting framework, it is convenient to construct the sequence of sets {Xt}0≤t≤n by using a performance function S : X → R, such that {Xt}0≤t≤n can be written as super level-sets of S for chosen levels γ0, . . . , γn, where γ0 and γn are equal to inf_{x∈X} S(x) and sup_{x∈X} S(x), respectively. In particular, Xt = {x ∈ X : S(x) ≥ γt} for t = 0, . . . , n. The partition {Zt}1≤t≤n, and the densities {ft}1≤t≤n and {gt}1≤t≤n, are defined as before via (1), (2), and (3), respectively. Similarly, one can define a sequence of sub level-sets of S; in this paper we use the latter for some cases and whenever appropriate.

Letting zt := Ef[ϕ(X) | X ∈ Zt] Pf(X ∈ Zt) for t = 1, . . . , n, and combining (1) with the law of total probability, we arrive at
\[
z := \mathbb{E}_f[\varphi(\mathbf{X})] = \sum_{t=1}^{n} \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]\, \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t) = \sum_{t=1}^{n} z_t. \tag{4}
\]

The SSA proceeds with the construction of estimators Ẑt for zt, t = 1, . . . , n, and, as soon as these are available, we can use (4) to deliver the SSA estimator for z, namely Ẑ = ∑_{t=1}^{n} Ẑt.

For 1 ≤ t ≤ n, let ϕt := Ef[ϕ(X) | X ∈ Zt] and pt := Pf(X ∈ Zt), and let Φ̂t and P̂t be estimators of ϕt and pt, respectively. We define Ẑt = Φ̂t P̂t, and recall that, under the multilevel splitting framework, we obtain the sets {Xt}1≤t≤n and {Zt}1≤t≤n. These sets are sufficient to obtain unbiased estimators {Φ̂t}1≤t≤n and {P̂t}1≤t≤n, in the following way.

(a) We define Φ̂t to be the (unbiased) Crude Monte Carlo (CMC) estimator of ϕt, that is,
\[
\widehat{\Phi}_t = \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{Z} \in \mathcal{Z}_t} \varphi(\mathbf{Z}) \qquad \text{for all } t = 1, \ldots, n.
\]

(b) The estimator P̂t is defined similarly to the one used in the Generalized Splitting (GS) algorithm of Botev and Kroese (2012). In particular, the GS product estimator is defined as follows. Define the level entrance probabilities r0 := 1 and rt := Pf(X ∈ Xt | X ∈ Xt−1) for t = 1, . . . , n, and note that Pf(X ∈ Xt) = ∏_{i=0}^{t} ri. Then, for t = 1, . . . , n, it holds that
\[
p_t = \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t) = \mathbb{P}_f(\mathbf{X} \in \mathcal{X}_{t-1}) - \mathbb{P}_f(\mathbf{X} \in \mathcal{X}_t) = \prod_{i=0}^{t-1} r_i - \prod_{i=0}^{t} r_i = (1 - r_t) \prod_{i=0}^{t-1} r_i.
\]


This suggests the estimator P̂t = (1 − R̂t) ∏_{i=0}^{t−1} R̂i for pt, where R̂0 := 1 and R̂t = |Yt| / |Xt| = |Xt \ Zt| / |Xt| for all t = 1, . . . , n.

In practice, obtaining the {Xt}1≤t≤n and {Zt}1≤t≤n sets requires the implementation of a sampling procedure from the conditional pdfs in (2) and (3). However, for many real-life applications, designing such a procedure can be extremely challenging. Nevertheless, we can use the Yt = Xt \ Zt set from iteration t to sample Xt+1 from ft+1 for each t = 1, . . . , n − 1, via MCMC. In particular, the particles from the Yt set can be "split", in order to construct the desired set Xt+1 for the next iteration, using a Markov transition kernel κt+1(· | ·) whose stationary pdf is ft+1, for each t = 1, . . . , n − 1. Algorithm 2 summarizes the general procedure for the SSA.

Algorithm 2: The SSA for estimating z = Ef[ϕ(X)]

input : A set X, a pdf f, the functions ϕ : X → R and S : X → R, a sequence of levels γ0, . . . , γn, and the sample size N ∈ N.
output: Ẑ, an estimator of z = Ef[ϕ(X)].
Set R̂0 ← 1, X1 ← ∅, and f1 ← f.
for i = 1 to N do
    draw X ∼ f1(x) and add X to X1.
for t = 1 to n do
    Set Zt ← {X ∈ Xt : X ∈ Zt} and Yt ← Xt \ Zt.
    Set Φ̂t ← (1 / |Zt|) ∑_{X ∈ Zt} ϕ(X).
    Set R̂t ← |Yt| / N, and P̂t ← (1 − R̂t) ∏_{j=0}^{t−1} R̂j.
    Set Ẑt ← Φ̂t P̂t.
    if t < n then
        /* Performing splitting to obtain Xt+1. */
        Set Xt+1 ← ∅ and draw Ki ∼ Bernoulli(0.5) for i = 1, . . . , |Yt|, such that ∑_{i=1}^{|Yt|} Ki = N mod |Yt|.
        for Yi ∈ Yt do
            Set Mi ← ⌊N / |Yt|⌋ + Ki and Xi,0 ← Yi.
            for j = 1 to Mi do
                Draw Xi,j ∼ κ^τ_{t+1}(· | Xi,j−1) (where κ^τ_{t+1}(· | ·) is a τ-step transition kernel using κ_{t+1}(· | ·)), and add Xi,j to Xt+1.
return Ẑ = ∑_{t=1}^{n} Ẑt.
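To make the procedure concrete, the following is a minimal, self-contained Python sketch of Algorithm 2 on an illustrative toy problem (f uniform on {0, 1}^k, S(x) = ∑ xi, ϕ = S). The bit-flip Metropolis kernel, the deterministic particle allocation (in place of the Bernoulli allocation above), and all names are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def ssa_estimate(S, phi, sample_f, mcmc_step, levels, N, burn_in, rng):
    """One run of the Stratified Splitting estimator Z_hat = sum_t Phi_t * P_t.

    levels = [gamma_0, ..., gamma_n]; super level-sets X_t = {x : S(x) >= gamma_t},
    strata Z_t = {x : gamma_{t-1} <= S(x) < gamma_t}.
    """
    n = len(levels) - 1
    X = [sample_f(rng) for _ in range(N)]            # samples from f_1 = f
    R_prod, Z_hat = 1.0, 0.0
    for t in range(1, n + 1):
        survives = [S(x) >= levels[t] for x in X]    # membership in X_t
        Y = [x for x, keep in zip(X, survives) if keep]
        Z_t = [x for x, keep in zip(X, survives) if not keep]   # stratum Z_t
        Phi_t = np.mean([phi(x) for x in Z_t]) if Z_t else 0.0
        R_t = len(Y) / len(X)
        P_t = (1.0 - R_t) * R_prod
        Z_hat += Phi_t * P_t
        R_prod *= R_t
        if t < n and Y:
            # splitting / rejuvenation: spread N new particles over the survivors
            counts = np.full(len(Y), N // len(Y))
            counts[: N % len(Y)] += 1                # deterministic allocation variant
            X = []
            for seed, m in zip(Y, counts):
                x = seed
                for _ in range(m):
                    for _ in range(burn_in):         # tau-step kernel targeting f_{t+1}
                        x = mcmc_step(x, levels[t], rng)
                    X.append(x)
        elif t < n:
            break                                    # no survivors: later strata contribute 0
    return Z_hat

# Illustrative toy problem: f uniform on {0,1}^k, S(x) = sum(x), phi = S, so z = k / 2.
k = 20

def sample_f(rng):
    return rng.integers(0, 2, size=k)

def mcmc_step(x, gamma, rng):
    """Single-bit-flip Metropolis step with stationary law uniform on {S >= gamma}."""
    y = x.copy()
    i = rng.integers(k)
    y[i] ^= 1
    return y if y.sum() >= gamma else x

rng = np.random.default_rng(1)
levels = [0, 5, 10, 15, k + 1]                       # gamma_0 < ... < gamma_n
z_hat = ssa_estimate(lambda x: x.sum(), lambda x: float(x.sum()),
                     sample_f, mcmc_step, levels, N=2000, burn_in=20, rng=rng)
print(z_hat)                                         # should be close to k / 2 = 10
```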


Remark 2.1 (The splitting step). The particular splitting step described in Algorithm 2 is a popular choice (Botev and Kroese, 2012), especially for hard problems with unknown convergence behavior of the corresponding Markov chain. However, one can also apply different splitting strategies. For example, we can choose a single element from Yt and use it to obtain all samples in the Xt+1 set. Note that, in this case, Xt+1 contains dependent samples. On the other hand, we might be interested in having independent samples in the Xt+1 set. To do so, one will generally perform additional runs of the SSA, and take a single element (from each SSA run) from Yt to produce a corresponding sample in the Xt+1 set. Such a strategy is clearly more expensive computationally. Under this setting, a single SSA run will require a computational effort that is proportional to that of Algorithm 2, squared. However, such an approach is beneficial for an analysis of the SSA's convergence. See also Remark A.1.

Theorem 2.1 (Unbiased estimator). Algorithm 2 outputs an unbiased estimator; that is, it holds that E[Ẑ] = Ef[ϕ(X)] = z.

Proof. See Appendix A.

An immediate consequence of Theorem 2.1 is that the SSA introduces an advantage over conventional SMC algorithms, which provide only consistent estimators.

We next proceed with a clarification of a few remaining practical issues regarding the SSA.

Determining the SSA levels. It is often difficult to make an educated guess how to set the values of the level thresholds. However, the SSA requires the values of {γt}1≤t≤n to be known in advance, in order to ensure that the estimator is unbiased. To resolve this issue, we perform a single pilot run of Algorithm 2 using a so-called rarity parameter 0 < ρ < 1. In particular, given samples from an Xt set, we take the ρ|Xt| performance quantile as the value of the corresponding level γt, and form the next level set. Such a pilot run helps to establish a set of threshold values adapted to the specific problem. After the completion of the pilot run, we simply continue with a regular execution of Algorithm 2 using the level threshold values observed in the pilot run.
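A minimal sketch of the pilot-run quantile rule just described; the function name is ours, and whether the chosen quantile retains or removes roughly a ρ-fraction of particles depends on the (super or sub) level-set convention used for S.

```python
import numpy as np

def next_level_from_pilot(pilot_scores, rho=0.1):
    """rho-quantile of S over the current particle set, used as the next threshold."""
    return np.quantile(np.asarray(pilot_scores, dtype=float), rho)

# e.g. gamma_next = next_level_from_pilot([S(x) for x in X_t], rho=0.1)
```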

Controlling the SSA error. A common practice when working with a Monte Carlo algorithm that outputs an unbiased estimator is to run it for R independent replications to obtain Ẑ^(1), . . . , Ẑ^(R), and report the average value. Thus, for a final estimator, we take
\[
\widehat{Z} = R^{-1} \sum_{j=1}^{R} \widehat{Z}^{(j)}.
\]

To measure the quality of the SSA output, we use the estimator's relative error (RE), which is defined by
\[
\mathrm{RE} = \frac{\sqrt{\operatorname{Var}\big(\widehat{Z}\big)}}{\mathbb{E}\big[\widehat{Z}\big]\sqrt{R}}.
\]

As the variance and expectation of the estimator are not known explicitly, we report an estimate of the relative error by estimating both terms from the results of the R runs.
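A short sketch of that plug-in computation, using the sample mean and sample variance of the R replications; the function name is illustrative.

```python
import numpy as np

def relative_error(replications):
    """Estimated RE of the averaged estimator, RE = sqrt(Var(Z)) / (E[Z] * sqrt(R))."""
    z = np.asarray(replications, dtype=float)
    return z.std(ddof=1) / (z.mean() * np.sqrt(len(z)))

# e.g. keep replicating until relative_error(z_runs) < 0.005   (0.5 % RE)
```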

In the following section, we establish efficiency results for our estimator by conducting an analysis common in the field of randomized algorithms.

3. Efficiency of the SSA

In this section, we present an analysis of the SSA under a set of very general assumptions. We start with a definition of a randomized algorithm's efficiency.

Definition 3.1 (Mitzenmacher and Upfal (2005)). A randomized algorithm gives an (ε, δ)-approximation for the value z if the output Ẑ of the algorithm satisfies
\[
\mathbb{P}\big( z(1-\varepsilon) \le \widehat{Z} \le z(1+\varepsilon) \big) \ge 1 - \delta.
\]

With the above definition in mind, we now aim to specify the sufficient conditions for the SSA to provide an (ε, δ)-approximation to z. The proof closely follows a technique that is used for the analysis of approximate counting algorithms. For an extensive overview, we refer to (Mitzenmacher and Upfal, 2005, Chapter 10). A key component in our analysis is to construct a Markov chain {X_t^(m), m ≥ 0} with stationary pdf ft (defined in (2)), for all 1 ≤ t ≤ n, and to consider the speed of convergence of the distribution of X_t^(m) as m increases. Let µt be the probability distribution corresponding to ft, so
\[
\mu_t(A) = \int_{A} f_t(\mathbf{u}) \,\lambda(\mathrm{d}\mathbf{u}),
\]
for all Borel sets A, where λ is some base measure, such as the Lebesgue or counting measure. To proceed, we have

\[
\kappa^{\tau}_t(A \mid \mathbf{x}) = \mathbb{P}\big(\mathbf{X}_t^{(\tau)} \in A \mid \mathbf{X}_t^{(0)} = \mathbf{x}\big),
\]
for the τ-step transition law of the Markov chain. Consider the total variation distance between κ^τ_t(· | x) and µt, defined as
\[
\big\| \kappa^{\tau}_t(\cdot \mid \mathbf{x}) - \mu_t \big\|_{\mathrm{TV}} = \sup_{A} \big| \kappa^{\tau}_t(A \mid \mathbf{x}) - \mu_t(A) \big|.
\]

An essential ingredient of our analysis is the so-called mixing time (see Roberts et al. (2004) and Levin et al. (2009) for an extensive overview), which is defined as τ_mix(ε, x) = min{τ : ‖κ^τ_t(· | x) − µt‖_TV ≤ ε}. Let µ̂t = κ^τ_t(· | x) be the SSA sampling distribution at steps 1 ≤ t ≤ n, where for simplicity we suppress x in the notation of µ̂t.

Finally, similar to µt and µ̂t, let νt be the probability distribution corresponding to the pdf gt (defined in (3)), and let ν̂t be the SSA sampling distribution, for all 1 ≤ t ≤ n. Theorem 3.1 details the main efficiency result for the SSA.

Theorem 3.1 (Efficiency of the SSA). Let ϕ be a strictly positive real-valued function, and let at = min_{x∈Zt}{ϕ(x)}, bt = max_{x∈Zt}{ϕ(x)}, and r̄t = min{rt, 1 − rt} for 1 ≤ t ≤ n. Then, the SSA gives an (ε, δ)-approximation to z = Ef[ϕ(X)], provided that for all 1 ≤ t ≤ n the following holds.

(a) The samples in the Xt set are independent and are distributed according to µ̂t, such that (for every x)
\[
\|\widehat{\mu}_t - \mu_t\|_{\mathrm{TV}} \le \frac{\varepsilon\, \bar{r}_t}{32 n} \qquad \text{and} \qquad |\mathcal{X}_t| \ge \frac{3072\, n^2 \ln(4 n^2/\delta)}{\varepsilon^2 \bar{r}_t^{\,2}}.
\]


(b) The samples in the Zt set are independent and are distributed according to ν̂t, such that (for every x)
\[
\|\widehat{\nu}_t - \nu_t\|_{\mathrm{TV}} \le \frac{\varepsilon\, a_t}{16 (b_t - a_t)} \qquad \text{and} \qquad |\mathcal{Z}_t| \ge \frac{128 (b_t - a_t)^2 \ln(4 n/\delta)}{\varepsilon^2 a_t^2}.
\]

Proof. See Appendix A.

In some cases, the distributions of the states in Xt and Zt generated by the Markov chain defined by the kernel κ^τ_t approach the target distributions µt and νt very fast. This occurs, for example, when there exists a polynomial in n (denoted by P(n)) such that the mixing time (Levin et al., 2009) is bounded by O(P(n)), (bt − at)²/at² = O(P(n)), and r̄t = O(1/P(n)) for all 1 ≤ t ≤ n. In this case, the SSA becomes a fully polynomial randomized approximation scheme (FPRAS) (Mitzenmacher and Upfal, 2005). In particular, the SSA results in a desired (ε, δ)-approximation to z = Ef[ϕ(X)] with running time bounded by a polynomial in n, ε⁻¹, and ln(δ⁻¹). Finally, it is important to note that an FPRAS algorithm for such problems is essentially the best result one can hope to achieve (Jerrum and Sinclair, 1996).

We next continue with a non-trivial example for which the SSA provides an FPRAS.

4. FPRAS for the weighted component model

We consider a system of k components. Each component i generates a specific amount of benefit, which is given by a positive real number wi, i = 1, . . . , k. In addition, each component can be operational or not.

Let w = (w1, . . . , wk)^⊤ be the column vector of component weights (benefits), and let x = (x1, . . . , xk)^⊤ be a binary column vector, for which xi indicates the ith component's operational status for 1 ≤ i ≤ k. That is, xi = 1 if component i is operational, and xi = 0 if it is not. Under this setting, we define the system performance as
\[
S(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^{k} w_i x_i = \mathbf{w}^{\top} \mathbf{x}.
\]

We further assume that all elements are independent of each other, and that each element is operational with probability 1/2 at any given time. For the above system definition, we might be interested in the following questions.

(a) Conditional expectation estimation. Given a minimal threshold performance γ ≤ ∑_{i=1}^{k} wi, what is the expected system performance? That is to say, we are interested in the calculation of
\[
\mathbb{E}\big[ S(\mathbf{w}, \mathbf{X}) \mid S(\mathbf{w}, \mathbf{X}) \le \gamma \big], \tag{5}
\]
where X is a k-dimensional binary vector generated uniformly at random from the set {0, 1}^k. This setting appears (in a more general form) in portfolio credit risk analysis (Glasserman and Li, 2005), and will be discussed in Section 5.

(b) Tail probability estimation (Asmussen and Glynn, 2007). Given the minimal threshold performance γ, what is the probability that the overall system performance is smaller than γ? In other words, we are interested in calculating
\[
\mathbb{P}(S(\mathbf{X}) \le \gamma) = \mathbb{E}\big[ \mathbf{1}\{S(\mathbf{X}) \le \gamma\} \big]. \tag{6}
\]

The above problems are both difficult, since the uniform generation of X ∈ {0, 1}^k such that w^⊤X ≤ γ corresponds to the knapsack problem, which belongs to the #P complexity class (Valiant, 1979; Morris and Sinclair, 2004).

In this section, we show how one can construct an FPRAS for both problems under the mild condition that the difference between the minimal and the maximal weight in the w vector is not large. This section's main result is summarized next.

Proposition 4.1. Given a weighted component model with k weights w = (w1, . . . , wk) and a threshold γ, let w̲ = min{w} and w̄ = max{w}. Then, provided that w̄ = O(P(k)) w̲, there exists an FPRAS for the estimation of both (5) and (6).

Prior to stating the proof of Proposition 4.1, define
\[
\mathcal{X}_b = \Big\{ \mathbf{x} \in \{0,1\}^k \,:\, \sum_{i=1}^{k} w_i x_i \le b \Big\} \qquad \text{for } b \in \mathbb{R}, \tag{7}
\]
and let µb be the uniform distribution on the Xb set. Morris and Sinclair (2004) introduce an MCMC algorithm that is capable of sampling from the Xb set almost uniformly at random. In particular, this algorithm can sample X ∼ µ̂b, such that ‖µ̂b − µb‖_TV ≤ ε. Moreover, the authors show that their Markov chain mixes rapidly; in particular, its mixing time is polynomial in k and is given by τ_mix(ε) = O(k^{9/2+ε}). Consequentially, the sampling from µ̂b can be performed in O(P(k)) time for any ε > 0.

We next proceed with the proof of Proposition 4.1, which is divided into two parts: the first for the conditional expectation estimation and the second for the tail probability evaluation.

Proof (Proposition 4.1: FPRAS for (5)). With the powerful result of Morris and Sinclair (2004) in hand, one can achieve a straightforward development of an FPRAS for the conditional expectation estimation problem. The proof follows immediately from Lemma A.3.

In particular, all we need to do in order to achieve an (ε, δ)-approximation to (5) is to generate
\[
m = \frac{(\bar{w} - \underline{w})^2 \ln(2/\delta)}{2 (\varepsilon/4)^2 \underline{w}^2}
\]
samples from µ̂_γ, such that
\[
\|\widehat{\mu}_{\gamma} - \mu_{\gamma}\|_{\mathrm{TV}} \le \frac{\varepsilon\, \underline{w}}{4 (\bar{w} - \underline{w})}.
\]
Recall that the mixing time is polynomial in k, and note that the number of samples m is also polynomial in k; thus the proof is complete, since
\[
m = \frac{(\bar{w} - \underline{w})^2 \ln(2/\delta)}{2 (\varepsilon/4)^2 \underline{w}^2} \;\underset{\bar{w} = \mathcal{O}(\mathrm{P}(k))\, \underline{w}}{=}\; \mathcal{O}(\mathrm{P}(k))\, \frac{\ln(2/\delta)}{\varepsilon^2}. \qquad \Box
\]

Of course, the development of an FPRAS for the above case did not use the SSA, as an appropriate choice of transition kernel was sufficient for the result. The tail probability estimation problem, however, is more involved, and can be solved via the use of the SSA.

Proof (Proposition 4.1: FPRAS for (6)). In order to put this problem into the SSA setting and achieve an FPRAS, a careful definition of the corresponding level sets is essential. In particular, the number of levels should be polynomial in k, and the level entrance probabilities {rt}1≤t≤n should not be too small. Fix
\[
n = \left\lfloor \frac{\big(\sum_{i=1}^{k} w_i\big) - \gamma}{\underline{w}} \right\rfloor
\]
to be the number of levels, and set γt = γ + (n − t) w̲ for t = 0, . . . , n. For general γ it holds that

\[
\begin{aligned}
\mathbb{E}\big[\mathbf{1}\{S(\mathbf{X}) \le \gamma\}\big]
&= \mathbb{E}\big[\mathbf{1}\{S(\mathbf{X}) \le \gamma\} \mid S(\mathbf{X}) \le \gamma + n\underline{w}\big]\, \mathbb{P}\big(S(\mathbf{X}) \le \gamma + n\underline{w}\big) \\
&\quad + \mathbb{E}\Big[\mathbf{1}\{S(\mathbf{X}) \le \gamma\} \,\Big|\, \gamma + n\underline{w} < S(\mathbf{X}) \le \sum_{i=1}^{k} w_i\Big]\, \mathbb{P}\Big(\gamma + n\underline{w} < S(\mathbf{X}) \le \sum_{i=1}^{k} w_i\Big) \\
&= \underbrace{\mathbb{E}\big[\mathbf{1}\{S(\mathbf{X}) \le \gamma\} \mid S(\mathbf{X}) \le \gamma + n\underline{w}\big]}_{(*)}\, \frac{2^k - 1}{2^k} \;+\; \mathbf{1}\Big\{\sum_{i=1}^{k} w_i \le \gamma\Big\}\, \frac{1}{2^k},
\end{aligned}
\]
where the last equality follows from the fact that there is only one vector, x = (1, 1, . . . , 1), for which γ + nw̲ < S(x) ≤ ∑_{i=1}^{k} wi. That is, it is sufficient to develop an efficient approximation to (∗) only, since the rest are constants.

We continue by defining the sets X = X_{γ0} ⊇ · · · ⊇ X_{γn} via (7), and by noting that for this particular problem our aim is to find P(X ∈ X_{γn}), so the SSA estimator simplifies into (see Section 2.2 (b))
\[
\widehat{Z} = \prod_{t=0}^{n} \widehat{R}_t.
\]

In order to show that the SSA provides an FPRAS, we will need to justify only condition (a) of Theorem 3.1, which is sufficient in our case because we are dealing with an indicator integrand. Recall that the formal requirement is
\[
\|\widehat{\mu}_t - \mu_t\|_{\mathrm{TV}} \le \frac{\varepsilon\, \bar{r}_t}{32 n} \qquad \text{and} \qquad |\mathcal{X}_t| \ge \frac{3072\, n^2 \ln(4 n^2/\delta)}{\varepsilon^2 \bar{r}_t^{\,2}},
\]

where µt is the uniform distribution on X_{γt} for t = 0, . . . , n, and each sample in Xt is distributed according to µ̂t. Finally, the FPRAS result is established by noting that the following holds.


(a) From Lemma A.6, we have that rt ≥ 1/(k + 1) for 1 ≤ t ≤ n.

(b) The sampling from µ̂t can be performed in polynomial (in k) time (Morris and Sinclair, 2004).

(c) The number of levels n (and thus the required sample sizes {|Xt|}1≤t≤n) is polynomial in k, since
\[
\frac{\big(\sum_{i=1}^{k} w_i\big) - \gamma}{\underline{w}} \le \frac{k\, \bar{w}}{\underline{w}} \;\underset{\bar{w} = \mathcal{O}(\mathrm{P}(k))\, \underline{w}}{=}\; \mathcal{O}(\mathrm{P}(k)). \qquad \Box
\]

Unfortunately, for many problems, an analytical result such as the one obtained in this section is not always possible to achieve. The aim of the following numerical section is to demonstrate that the SSA is capable of handling hard problems in the absence of theoretical performance guarantees.

5. Numerical experiments

5.1. Portfolio credit risk

We consider a portfolio credit risk setting (Glasserman and Li, 2005). Given a portfolio of k assets, the portfolio loss L is the random variable
\[
L = \sum_{i=1}^{k} l_i X_i, \tag{8}
\]

where li is the risk of asset i ∈ {1, . . . , k}, and Xi is an indicator random variable that models the default of asset i. Under this setting (and similar to Section 4), one is generally interested in the following.

(a) Conditional Value at Risk (CVaR). Given a threshold (value at risk) v, calculate the conditional value at risk c = E[L | L ≥ v].

(b) Tail probability estimation. Given the value at risk, calculate the tail probability P(L ≥ v) = E[1{L ≥ v}].

The SSA can be applied to both problems as follows. For tail probability estimation, we simply set ϕ(x) = 1{∑_{i=1}^{k} li xi ≥ v}. For the conditional expectation, we set ϕ(x) = ∑_{i=1}^{k} li xi.
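A small helper illustrating these two choices of integrand together with the loss-based performance function; the function and variable names are ours, and how the conditioning on {L ≥ v} is realized through the SSA strata is described in the text that follows.

```python
import numpy as np

def make_credit_risk_functions(l, v):
    """Loss S(x) = sum_i l_i x_i and the two integrands discussed above.

    phi_tail targets the tail probability P(L >= v); phi_loss, combined with the
    SSA strata above the level v, is used for the conditional expectation E[L | L >= v].
    """
    l = np.asarray(l, dtype=float)
    S = lambda x: float(l @ x)
    phi_tail = lambda x: 1.0 if l @ x >= v else 0.0
    phi_loss = lambda x: float(l @ x)
    return S, phi_tail, phi_loss
```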


Note that the tail probability estimation (for which the integrand is the indicator function) is a special case of general integration. Recall that the GS algorithm of Botev and Kroese (2012) works on indicator integrands, and thus GS is a special case of the SSA. Consequently, in this section we will investigate the more interesting (and more general) scenario of estimating an expectation conditional on a rare event.

As our working example, we consider credit risk in a Normal Copula model and, in particular, a 21-factor model from Glasserman and Li (2005) with 1,000 obligors.

The SSA setting is similar to the weighted component model from Section 4. We define a k-dimensional binary vector x = (x1, . . . , xk), for which xi stands for the ith asset default (xi = 1 for default, and 0 otherwise). We take the performance function S(x) to be the loss function (8). Then, the level sets are defined naturally by Xt = {x : S(x) ≥ γt} (see also Section 2.2). In our experiment, we set γ0 = 0 and γn = 1 + ∑_{i=1}^{k} li. In order to determine the remaining levels γ1, . . . , γn−1, we execute a pilot run of Algorithm 2 with N = 1,000 and ρ = 0.1. As an MCMC sampler, we use a Hit-and-Run algorithm (Kroese et al., 2011, Chapter 10, Algorithm 10.10), taking a new sample after 50 transitions.

It is important to note that despite the existence of several algorithms for estimating c, the SSA has an interesting feature that (to the best of our knowledge) is not present in other methods. Namely, one is able to obtain estimators for several CVaRs via a single SSA run. To see this, consider the estimation of c1, . . . , cs for s ≥ 1. Suppose that v1 ≤ · · · ≤ vs and note that it is sufficient to add these values to the {γt} (as additional levels), and to retain s copies of P̂t and Ẑt. In particular, during the SSA execution, we need to closely follow the γ levels, and as soon as we encounter a certain vj for 1 ≤ j ≤ s, we start to update the corresponding values of P̂_t^(j) and Ẑ_t^(j), in order to allow the corresponding estimation of cj = E[L | L ≥ vj]. Despite the fact that such a procedure introduces a dependence between the obtained estimators, they still remain unbiased.

To test the above setting, we perform the experiment with a view to estimating {cj}1≤j≤13 using the following values at risk:
\[
\{10000,\ 14000,\ 18000,\ 22000,\ 24000,\ 28000,\ 30000,\ 34000,\ 38000,\ 40000,\ 44000,\ 48000,\ 50000\}. \tag{9}
\]


The execution of the SSA pilot run with the addition of the desired VaRs (levels) from (9) (marked in bold) yields the following level values of (γ0, . . . , γ21):
\[
(0,\ 788.3,\ 9616.7,\ \mathbf{10000},\ \mathbf{14000},\ \mathbf{18000},\ \mathbf{22000},\ \mathbf{24000},\ \mathbf{28000},\ \mathbf{30000},\ \mathbf{34000},\ \mathbf{38000},\ \mathbf{40000},\ \mathbf{44000},\ 47557.6,\ \mathbf{48000},\ 49347.8,\ \mathbf{50000},\ 50320.6,\ 50477.4,\ 50500,\ \infty).
\]

Table 1 summarizes the results obtained by executing 1 pilot and 5 regular independent runs of the SSA. For each run, we use the parameter set that was specified for the pilot run (N = 1,000 and a burn-in of 50). The overall execution time (for all R = 1 + 5 = 6 independent runs) is 454 seconds. The SSA is very accurate. In particular, we obtain an RE of less than 1% for each c while employing a very modest effort.

v       c          RE       v       c          RE
10000   1.68×10^4  0.67 %   34000   3.80×10^4  0.05 %
14000   2.09×10^4  0.51 %   38000   4.11×10^4  0.05 %
18000   2.46×10^4  0.21 %   40000   4.26×10^4  0.08 %
22000   2.82×10^4  0.19 %   44000   4.56×10^4  0.08 %
24000   2.99×10^4  0.23 %   48000   4.86×10^4  0.02 %
28000   3.32×10^4  0.21 %   50000   5.01×10^4  0.02 %
30000   3.48×10^4  0.14 %

Table 1: The SSA results for the Normal copula credit risk model with 21 factors and 1,000 obligors.

The obtained result is especially appealing, since the corresponding estimation problem falls into a rare-event setting (Glasserman, 2004). That is, a CMC estimator will not be applicable in this case.

5.2. Linear regression – non-nested models

Table 2 summarizes a dataset from Williams (1959), in which observations from 42 specimens of radiata pine are considered. In particular, this dataset describes the maximum compression strength yi, the density xi, and the resin-adjusted density zi.

Similar to (Friel and Pettitt, 2008; Han and Carlin, 2001; Chib, 1995; Bartolucci and Scaccia, 2004), we compare the following two models:
\[
\begin{aligned}
M_1 &: \; y_i = \alpha + \beta(x_i - \bar{x}) + \epsilon_i, \qquad \epsilon_i \sim \mathsf{N}(0, \sigma^2), \\
M_2 &: \; y_i = \gamma + \delta(z_i - \bar{z}) + \eta_i, \qquad \eta_i \sim \mathsf{N}(0, \tau^2).
\end{aligned}
\]


i    yi    xi    zi      i    yi    xi    zi      i    yi    xi    zi
1    3040  29.2  25.4    15   2250  27.5  23.8    29   1670  22.1  21.3
2    2470  24.7  22.2    16   2650  25.6  25.3    30   3310  29.2  28.5
3    3610  32.3  32.2    17   4970  34.5  34.2    31   3450  30.1  29.2
4    3480  31.3  31.0    18   2620  26.2  25.7    32   3600  31.4  31.4
5    3810  31.5  30.9    19   2900  26.7  26.4    33   2850  26.7  25.9
6    2330  24.5  23.9    20   1670  21.1  20.0    34   1590  22.1  21.4
7    1800  19.9  19.2    21   2540  24.1  23.9    35   3770  30.3  29.8
8    3110  27.3  27.2    22   3840  30.7  30.7    36   3850  32.0  30.6
9    3670  32.3  29.0    23   3800  32.7  32.6    37   2480  23.2  22.6
10   2310  24.0  23.9    24   4600  32.6  32.5    38   3570  30.3  30.3
11   4360  33.8  33.2    25   1900  22.1  20.8    39   2620  29.9  23.8
12   1880  21.5  21.0    26   2530  25.3  23.1    40   1890  20.8  18.4
13   3670  32.2  29.0    27   2920  30.8  29.8    41   3030  33.2  29.4
14   1740  22.5  22.0    28   4990  38.9  38.1    42   3030  28.2  28.2

Table 2: The radiata pine dataset taken from Williams (1959). The explanatory variables are the density xi and the resin-adjusted density zi. The response variable yi stands for the maximum compression strength parallel to the grain.

We wish to perform a Bayesian model comparison of M1 and M2 via the SSA. Similar to the above references, we choose the prior

\[
\mathsf{N}\!\left( \begin{bmatrix} 3000 \\ 185 \end{bmatrix}, \begin{bmatrix} 10^6 & 0 \\ 0 & 10^4 \end{bmatrix} \right)
\]

for both (α, β)^⊤ and (γ, δ)^⊤, and an inverse-gamma prior with hyperparameters α = 3 and β = (2 · 300²)⁻¹ for both error variances σ² and τ². By performing numerical integration, it was found by O'Hagan (1995) that the Bayes factor B21 under this setting is equal to

\[
B_{21} = \frac{\mathbb{P}(\mathcal{D} \mid M_2)}{\mathbb{P}(\mathcal{D} \mid M_1)} = \frac{\int \mathbb{P}\big(\gamma, \delta, \tau^2 \mid M_2\big)\, \mathbb{P}\big(\mathcal{D} \mid \gamma, \delta, \tau^2, M_2\big)\, \mathrm{d}\gamma\, \mathrm{d}\delta\, \mathrm{d}\tau^2}{\int \mathbb{P}\big(\alpha, \beta, \sigma^2 \mid M_1\big)\, \mathbb{P}\big(\mathcal{D} \mid \alpha, \beta, \sigma^2, M_1\big)\, \mathrm{d}\alpha\, \mathrm{d}\beta\, \mathrm{d}\sigma^2} = 4862.
\]

Next, we run the SSA to obtain two estimators, ẐM1 and ẐM2, for P(D | M1) and P(D | M2). In this setting our initial density f is the prior, and we set our performance function S to be the likelihood function (alternatively, one could use the log-likelihood). Consequentially, the SSA approximation of the Bayes factor B21 is given by the ratio estimator B̂21 = ẐM2 / ẐM1.


Our experimental setting for the SSA algorithm is as follows. We use a sample size of N = 10,000 for each level set. For sampling, we use the random walk sampler for each ft, with σ² = 1000. In order to benchmark the SSA performance in the sense of closeness to the real B21 value, both ẐM1 and ẐM2 were estimated (via R independent replications) until the relative error for both was less than 0.5%. The levels used for M1 and M2, respectively, are
\[
\{0,\ 9.40\times 10^{-146},\ 1.88\times 10^{-134},\ 7.39\times 10^{-133},\ 1.66\times 10^{-132},\ \infty\}
\]
and
\[
\{0,\ 1.20\times 10^{-141},\ 2.77\times 10^{-130},\ 4.26\times 10^{-129},\ 7.83\times 10^{-129},\ \infty\}.
\]

To reach the predefined 0.5% RE, the SSA stopped after 258 iterations. The average estimates are 2.5123 × 10⁻¹³⁵ and 1.2213 × 10⁻¹³¹ for ẐM1 and ẐM2, respectively, giving B̂21 = ẐM2 / ẐM1 ≈ 4861. The associated 95% confidence interval (obtained via the Delta method) for z2/z1 is (4798.8, 4923.8).
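The interval above relies on the delta method for a ratio of means; a minimal sketch of that computation is given below, under the simplifying assumption (ours) that the two sets of replications are treated as independent and that the usual first-order variance expansion applies.

```python
import numpy as np

def ratio_delta_ci_95(z1_runs, z2_runs):
    """Approximate 95% delta-method interval for z2 / z1 from independent replications."""
    z1, z2 = np.asarray(z1_runs, float), np.asarray(z2_runs, float)
    m1, m2 = z1.mean(), z2.mean()
    v1, v2 = z1.var(ddof=1) / len(z1), z2.var(ddof=1) / len(z2)   # variances of the means
    ratio = m2 / m1
    se = abs(ratio) * np.sqrt(v1 / m1**2 + v2 / m2**2)            # first-order expansion
    return ratio - 1.96 * se, ratio + 1.96 * se
```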

Fig. 1: The leftmost plot shows the convergence of B̂21, as a function of the SSA iteration number R, to the true value of 4862. The rightmost plot shows the RE as a function of R for both ẐM1 and ẐM2.

In practice, however, we could stop the SSA execution much earlier. In particular, the leftmost plot in Fig. 1 clearly indicates convergence of B̂21 to a factor greater than 4000 after as few as 20 iterations. That is, by the well-known Jeffreys (1961) scale, Model 2 can be safely declared as superior, since B21 ≫ 10², indicating decisive evidence for the second model. It is worth noting that, under the above setting, our precision result is comparable with the power-posterior results obtained by Friel and Pettitt (2008), which outperform reversible jump MCMC for this problem.

We note that in the setting of estimating model evidence (and in Bayesian inference, as in the next section), the SSA bears some similarities to Nested Sampling, namely the use of a population of particles and sampling from increasing thresholds of the likelihood function. Indeed, our tests on various problems indicate that both methods perform similarly in terms of estimator variance; however, it is important to note that, unlike the SSA, Nested Sampling is not unbiased. In fact, as mentioned earlier, it is not even known to be consistent when MCMC sampling is used.

5.3. Bayesian inference

Suppose that a pdf h is known up to its normalization constant, that is, h ∝ L · f. For example, it is convenient to think of L · f as a likelihood multiplied by a prior, and of h as the corresponding posterior distribution. Under this setting, we are interested in the estimation of Eh[H(X)] for any H : X → R. Recall that the generic multilevel splitting framework, and in particular the SSA, provides the set of samples {Zt}1≤t≤n. These samples can be immediately used for an estimation of Eh[H(X)], via

\[
\frac{\sum_{r=1}^{R} \Big[ \sum_{t=1}^{n} \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} H(\mathbf{X})\, L(\mathbf{X}) \Big]}{\sum_{r=1}^{R} \Big[ \sum_{t=1}^{n} \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} L(\mathbf{X}) \Big]},
\]

since by the law of large numbers the above ratio converges (as R → ∞) to

\[
\frac{\mathbb{E}_f\Big[ \sum_{t=1}^{n} \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} H(\mathbf{X})\, L(\mathbf{X}) \Big]}{\mathbb{E}_f\Big[ \sum_{t=1}^{n} \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} L(\mathbf{X}) \Big]}
= \frac{\mathbb{E}_f[H(\mathbf{X})\, L(\mathbf{X})]}{\mathbb{E}_f[L(\mathbf{X})]}
= \frac{z_h \int H(\mathbf{x}) \frac{L(\mathbf{x}) f(\mathbf{x})}{z_h}\, \mathrm{d}\mathbf{x}}{z_h \int \frac{L(\mathbf{x}) f(\mathbf{x})}{z_h}\, \mathrm{d}\mathbf{x}}
= \frac{\int H(\mathbf{x})\, h(\mathbf{x})\, \mathrm{d}\mathbf{x}}{\int h(\mathbf{x})\, \mathrm{d}\mathbf{x}}
= \mathbb{E}_h[H(\mathbf{X})].
\]
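A sketch of this ratio estimator as a small Python helper; the data layout (a list of runs, each run a list of (P̂t, stratum samples) pairs) and all names are illustrative choices of ours, and in practice one would typically work with log-likelihoods to avoid underflow for values as small as those reported in Section 5.2.

```python
import numpy as np

def posterior_expectation(H, L, runs):
    """Ratio estimator of E_h[H(X)] from SSA output, where h is proportional to L * f.

    `runs` is a list of SSA runs; each run is a list of (P_t, Z_t_samples) pairs,
    Z_t_samples being the samples collected in stratum t of that run.
    """
    num = den = 0.0
    for run in runs:
        for P_t, samples in run:
            if not samples:
                continue
            Lw = np.array([L(x) for x in samples])
            Hw = np.array([H(x) for x in samples])
            num += P_t * np.mean(Hw * Lw)
            den += P_t * np.mean(Lw)
    return num / den
```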

We next consider SSA Bayesian inference applied to models M1 and M2 from Section 5.2. We first run the (random walk) Metropolis algorithm used by Han and Carlin (2001), and estimate the posterior mean of each parameter. In particular, for both M1 and M2, we start with an initial θ0 = (3000, 185, 300²), and for t > 0 we sample θt from a multivariate normal with mean µ = θt−1 and covariance matrix
\[
\Sigma = \begin{pmatrix} 5000 & 0 & 0 \\ 0 & 250 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
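A minimal sketch of this random-walk scheme, combining the proposal above with the standard Metropolis acceptance step spelled out after Fig. 2; `log_post` is a placeholder of ours for the unnormalized log-posterior of either model, and it is assumed to return -inf when a variance parameter leaves its support.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, cov, n_iter, rng):
    """Random-walk Metropolis with multivariate normal proposals.

    The acceptance probability min{1, p(theta')/p(theta)} is evaluated on the
    log scale via the user-supplied unnormalized log-posterior `log_post`.
    """
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chol = np.linalg.cholesky(cov)
    chain = np.empty((n_iter, len(theta)))
    for i in range(n_iter):
        prop = theta + chol @ rng.standard_normal(len(theta))
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# e.g. chain = random_walk_metropolis(log_post_M1, theta0=[3000, 185, 300**2],
#                                     cov=np.diag([5000, 250, 1]), n_iter=100_000,
#                                     rng=np.random.default_rng(0))
```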

Fig. 2: The leftmost and rightmost plots correspond to the convergence of the Metropolis algorithm for the M1 model (parameters α, β, σ²) and the M2 model (parameters γ, δ, τ²), respectively. The results were obtained using 10,000 MCMC samples after a burn-in period of 90,000.

The proposal θt is then accepted with probability min{1, p(θt)/p(θt−1)}, where p denotes the unnormalized posterior density function. Our experiment (see Fig. 2) indicates that the proposed Metropolis scheme has good mixing properties with respect to the slope and the intercept parameters when applied to both M1 and M2. However, the mixing of the σ² and the τ² parameters does not look sufficient. Even when we changed the Σ(3, 3) entry of the covariance matrix to 10¹, 10², and 10³ instead of 1, the typical mixing results did not change.

We next apply the SSA to the above inference problem. Using a single SSA run (with N = 20,000), we collected 100,000 samples for each model. The obtained results are summarized in Table 3.

                 M1                              M2
Algorithm        α        β      σ²              γ        δ      τ²
Metropolis       2993.2   184.2  1.246×10⁵       2992.5   182.9  1.025×10⁵
SSA              2992.3   184.2  1.136×10⁵       2991.9   183.2  7.840×10⁴

Table 3: Inference results for models M1 and M2 obtained via Metropolis and the SSA. The required CPU time for each model inference is about 16 seconds for each algorithm.


The Metropolis estimates of σ² and τ² are approximately equal to 1.246×10⁵ and 1.025×10⁵, respectively. In contrast, the SSA estimates of these quantities are 1.125×10⁵ and 7.797×10⁴, respectively (see Table 3). We conjecture that the SSA estimators of σ² and τ² are superior. In order to check, we implement a Gibbs sampler (90,000 samples, with a 10,000 sample burn-in) and obtain the following results, supporting our suspicions.

                 M1                              M2
Algorithm        α        β      σ²              γ        δ      τ²
Gibbs            2991.4   184.5  1.126×10⁵       2991.8   183.3  7.792×10⁴

Table 4: Inference results for models M1 and M2 obtained via Gibbs sampling.

The superior performance of the SSA using a random walk sampler over the standard random walk approach is most likely due to the SSA using a collection of Markov chain samplers at each stage instead of a single chain, and to the fact that the SSA sampling takes place on the prior as opposed to the posterior.

5.4. Self-avoiding walks

In this section, we consider random walks of length n on the two-dimensional lattice of integers, starting from the origin. In particular, we are interested in estimating the following quantities:

(a) cn: the number of self-avoiding walks (SAWs) of length n,

(b) ∆n: the expected distance of the final SAW coordinate to the origin.

To put these SAW problems into the SSA framework, define the set of directions X = {Left, Right, Up, Down}^n, and let f be the uniform pdf on X. Let ξ(x) denote the final coordinate of the random walk represented by the directions vector x. We have cn = Ef[1{X is SAW}] and ∆n = Ef[‖ξ(X)‖ | X is SAW].

Next, we let Xt ⊆ X be the set of all directions vectors that yield avalid self-avoiding walk of length at least t, for 0 ≤ t ≤ n. In addition, wedefine Zt to be the set of all directions vectors that yield a self-avoidingwalk of length (exactly) t, for 1 ≤ t ≤ n. The above gives the required

Page 23: Stratified Splitting for Efficient Monte Carlo Integration · PDF fileStratified Splitting for Efficient Monte Carlo Integration ... curse of dimensionality, ... Chapter 5), the

Stratified Splitting Algorithm 23

partition of X . Moreover, the simulation from ft(x) = f (x | x ∈Xt−1),reduces to the uniform selection of the SAW’s direction at time 1 ≤ t ≤ n.
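Two small helpers sketching the ingredients just described: checking how many initial steps of a direction vector remain self-avoiding, and extending a valid prefix by one uniformly chosen direction. The representation and names are ours.

```python
import numpy as np

STEPS = {"Left": (-1, 0), "Right": (1, 0), "Up": (0, 1), "Down": (0, -1)}

def saw_length(directions):
    """Number of initial steps that remain self-avoiding, and the coordinate reached."""
    pos, visited = (0, 0), {(0, 0)}
    for t, d in enumerate(directions):
        dx, dy = STEPS[d]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in visited:
            return t, pos          # the walk self-intersects at step t + 1
        visited.add(pos)
    return len(directions), pos

def extend_uniformly(prefix, rng):
    """Append one direction chosen uniformly at random to a prefix that is a valid SAW,
    sketching the per-step uniform direction selection described above."""
    return prefix + [rng.choice(list(STEPS))]
```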

Our experimental setting for SAWs of length n is as follows. We set the sample size of the SSA to be Nt = 1000 for all t = 1, . . . , n. In this experiment, we are interested in both the probability that X lies in Zn and the expected distance to the origin of a uniformly selected X ∈ Zn. These give us the required estimators of cn and ∆n, respectively. The leftmost plot of Fig. 3 summarizes the percent error (PE), which is defined by

\[
\mathrm{PE} = 100\, \frac{\widehat{c}_n - c_n}{c_n},
\]
where ĉn stands for the SSA estimator of cn.

Fig. 3: The PE (leftmost plot) and the number of independent runs divided by the SAW length (rightmost plot) of the SSA as a function of the SAW length n, for 3% and 1% RE.

In order to explore the convergence of the SSA estimates to the true quantity of interest, the SSA was executed a sufficient number of times to obtain 3% and 1% relative error (RE) (Rubinstein and Kroese, 2017), respectively. The exact cn values for n = 1, . . . , 71 were taken from (Guttmann and Conway, 2001; Jensen, 2004); naturally, when we allow a smaller RE, that is, when we increase R, the estimator converges to the true value cn, as can be observed in the leftmost plot of Fig. 3. In addition, the rightmost plot of Fig. 3 shows that, regardless of the RE, the required number of independent SSA runs (R) divided by the SAW length (n) grows linearly with n.


Finally, we investigate the SSA convergence by considering the following asymptotic property (Noonan, 1998):
\[
\mu = \lim_{n \to \infty} c_n^{1/n} \in [\underline{\mu}, \overline{\mu}] = [2.62002,\ 2.679192495].
\]

Fig. 4 summarizes our results compared to the theoretical bound. In particular, we run the SSA to achieve the 3% RE for 1 ≤ n ≤ 200. It can be clearly observed that the estimator ĉ_n^{1/n} converges toward the [µ̲, µ̄] interval as n grows.

Fig. 4: The estimate ĉ_n^{1/n} as a function of the SAW length n, shown against the bounds µ̲ and µ̄.

6. Discussion

In this paper we described a general sequential Monte Carlo procedure for multi-dimensional integration, the SSA, and applied it to various problems from different research domains. We showed that this method belongs to a very general class of SMC algorithms and developed its theoretical foundation. The proposed SSA is relatively easy to implement, and our numerical study indicates that the SSA yields good performance in practice. However, it is important to note that, generally speaking, the efficiency of the SSA and similar sequential algorithms is heavily dependent on the mixing time of the corresponding Markov chains that are used for sampling. A rigorous analysis of the mixing time for different problems is thus of great interest. Opportunities for future research include conducting a similar analysis for other SMC algorithms, such as the resample-move method. Finally, based on our numerical study, it will be interesting to apply the SSA to a further variety of problems that do not satisfy the conditions of the efficiency result in Theorem 3.1.

Acknowledgement

This work was supported by the Australian Research Council Centre of Excellence for Mathematical & Statistical Frontiers, under grant number CE140100049.

A. Technical arguments

Proof of Theorem 2.1. Recall that Z1, . . . , Zn is a partition of the set X, so from the law of total probability we have
\[
\mathbb{E}_f[\varphi(\mathbf{X})] = \sum_{t=1}^{n} \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]\, \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t).
\]

By the linearity of expectation, and since Ẑ = ∑_{t=1}^{n} Ẑt, it is sufficient to show that for all t = 1, . . . , n it holds that
\[
\mathbb{E}\big[\widehat{Z}_t\big] = \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]\, \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t).
\]

To see this, we need the following.

(a) Although the samples in the Zt set for 1 ≤ t ≤ n are not independent, due to the MCMC and splitting usage, they still have the same distribution; that is, for all t it holds that
\[
\mathbb{E}\bigg[ \sum_{\mathbf{X} \in \mathcal{Z}_t} \varphi(\mathbf{X}) \,\bigg|\, |\mathcal{Z}_t| \bigg] = |\mathcal{Z}_t|\, \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]. \tag{10}
\]

(b) From the unbiasedness of multilevel splitting (Kroese et al., 2011; Botev and Kroese, 2012), it holds for all 1 ≤ t ≤ n that
\[
\mathbb{E}\big[\widehat{P}_t\big] = \mathbb{E}\bigg[ \big(1 - \widehat{R}_t\big) \prod_{j=0}^{t-1} \widehat{R}_j \bigg] = \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t). \tag{11}
\]


Combining (10) and (11) with conditioning on the cardinalities of the Zt sets, we complete the proof with
\[
\begin{aligned}
\mathbb{E}\big[\widehat{Z}_t\big] &= \mathbb{E}\big[\widehat{\Phi}_t \widehat{P}_t\big] = \mathbb{E}\bigg[ \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} \varphi(\mathbf{X}) \bigg] \\
&= \mathbb{E}\Bigg[ \mathbb{E}\bigg[ \widehat{P}_t \frac{1}{|\mathcal{Z}_t|} \sum_{\mathbf{X} \in \mathcal{Z}_t} \varphi(\mathbf{X}) \,\bigg|\, |\mathcal{Z}_0|, \ldots, |\mathcal{Z}_t| \bigg] \Bigg] \\
&= \mathbb{E}\Bigg[ \widehat{P}_t \frac{1}{|\mathcal{Z}_t|}\, \mathbb{E}\bigg[ \sum_{\mathbf{X} \in \mathcal{Z}_t} \varphi(\mathbf{X}) \,\bigg|\, |\mathcal{Z}_0|, \ldots, |\mathcal{Z}_t| \bigg] \Bigg] \\
&\overset{(10)}{=} \mathbb{E}\bigg[ \widehat{P}_t \frac{1}{|\mathcal{Z}_t|}\, |\mathcal{Z}_t|\, \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t] \bigg] \\
&= \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]\, \mathbb{E}\big[\widehat{P}_t\big] \overset{(11)}{=} \mathbb{E}_f[\varphi(\mathbf{X}) \mid \mathbf{X} \in \mathcal{Z}_t]\, \mathbb{P}_f(\mathbf{X} \in \mathcal{Z}_t). \qquad \Box
\end{aligned}
\]

Proof of Theorem 3.1. The proof of this theorem consists of the following steps.

(a) In Lemma A.1, we prove that the existence of an (ε, δ/n)-approximation to {zt}1≤t≤n implies the existence of an (ε, δ)-approximation to z = ∑_{t=1}^{n} zt.

(b) In Lemma A.2, we prove that the existence of an (ε/4, δ/(2n))-approximation to {ϕt}1≤t≤n and {pt}1≤t≤n implies the existence of an (ε, δ/n)-approximation to {zt}1≤t≤n.

(c) In Lemmas A.3, A.4 and A.5, we provide the required (ε/4, δ/(2n))-approximations to ϕt and pt for 1 ≤ t ≤ n.

Lemma A.1. Suppose that for all t = 1, . . . , n an (ε, δ/n)-approximation to zt exists. Then,
\[
\mathbb{P}\big( z(1-\varepsilon) \le \widehat{Z} \le z(1+\varepsilon) \big) \ge 1 - \delta.
\]


Proof. From the assumption of the existence of the (ε, δ/n)-approximation to zt for each 1 ≤ t ≤ n, we have
\[
\mathbb{P}\big( \big|\widehat{Z}_t - z_t\big| \le \varepsilon z_t \big) \ge 1 - \frac{\delta}{n}, \qquad \text{and} \qquad \mathbb{P}\big( \big|\widehat{Z}_t - z_t\big| > \varepsilon z_t \big) < \frac{\delta}{n}.
\]
By using Boole's inequality (the union bound), we arrive at
\[
\mathbb{P}\big( \exists\, t : \big|\widehat{Z}_t - z_t\big| > \varepsilon z_t \big) \le \sum_{t=1}^{n} \mathbb{P}\big( \big|\widehat{Z}_t - z_t\big| > \varepsilon z_t \big) < n\, \frac{\delta}{n} = \delta;
\]
that is, it holds that
\[
\mathbb{P}\big( \forall\, t : \big|\widehat{Z}_t - z_t\big| \le \varepsilon z_t \big) = 1 - \mathbb{P}\big( \exists\, t : \big|\widehat{Z}_t - z_t\big| > \varepsilon z_t \big) \ge 1 - \delta,
\]
and hence,
\[
\mathbb{P}\Bigg( (1-\varepsilon) \sum_{t=1}^{n} z_t \le \sum_{t=1}^{n} \widehat{Z}_t \le (1+\varepsilon) \sum_{t=1}^{n} z_t \Bigg) = \mathbb{P}\big( z(1-\varepsilon) \le \widehat{Z} \le z(1+\varepsilon) \big) \ge 1 - \delta. \qquad \Box
\]

Lemma A.2. Suppose that for all t = 1, . . . , n there exists an (ε/4, δ/(2n))-approximation to ϕt and pt. Then,
\[
\mathbb{P}\big( z_t(1-\varepsilon) \le \widehat{Z}_t \le z_t(1+\varepsilon) \big) \ge 1 - \frac{\delta}{n} \qquad \text{for all } t = 1, \ldots, n.
\]

Proof. By assuming the existence of an (ε/4, δ/(2n))-approximation to ϕt and pt, namely
\[
\mathbb{P}\Bigg( \bigg| \frac{\widehat{\Phi}_t}{\varphi_t} - 1 \bigg| \le \varepsilon/4 \Bigg) \ge 1 - \frac{\delta}{2n}, \qquad \text{and} \qquad \mathbb{P}\Bigg( \bigg| \frac{\widehat{P}_t}{p_t} - 1 \bigg| \le \varepsilon/4 \Bigg) \ge 1 - \frac{\delta}{2n},
\]
and combining it with the union bound, we arrive at
\[
\mathbb{P}\Bigg( 1 - \varepsilon \underbrace{\le}_{(*)} \Big( 1 - \frac{\varepsilon/2}{2} \Big)^{2} \le \frac{\widehat{\Phi}_t \widehat{P}_t}{\varphi_t\, p_t} \le \Big( 1 + \frac{\varepsilon/2}{2} \Big)^{2} \underbrace{\le}_{(*)} 1 + \varepsilon \Bigg) \ge 1 - \frac{\delta}{n},
\]
where (∗) follows from the fact that for any 0 < |ε| < 1 and n ∈ N we have
\[
1 - \varepsilon \le \Big( 1 - \frac{\varepsilon/2}{n} \Big)^{n} \qquad \text{and} \qquad \Big( 1 + \frac{\varepsilon/2}{n} \Big)^{n} \le 1 + \varepsilon. \tag{12}
\]
To see that (12) holds, note that by using exponential inequalities from Bullen (1998), we have that (1 − (ε/2)/n)^n ≥ 1 − ε/2 ≥ 1 − ε. In addition, it holds that |e^ε − 1| < 7ε/4, and hence
\[
\Big( 1 + \frac{\varepsilon/2}{n} \Big)^{n} \le \mathrm{e}^{\varepsilon/2} \le 1 + \frac{7(\varepsilon/2)}{4} \le 1 + \varepsilon. \qquad \Box
\]

To complete the proof of Theorem 3.1, we need to provide (ε/4, δ/(2n))-approximations to both ϕt and pt. However, at this stage we have to take into account a specific splitting strategy, since the SSA sample size bounds depend on the latter. Here we examine the independent setting, for which the samples in each {Xt}1≤t≤n set are independent. That is, we use multiple runs of the SSA at each stage (t = 1, . . . , n) of the algorithm execution. See Remark 2.1 for further details.

Remark A.1 (Alternative splitting mechanisms). Our choice of applying the independent setting does not impose a serious limitation from a theoretical time-complexity point of view, and is more convenient for the analysis. In particular, when dealing with the independent setting, we can apply powerful concentration inequalities (Chernoff, 1952; Hoeffding, 1963). Alternatively, one could compromise the independence and use Hoeffding-type inequalities for dependent random variables, such as the ones proposed in (Glynn and Ormoneit, 2002; Paulin, 2015).

The key to obtaining the desired approximation results is summarized in Lemma A.3.

Lemma A.3. Let X ∼ π̂ be a strictly positive univariate random variable such that a ≤ X ≤ b, and let X1, . . . , Xm be its independent realizations. Then, provided that
\[
\|\widehat{\pi} - \pi\|_{\mathrm{TV}} \le \frac{\varepsilon\, a}{4(b-a)}, \qquad \text{and} \qquad m \ge \frac{(b-a)^2 \ln(2/\delta)}{2 (\varepsilon/4)^2 a^2},
\]
it holds that
\[
\mathbb{P}\Bigg( (1-\varepsilon)\, \mathbb{E}_{\pi}[X] \le \frac{1}{m} \sum_{i=1}^{m} X_i \le (1+\varepsilon)\, \mathbb{E}_{\pi}[X] \Bigg) \ge 1 - \delta.
\]


Proof. Recall that
\[
\|\widehat{\pi} - \pi\|_{\mathrm{TV}} = \frac{1}{b-a} \sup_{\varphi : \mathbb{R} \to [a,b]} \bigg| \int \varphi(x)\, \widehat{\pi}(\mathrm{d}x) - \int \varphi(x)\, \pi(\mathrm{d}x) \bigg|
\]
for any function ϕ : R → [a, b] (Proposition 3 in Roberts et al. (2004)). Hence,
\[
\mathbb{E}_{\pi}[X] - (b-a)\, \frac{\varepsilon a}{4(b-a)} \le \mathbb{E}_{\widehat{\pi}}[X] \le \mathbb{E}_{\pi}[X] + (b-a)\, \frac{\varepsilon a}{4(b-a)}.
\]
Combining this with the fact that X ≥ a, we arrive at
\[
1 - \frac{\varepsilon}{4} \le 1 - \frac{\varepsilon a}{4\, \mathbb{E}_{\pi}[X]} \le \frac{\mathbb{E}_{\widehat{\pi}}[X]}{\mathbb{E}_{\pi}[X]} \le 1 + \frac{\varepsilon a}{4\, \mathbb{E}_{\pi}[X]} \le 1 + \frac{\varepsilon}{4}. \tag{13}
\]

Next, since
$$\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m} X_i\right] = \mathbb{E}_{\widetilde{\pi}}[X],$$

we can apply the Hoeffding (1963) inequality to obtain
$$\mathbb{P}\left(1 - \frac{\varepsilon}{4} \le \frac{\frac{1}{m}\sum_{i=1}^{m} X_i}{\mathbb{E}_{\widetilde{\pi}}[X]} \le 1 + \frac{\varepsilon}{4}\right) \ge 1 - \delta, \tag{14}$$

for
$$m \ge \frac{(b-a)^2 \ln(2/\delta)}{2(\varepsilon/4)^2 a^2} \ge \frac{(b-a)^2 \ln(2/\delta)}{2(\varepsilon/4)^2 \left(\mathbb{E}_{\widetilde{\pi}}[X]\right)^2},$$

where the second inequality holds because $\mathbb{E}_{\widetilde{\pi}}[X] \ge a$.

Finally, we complete the proof by combining (13) and (14), to obtain
$$\mathbb{P}\left(1 - \varepsilon \le \left(1 - \frac{\varepsilon/2}{2}\right)^{2} \le \frac{\mathbb{E}_{\widetilde{\pi}}[X]}{\mathbb{E}_{\pi}[X]} \cdot \frac{\frac{1}{m}\sum_{i=1}^{m} X_i}{\mathbb{E}_{\widetilde{\pi}}[X]} \le \left(1 + \frac{\varepsilon/2}{2}\right)^{2} \le 1 + \varepsilon\right)$$
$$= \mathbb{P}\left((1-\varepsilon)\,\mathbb{E}_{\pi}[X] \le \frac{1}{m}\sum_{i=1}^{m} X_i \le (1+\varepsilon)\,\mathbb{E}_{\pi}[X]\right) \ge 1 - \delta. \qquad \Box$$
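To illustrate the sample-size requirement of Lemma A.3, the following sketch (an illustrative check only; the helper name is ours, and the uniform distribution is a stand-in for which $\widetilde{\pi} = \pi$, so the total-variation condition holds trivially) computes the bound on $m$ and verifies the resulting coverage empirically.

```python
import math
import numpy as np

def lemma_a3_sample_size(eps, delta, a, b):
    # Lemma A.3: m >= (b - a)^2 ln(2/delta) / (2 (eps/4)^2 a^2).
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * (eps / 4) ** 2 * a ** 2))

rng = np.random.default_rng(0)
eps, delta, a, b = 0.1, 0.05, 1.0, 3.0
m = lemma_a3_sample_size(eps, delta, a, b)

# Empirical check with pi = tilde(pi) = Uniform(a, b).  The Hoeffding-based bound
# is conservative, so the observed coverage is typically much higher than 1 - delta.
true_mean = (a + b) / 2.0
sample_means = rng.uniform(a, b, size=(200, m)).mean(axis=1)
coverage = np.mean(np.abs(sample_means - true_mean) <= eps * true_mean)
print(m, coverage)    # coverage should be at least 1 - delta = 0.95
```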

Remark A.2 (Lemma A.3 for binary random variables). For a binary random variable $X \in \{0,1\}$ with a known lower bound on its mean, Lemma A.3 can be strengthened by using the Chernoff (1952) bound instead of the Hoeffding (1963) inequality. In particular, the following holds.

Let $X \sim \widetilde{\pi}(x)$ be a binary random variable and let $X_1, \ldots, X_m$ be its independent realizations. Then, provided that $\mathbb{E}_{\pi}[X] \ge \mathbb{E}'_{\pi}[X]$,
$$\|\widetilde{\pi} - \pi\|_{\mathrm{TV}} \le \frac{\varepsilon\, \mathbb{E}'_{\pi}[X]}{4}, \quad \text{and} \quad m \ge \frac{3 \ln(2/\delta)}{(\varepsilon/4)^2 \left(\mathbb{E}'_{\pi}[X]\right)^2},$$

it holds that:
$$\mathbb{P}\left((1-\varepsilon)\,\mathbb{E}_{\pi}[X] \le \frac{1}{m}\sum_{i=1}^{m} X_i \le (1+\varepsilon)\,\mathbb{E}_{\pi}[X]\right) \ge 1 - \delta.$$

The corresponding proof is almost identical to the one presented in Lemma A.3. The major difference is the bound on the sample size in (14), which is obtained via the Chernoff bound of (Mitzenmacher and Upfal, 2005, Theorem 10.1) instead of Hoeffding's inequality.
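As a worked illustration of Remark A.2 (the helper name and parameter values are ours), the sketch below computes the required sample size for a binary variable and checks the coverage empirically with $\widetilde{\pi} = \pi = \mathrm{Bernoulli}(p)$, so that the total-variation condition is again trivially satisfied.

```python
import math
import numpy as np

def remark_a2_sample_size(eps, delta, mean_lb):
    # Remark A.2: m >= 3 ln(2/delta) / ((eps/4)^2 * mean_lb^2).
    return math.ceil(3 * math.log(2 / delta) / ((eps / 4) ** 2 * mean_lb ** 2))

rng = np.random.default_rng(1)
eps, delta, p = 0.2, 0.05, 0.1            # Bernoulli mean p, used here as its own lower bound
m = remark_a2_sample_size(eps, delta, p)

# Each binomial count corresponds to one batch of m independent Bernoulli(p) samples.
counts = rng.binomial(m, p, size=10_000)
coverage = np.mean(np.abs(counts / m - p) <= eps * p)
print(m, coverage)                        # coverage should be at least 1 - delta = 0.95
```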

Lemma A.4. Suppose that $a_t = \min_{x \in \mathcal{Z}_t} \{\varphi(x)\}$ and $b_t = \max_{x \in \mathcal{Z}_t} \{\varphi(x)\}$ for all $t = 1, \ldots, n$. Then, provided that the samples in the $\mathcal{Z}_t$ set are independent and are distributed according to $\widetilde{\nu}_t$ such that
$$\|\widetilde{\nu}_t - \nu_t\|_{\mathrm{TV}} \le \frac{\varepsilon\, a_t}{16(b_t - a_t)}, \quad \text{and} \quad |\mathcal{Z}_t| \ge \frac{128\,(b_t - a_t)^2 \ln(4n/\delta)}{\varepsilon^2 a_t^2},$$

then $\Phi_t = |\mathcal{Z}_t|^{-1} \sum_{X \in \mathcal{Z}_t} \varphi(X)$ is an $\left(\frac{\varepsilon}{4}, \frac{\delta}{2n}\right)$-approximation to $\varphi_t$.

Proof. The proof is an immediate consequence of Lemma A.3. In particular, note that
$$\|\widetilde{\nu}_t - \nu_t\|_{\mathrm{TV}} \le \frac{\frac{\varepsilon}{4}\, a_t}{4(b_t - a_t)} = \frac{\varepsilon\, a_t}{16(b_t - a_t)},$$

and that
$$|\mathcal{Z}_t| \ge \frac{(b_t - a_t)^2 \ln\!\left(2 \big/ \frac{\delta}{2n}\right)}{2\left(\frac{\varepsilon/4}{4}\right)^2 a_t^2} = \frac{128\,(b_t - a_t)^2 \ln(4n/\delta)}{\varepsilon^2 a_t^2}. \qquad \Box$$
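The constant 128 in Lemma A.4 arises from substituting $\varepsilon/4$ and $\delta/2n$ into the sample-size bound of Lemma A.3; the following sketch (our own check, with hypothetical helper names) verifies this algebra numerically for several parameter choices.

```python
import math
import random

def m_lemma_a3(eps, delta, a, b):
    # Lemma A.3: m >= (b - a)^2 ln(2/delta) / (2 (eps/4)^2 a^2).
    return (b - a) ** 2 * math.log(2 / delta) / (2 * (eps / 4) ** 2 * a ** 2)

def m_lemma_a4(eps, delta, n, a, b):
    # Lemma A.4: |Z_t| >= 128 (b - a)^2 ln(4n/delta) / (eps^2 a^2).
    return 128 * (b - a) ** 2 * math.log(4 * n / delta) / (eps ** 2 * a ** 2)

random.seed(0)
for _ in range(5):
    eps, delta, n = random.uniform(0.01, 0.5), random.uniform(0.01, 0.5), random.randint(1, 50)
    a = random.uniform(0.1, 1.0)
    b = a + random.uniform(0.1, 5.0)
    # Substituting (eps/4, delta/(2n)) into the Lemma A.3 bound reproduces the Lemma A.4 bound.
    assert math.isclose(m_lemma_a3(eps / 4, delta / (2 * n), a, b), m_lemma_a4(eps, delta, n, a, b))
print("constants check out")
```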

Lemma A.5. Suppose that the samples in the $\mathcal{X}_t$ set are independent and are distributed according to $\widetilde{\mu}_t$, such that
$$\|\widetilde{\mu}_t - \mu_t\|_{\mathrm{TV}} \le \frac{\varepsilon\, \bar{r}_t}{32n}, \quad \text{and} \quad |\mathcal{X}_t| \ge \frac{3072\, n^2 \ln(4n^2/\delta)}{\varepsilon^2 \bar{r}_t^{\,2}},$$

where $\bar{r}_t = \min\{r_t, 1 - r_t\}$ for $1 \le t \le n$. Then, $P_t$ is an $\left(\frac{\varepsilon}{4}, \frac{\delta}{2n}\right)$-approximation to $p_t$.

Proof. Recall that $P_t = \left(1 - R_t\right) \prod_{j=0}^{t-1} R_j$ for $t = 1, \ldots, n$. Again, by combining the union bound with (12), we conclude that the desired approximation to $p_t$ can be obtained by deriving $\left(\frac{\varepsilon}{8n}, \frac{\delta}{2n^2}\right)$-approximations for each $r_t$ and $1 - r_t$. In this case, for each $t = 1, \ldots, n$, the ratio $R_t/r_t$ satisfies $1 - \varepsilon/8n \le R_t/r_t \le 1 + \varepsilon/8n$ with probability at least $1 - \delta/2n^2$. The same holds for $(1 - R_t)/(1 - r_t)$, and thus we arrive at
$$\mathbb{P}\left(1 - \varepsilon/4 \le \left(1 - \frac{(\varepsilon/4)/2}{n}\right)^{n} \le \frac{P_t}{p_t} \le \left(1 + \frac{(\varepsilon/4)/2}{n}\right)^{n} \le 1 + \varepsilon/4\right) \ge 1 - \delta/2n.$$

The bounds for each $R_t$ and $(1 - R_t)$ are easily achieved via Remark A.2. In particular, it is not very hard to verify that, in order to get an $\left(\frac{\varepsilon}{8n}, \frac{\delta}{2n^2}\right)$-approximation, it is sufficient to take
$$\|\widetilde{\mu}_t - \mu_t\|_{\mathrm{TV}} \le \frac{\frac{\varepsilon}{8n}\, \bar{r}_t}{4} = \frac{\varepsilon\, \bar{r}_t}{32n},$$

and
$$|\mathcal{X}_t| \ge \frac{3 \ln\!\left(2 \big/ \frac{\delta}{2n^2}\right)}{\left(\frac{\varepsilon/(8n)}{4}\right)^2 \bar{r}_t^{\,2}} = \frac{3072\, n^2 \ln(4n^2/\delta)}{\varepsilon^2 \bar{r}_t^{\,2}}. \qquad \Box$$
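For a practical sense of these requirements, the sketch below (a hypothetical helper, not taken from the authors' code) evaluates the per-stratum sample sizes demanded by Lemmas A.4 and A.5 for given accuracy parameters. These worst-case bounds are conservative by construction.

```python
import math

def ssa_sample_sizes(eps, delta, n, a_t, b_t, r_t):
    """Per-stratum sample sizes required by Lemmas A.4 and A.5 (illustrative helper)."""
    r_bar = min(r_t, 1.0 - r_t)
    size_Z = 128 * (b_t - a_t) ** 2 * math.log(4 * n / delta) / (eps ** 2 * a_t ** 2)   # Lemma A.4
    size_X = 3072 * n ** 2 * math.log(4 * n ** 2 / delta) / (eps ** 2 * r_bar ** 2)     # Lemma A.5
    return math.ceil(size_Z), math.ceil(size_X)

# Example: n = 20 strata, 10% relative error with 95% confidence.
print(ssa_sample_sizes(eps=0.1, delta=0.05, n=20, a_t=0.5, b_t=2.0, r_t=0.3))
```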

Lemma A.6. Suppose without loss of generality that $w = (w_1, \ldots, w_k)$ satisfies $w_1 \le w_2 \le \cdots \le w_k$; that is, $w_1 = \min_j w_j$. Then, for the $\mathcal{X}_b$ and $\mathcal{X}_{b-w_1}$ sets defined via (7), it holds that:
$$r = \frac{|\mathcal{X}_{b-w_1}|}{|\mathcal{X}_b|} \ge \frac{1}{k+1}.$$

Proof. For any $b \in \mathbb{R}$ and $x = (x_1, \ldots, x_k)$, define a partition of $\mathcal{X}_b$ via
$$\mathcal{X}^{(w_1)}_b = \{x \in \mathcal{X}_b : x_1 = 1\}, \quad \text{and} \quad \mathcal{X}^{(-w_1)}_b = \{x \in \mathcal{X}_b : x_1 = 0\}.$$

Then, the following holds.

(a) For any $x \in \mathcal{X}^{(w_1)}_b$, replace $x_1 = 1$ with $x_1 = 0$, and note that the resulting vector is in the $\mathcal{X}^{(-w_1)}_{b-w_1}$ set, since its performance is at most $b - w_1$; that is, $\left|\mathcal{X}^{(w_1)}_b\right| \le \left|\mathcal{X}^{(-w_1)}_{b-w_1}\right|$. Similarly, for any $x \in \mathcal{X}^{(-w_1)}_{b-w_1}$, setting $x_1 = 1$ instead of $x_1 = 0$ results in a vector which belongs to the $\mathcal{X}^{(w_1)}_b$ set. That is,
$$\left|\mathcal{X}^{(w_1)}_b\right| = \left|\mathcal{X}^{(-w_1)}_{b-w_1}\right|. \tag{15}$$

(b) For any $x \in \mathcal{X}^{(w_1)}_b$, replace $x_1 = 1$ with $x_1 = 0$ and note that the resulting vector is now in the $\mathcal{X}^{(-w_1)}_b$ set; that is, $\left|\mathcal{X}^{(w_1)}_b\right| \le \left|\mathcal{X}^{(-w_1)}_b\right|$. In addition, every non-zero $x \in \mathcal{X}^{(-w_1)}_b$ can be mapped into the $\mathcal{X}^{(w_1)}_b$ set by setting $x_1 = 1$ and replacing one of its non-zero entries with zero, and each element of $\mathcal{X}^{(w_1)}_b$ arises in this way from at most $k - 1$ vectors of $\mathcal{X}^{(-w_1)}_b$. That is, $\left|\mathcal{X}^{(-w_1)}_b\right| \le (k-1)\left|\mathcal{X}^{(w_1)}_b\right| + 1$ (where the $+1$ accounts for the vector of zeros), and we arrive at
$$\left|\mathcal{X}^{(w_1)}_b\right| \le \left|\mathcal{X}^{(-w_1)}_b\right| \le (k-1)\left|\mathcal{X}^{(w_1)}_b\right| + 1 \le k\left|\mathcal{X}^{(w_1)}_b\right|. \tag{16}$$

Combining (15) and (16), we complete the proof by noting that
$$\frac{|\mathcal{X}_{b-w_1}|}{|\mathcal{X}_b|} = \frac{\left|\mathcal{X}^{(w_1)}_{b-w_1}\right| + \left|\mathcal{X}^{(-w_1)}_{b-w_1}\right|}{\left|\mathcal{X}^{(w_1)}_b\right| + \left|\mathcal{X}^{(-w_1)}_b\right|} \ge \frac{\left|\mathcal{X}^{(-w_1)}_{b-w_1}\right|}{\left|\mathcal{X}^{(w_1)}_b\right| + \left|\mathcal{X}^{(-w_1)}_b\right|} \underbrace{=}_{(15)} \frac{\left|\mathcal{X}^{(w_1)}_b\right|}{\left|\mathcal{X}^{(w_1)}_b\right| + \left|\mathcal{X}^{(-w_1)}_b\right|} \underbrace{\ge}_{(16)} \frac{\left|\mathcal{X}^{(w_1)}_b\right|}{\left|\mathcal{X}^{(w_1)}_b\right| + k\left|\mathcal{X}^{(w_1)}_b\right|} = \frac{1}{k+1}. \qquad \Box$$
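Lemma A.6 can be spot-checked by exhaustive enumeration on small instances. The sketch below assumes, as the proof suggests and as in the 0-1 knapsack setting of Morris and Sinclair (2004), that $\mathcal{X}_b = \{x \in \{0,1\}^k : \sum_j w_j x_j \le b\}$; this reading of the definition in (7), which appears earlier in the paper, is our assumption.

```python
from itertools import product
import random

def count_solutions(w, b):
    """Brute-force count of binary vectors x with sum_j w[j] * x[j] <= b."""
    return sum(1 for x in product((0, 1), repeat=len(w))
               if sum(wj * xj for wj, xj in zip(w, x)) <= b)

random.seed(1)
for _ in range(200):
    k = random.randint(1, 8)
    w = sorted(random.uniform(0.1, 5.0) for _ in range(k))   # w[0] plays the role of w_1
    b = random.uniform(w[0], sum(w))
    r = count_solutions(w, b - w[0]) / count_solutions(w, b)
    assert r >= 1.0 / (k + 1), (k, r)
print("ratio bound verified on random instances")
```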

References

Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (3), 269–342.

Asmussen, S. and P. W. Glynn (2007). Stochastic Simulation: Algorithms and Analysis. Applications of Mathematics. Springer Science and Business Media, LLC.


Bartolucci, F. and L. Scaccia (2004). A new approach for estimating the Bayes factor. Technical report, University di Perugia.

Botev, Z. I. and D. P. Kroese (2012). Efficient Monte Carlo simulation via the Generalized Splitting method. Statistics and Computing 22, 1–16.

Bullen, P. (1998). A Dictionary of Inequalities. Monographs and Research Notes in Mathematics. Oxfordshire: Taylor & Francis.

Chen, Y., J. Xie, and J. S. Liu (2005). Stopping-time resampling for sequential Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 199–217.

Chernoff, H. (1952). A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations. Ann. Math. Statist. 23 (4), 493–507.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90 (432), 1313–1321.

Chopin, N. and C. P. Robert (2010). Properties of nested sampling. Biometrika 97 (3), 741–755.

Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (3), 411–436.

Evans, M. (2007). Bayesian Statistics 8, Chapter: Discussion of Nested sampling for Bayesian computations by John Skilling, pp. 491–524. New York: Oxford University Press.

Feroz, F. and J. Skilling (2013). Exploring multi-modal distributions with nested sampling. AIP Conference Proceedings 1553 (1), 106–113.

Forsythe, G. E., M. A. Malcolm, and C. B. Moler (1977). Computer methods for mathematical computations. Prentice-Hall series in automatic computation. Englewood Cliffs (N.J.): Prentice-Hall.

Friedman, H. (1980). A consistent Fubini-Tonelli theorem for nonmeasurable functions. Illinois J. Math. 24 (3), 390–395.


Friel, N. and A. N. Pettitt (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (3), 589–607.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (2003). Bayesian Data Analysis (3 ed.). Oxfordshire: Taylor & Francis.

Gilks, W. R. and C. Berzuini (2001). Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (1), 127–146.

Glasserman, P. (2004). Monte Carlo methods in financial engineering. Applications of mathematics. New York: Springer.

Glasserman, P. and J. Li (2005). Importance sampling for portfolio credit risk. Management Science 51 (11), 1643–1656.

Glynn, P. W. and D. Ormoneit (2002). Hoeffding's inequality for uniformly ergodic Markov chains. Statistics & Probability Letters 56 (2), 143–146.

Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F 140 (2), 107–113.

Guttmann, A. and A. Conway (2001). Square lattice self-avoiding walks and polygons. Annals of Combinatorics 5 (3), 319–345.

Han, C. and B. P. Carlin (2001). Markov Chain Monte Carlo methods for computing Bayes factors: A comparative review. Journal of the American Statistical Association 96, 1122–1132.

Heiss, F. and V. Winschel (2008). Likelihood approximation by numerical integration on sparse grids. Journal of Econometrics 144 (1), 62–80.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), 13–30.

Hooper, M. (2013). Richard Price, Bayes' Theorem, and God. Significance 10 (1), 36–39.


Jeffreys, H. (1961). Theory of Probability (3 ed.). Oxford, England: Oxford University Press.

Jensen, I. (2004). Enumeration of self-avoiding walks on the square lattice. Journal of Physics A: Mathematical and General 37 (21), 5503.

Jerrum, M. and A. Sinclair (1996). The Markov chain Monte Carlo method: an approach to approximate counting and integration. In D. Hochbaum (Ed.), Approximation Algorithms for NP-hard Problems, Boston, pp. 482–520. PWS Publishing.

Jerrum, M., L. G. Valiant, and V. V. Vazirani (1986). Random Generation of Combinatorial Structures from a Uniform Distribution. Theor. Comput. Sci. 43, 169–188.

Kahn, H. and T. E. Harris (1951). Estimation of particle Transmission by Random Sampling. National Bureau of Standards Applied Mathematics Series 12, 27–30.

Koller, D. and N. Friedman (2009). Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. Cambridge, Massachusetts: The MIT Press.

Kroese, D. P., T. Taimre, and Z. I. Botev (2011). Handbook of Monte Carlo methods. New York: John Wiley and Sons.

Lee, P. M. (2004). Bayesian Statistics - An Introduction. London: Arnold.

Levin, D. A., Y. Peres, and E. L. Wilmer (2009). Markov chains and mixing times. Providence, R.I.: American Mathematical Society. With a chapter on coupling from the past by James G. Propp and David B. Wilson.

McGrayne, S. (2011). The Theory that Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy. London: Yale University Press.

Mitzenmacher, M. and E. Upfal (2005). Probability and computing: randomized algorithms and probabilistic analysis. New York: Cambridge University Press.


Morokoff, W. J. and R. E. Caflisch (1995). Quasi-Monte Carlo Integration. Journal of Computational Physics 122 (2), 218–230.

Morris, B. and A. Sinclair (2004). Random walks on truncated cubes and sampling 0-1 knapsack solutions. SIAM Journal on Computing 34 (1), 195–226.

Murray, I., D. J. C. MacKay, Z. Ghahramani, and J. Skilling (2005). Nested sampling for Potts models. pp. 947–954.

Neal, R. (2001). Annealed importance sampling. Statistics and Computing 11 (2), 125–139.

Newman, M. and G. Barkema (1999). Monte Carlo Methods in Statistical Physics. Oxford, New York: Clarendon Press.

Noonan, J. (1998). New upper bounds for the connective constants of self-avoiding walks. Journal of Statistical Physics 91 (5), 871–888.

O'Hagan, A. (1991). Bayes-Hermite quadrature. Journal of Statistical Planning and Inference 29 (3), 245–260.

O'Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 57 (1), 99–138.

Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab. 20, 32 pp.

Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods (2 ed.). New York: Springer-Verlag.

Roberts, G. O., J. S. Rosenthal, et al. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys 1, 20–71.

Rubinstein, R. Y. and D. P. Kroese (2017). Simulation and the Monte Carlo Method (3 ed.). New York: John Wiley & Sons.

Rubinstein, R. Y., A. Ridder, and R. Vaisman (2013). Fast Sequential Monte Carlo Methods for Counting and Optimization. New York: John Wiley & Sons.


Russell, S. and P. Norvig (2009). Artificial Intelligence: A Modern Approach (3 ed.). Englewood Cliffs (N.J.): Prentice Hall.

Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Anal. 1 (4), 833–859.

Vaisman, R., D. P. Kroese, and I. B. Gertsbakh (2016). Splitting sequential Monte Carlo for efficient unreliability estimation of highly reliable networks. Structural Safety 63, 1–10.

Valiant, L. G. (1979). The complexity of enumeration and reliability problems. SIAM Journal on Computing 8 (3), 410–421.

Williams, E. (1959). Regression Analysis. New York: John Wiley & Sons.