Kernel Sequential Monte Carlo
Ingmar Schuster∗ (Paris Dauphine), Heiko Strathmann∗ (University College London)
Brooks Paige (Oxford), Dino Sejdinovic (Oxford)
* equal contribution
April 25, 2016
Section 1
Outline
1 Introduction
  Importance Sampling, PMC and SMC
  Intractable likelihoods
  Kernel emulators
2 Kernel SMC
3 Implementation Details
4 Evaluation
5 Conclusion
Section 2
Introduction
Importance Sampling, PMC and SMC
Importance Sampling estimators
Importance Sampling identity:

H = ∫ π(x) h(x) dx = ∫ [π(x)/q(x)] h(x) q(x) dx ≈ (1/w_Σ) ∑_{i=1}^{N} w(X_i) h(X_i)

where X_i ∼ q iid, w(X) = π(X)/q(X) is called the unnormalized importance weight, and w_Σ = ∑_{i=1}^{N} w(X_i).
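For concreteness, a minimal self-normalized importance sampling estimator in Python/NumPy (not from the slides; the target, proposal, and integrand below are illustrative choices):

```python
import numpy as np
from scipy import stats

def snis_estimate(log_pi, q_sample, q_logpdf, h, N=10_000, rng=None):
    """Self-normalized IS estimate of E_pi[h(X)] using proposal q."""
    rng = np.random.default_rng(rng)
    X = q_sample(rng, N)                 # X_i ~ q, iid
    log_w = log_pi(X) - q_logpdf(X)      # unnormalized log-weights
    w = np.exp(log_w - log_w.max())      # stabilize before exponentiating
    return np.sum(w * h(X)) / np.sum(w)  # (1 / w_Sigma) * sum_i w_i h(X_i)

# Example: N(0,1) target, Student-t proposal (fatter tails); estimates E[X^2] = 1
est = snis_estimate(
    log_pi=lambda x: -0.5 * x**2,        # unnormalized log-density of N(0,1)
    q_sample=lambda rng, n: rng.standard_t(df=3, size=n),
    q_logpdf=lambda x: stats.t.logpdf(x, df=3),
    h=lambda x: x**2,
)
```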
PMC identity: for any law g over proposals,

H = ∫∫ [π(x)/q_t(x)] h(x) dq_t(x) dg(q_t) ≈ (1/w_Σ) ∑_{t=1}^{T} ∑_{i=1}^{N} w_t(X_i) h(X_i)
Proposal fatter than target
Figure: target π(x) and a proposal q(x) with fatter tails, together with the resulting weight function w(x) = π(x)/q(x), which stays bounded.
Proposal thinner than target
Figure: target π(x) and a proposal q(x) with thinner tails, together with the resulting weight function w(x) = π(x)/q(x), which blows up in the tails.
Population Monte Carlo (Cappé et al., 2004)
Input: initial proposal density q_0, unnormalized density π, population size N, sample size m
Output: lists P, W of m samples and weights
Initialize P = List()
Initialize W = List()
while len(P) ≤ m do
  construct proposal distribution q_t
  generate a set of N samples X_t from q_t and append it to P
  for all X ∈ X_t, append weight π(X)/q_t(X) to W
end while
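A minimal runnable sketch of this loop in Python/NumPy. The pseudocode leaves the construction of q_t open; recentring a fixed-scale Gaussian at the highest-weight particle is one deliberately simple stand-in used here:

```python
import numpy as np

def pmc(log_pi, N=100, m=5000, d=2, scale=1.0, rng=None):
    """Minimal PMC; log_pi maps an (n, d) array to (n,) log-densities."""
    rng = np.random.default_rng(rng)
    mu = np.zeros(d)                      # centre of the current proposal q_t
    P, log_W = [], []
    while len(P) <= m:
        X = mu + scale * rng.standard_normal((N, d))             # X_t ~ q_t
        log_q = -0.5 * np.sum((X - mu) ** 2, axis=1) / scale**2  # log q_t, up to a constant
        log_w = log_pi(X) - log_q         # log of w = pi(X) / q_t(X)
        P.extend(X)
        log_W.extend(log_w)
        mu = X[np.argmax(log_w)]          # adapt q_{t+1} (one simple choice)
    log_W = np.asarray(log_W)
    # the constant dropped from log_q is shared across iterations,
    # so it cancels once the weights are self-normalized
    return np.asarray(P), np.exp(log_W - log_W.max())
```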
Sequential Monte Carlo Samplers
Approximate integrals with respect to a target distribution π_T
Build upon Importance Sampling: approximate the integral of h wrt density π_T using samples following density q (under certain conditions):

∫ h(x) dπ_T(x) = ∫ h(x) [π_T(x)/q(x)] dq(x)

Given a prior π_0, build a sequence π_0, …, π_i, …, π_T such that
  π_{i+1} is closer to π_T than π_i (δ(π_{i+1}, π_T) < δ(π_i, π_T) for some divergence δ)
  a sample from π_i can approximate π_{i+1} well using the importance weight function w(·) = π_{i+1}(·)/π_i(·)
Sequential Monte Carlo Samplers
At i = 0:
  using proposal density q_0, generate particles {(w_{0,j}, X_{0,j})}_{j=1}^N where w_{0,j} = π_0(X_{0,j})/q_0(X_{0,j})
  importance resampling, resulting in N equally weighted particles {(1/N, X_{0,j})}_{j=1}^N
  rejuvenation move for each X_{0,j} by a Markov kernel leaving π_0 invariant

At i > 0:
  approximate π_i by {(π_i(X_{i−1,j})/π_{i−1}(X_{i−1,j}), X_{i−1,j})}_{j=1}^N
  resampling (a sketch follows below)
  rejuvenation leaving π_i invariant
  if π_i ≠ π_T, repeat
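The resampling step takes only a few lines; a sketch of multinomial resampling (systematic resampling is a common lower-variance alternative):

```python
import numpy as np

def resample_multinomial(particles, weights, rng=None):
    """Draw N equally weighted particles with probability proportional to weight."""
    rng = np.random.default_rng(rng)
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights / weights.sum())  # ancestor indices
    return particles[idx], np.full(N, 1.0 / N)
```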
A visual SMC iteration
Figure: one SMC iteration on a one-dimensional example: target π_i(x) and previous distribution π_{i−1}(x); weighted samples; resampling proportional to weights; MCMC rejuvenation.
Sequential Monte Carlo Samplers
Estimate the evidence Z_T of π_T (aka normalizing constant, marginal likelihood) by

Z_T ≈ Z_0 ∏_{i=1}^{T} (1/N) ∑_j w_{i,j}

Can be adaptive in rejuvenation steps, without the diminishing adaptation required in adaptive MCMC
Will construct rejuvenation using an RKHS embedding of particles
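In code, the evidence estimate is a running sum of log mean incremental weights, best accumulated in log space; a sketch assuming one array of N log-weights per iteration:

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence(log_weights_per_iter, log_Z0=0.0):
    """log Z_T ~ log Z_0 + sum_i log( (1/N) sum_j w_{i,j} )."""
    log_Z = log_Z0
    for log_w in log_weights_per_iter:   # iteration i contributes its log mean weight
        log_Z += logsumexp(log_w) - np.log(len(log_w))
    return log_Z
```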
Intractable likelihoods
Intractable Likelihoods and Evidence
Intractable likelihoods arise in many models (e.g. nonconjugate latent variable models)
For unbiased likelihood estimates, SMC/PMC remains valid
Simple case: estimate the likelihood using IS or SMC, leading to IS² (Tran et al., 2013) and SMC² (Chopin et al., 2011)
This results in noisy importance weights, but the approximation of the evidence (the probability of the data given the model) is still valid (Tran et al., 2013, Lemma 3)
Cannot easily use information on the geometry of π for efficient inference (e.g. gradients unavailable)
Kernel emulators
In the following: adapt RKHS-based emulators to PMC and SMC in intractable-likelihood settings, in order to adapt to the target geometry
Using a positive definite (pd) kernel k(·, ·) we can
  adapt to the local covariance (Sejdinovic et al., 2014)
  use gradient information of an infinite exponential family approximation to π (Strathmann et al., 2015)
Emulators are used for constructing proposals q_t, with
  importance correction in PMC
  Metropolis-Hastings correction within SMC rejuvenation moves
for samples X ∼ q_t
Kernel Emulators
Local covariance: let k′(y, x) = ∇_y k(y, x) and μ(y) = ∫ k′(y, x) dπ(x); then

K(y) = ∫ (k′(y, x) − μ(y))² dπ(x)

Gradient emulation: fit an infinite exponential family approximation

q(y) = exp(f(y) − A(f))

where f(y) = ⟨f, k(y, ·)⟩_H is the inner product between the natural parameter f and the sufficient statistic k(y, ·) in H, by minimizing

∫ (∇_y log π(y) − ∇_y f(y))² dπ(y)

Use gradient information of log q in proposals
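For concreteness, k′(y, x) has a closed form for the Gaussian RBF kernel k(y, x) = exp(−‖y − x‖²/(2σ²)), namely k′(y, x) = −(y − x) k(y, x)/σ²; a sketch with the empirical plug-in estimate of μ(y) from samples of π:

```python
import numpy as np

def rbf_grad(y, X, sigma=1.0):
    """Row i holds grad_y k(y, x_i) for the Gaussian RBF kernel."""
    diff = y - X                                     # (n, d)
    k = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
    return -(diff / sigma**2) * k[:, None]           # (n, d)

def mu_hat(y, X, sigma=1.0):
    """Empirical estimate of mu(y) = E_pi[grad_y k(y, X)] from samples X of pi."""
    return rbf_grad(y, X, sigma).mean(axis=0)
```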
Section 3
Kernel SMC
Kernel Gradient Importance Sampling
Use N(· | X + δ₁ ∇_X log q(X), δ₂ C) proposals with importance weighting in PMC
C is a fit to the global covariance of the target π
Resulting in Kernel Gradient Importance Sampling (KGRIS)
A variant of Gradient Importance Sampling (Schuster, 2015)
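A sketch of one KGRIS propose-and-weight step under these definitions; `grad_log_q` stands in for the gradient of the fitted kernel emulator and `C` for the global covariance estimate (both assumptions of this example):

```python
import numpy as np

def kgris_step(X, log_pi, grad_log_q, C, delta1=0.5, delta2=0.1, rng=None):
    """Propose Y ~ N(X + delta1 * grad_log_q(X), delta2 * C), importance-weight it."""
    rng = np.random.default_rng(rng)
    mean = X + delta1 * grad_log_q(X)
    L = np.linalg.cholesky(delta2 * C)               # delta2 * C = L @ L.T
    Y = mean + L @ rng.standard_normal(len(X))
    z = np.linalg.solve(L, Y - mean)
    # log q(Y | X) up to the 2*pi constant, which cancels under self-normalization
    log_q = -0.5 * z @ z - np.log(np.diag(L)).sum()
    return Y, log_pi(Y) - log_q                      # sample and log importance weight
```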
Kernel Adaptive SMC Sampler
Use an artificial sequence of distributions leading from the prior π_0 to the posterior π_T
Rejuvenation with MH moves using N(· | X, δK(X)) proposals
Resulting in Kernel Adaptive SMC (KASMC)
Similar to the Adaptive SMC sampler (Fearnhead and Taylor, 2013), which is recovered as a special case when using a linear kernel
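A sketch of one such rejuvenation move; `K_at` stands in for the kernel-informed local covariance K(·) (an assumption of this example), and the Hastings correction accounts for the state-dependent proposal:

```python
import numpy as np

def kasmc_mh_step(x, log_pi_i, K_at, delta=1.0, rng=None):
    """One MH move with proposal N(. | x, delta * K(x)) targeting pi_i."""
    rng = np.random.default_rng(rng)

    def log_gauss(v, mean, cov):
        diff = v - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + len(v) * np.log(2 * np.pi))

    Kx = delta * K_at(x)
    y = rng.multivariate_normal(x, Kx)
    Ky = delta * K_at(y)
    log_alpha = (log_pi_i(y) - log_pi_i(x)                     # target ratio
                 + log_gauss(x, y, Ky) - log_gauss(y, x, Kx))  # Hastings correction
    return y if np.log(rng.uniform()) < log_alpha else x
```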
KASMC versus ASMC
Figure: green: ASMC / KASMC with a linear kernel; red: KASMC with a Gaussian RBF kernel.
Section 4
Implementation Details
Construction of Target Sequence
For the artificial distribution sequence we used a geometric bridge

π_i ∝ π_0^{1−ρ_i} π_T^{ρ_i}

where (ρ_i)_{i=1}^T is an increasing sequence satisfying ρ_T = 1
Another standard choice in Bayesian inference is adding data points one after another,

π_i(X) = π(X | d_1, …, d_{⌊ρ_i D⌋})

resulting in Iterated Batch Importance Sampling (IBIS; Chopin, 2002)
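In log space the geometric bridge is simply a convex combination of log π_0 and log π_T; a minimal sketch:

```python
import numpy as np

def bridge_logpdfs(log_pi0, log_piT, T=20):
    """Geometric bridge: log pi_i = (1 - rho_i) log pi_0 + rho_i log pi_T."""
    rhos = np.linspace(1.0 / T, 1.0, T)   # increasing, with rho_T = 1
    return [lambda x, r=rho: (1.0 - r) * log_pi0(x) + r * log_piT(x)
            for rho in rhos]
```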
Stochastic approximation, variance reduction
Free scaling parameters can be tuned for optimal scaling of MCMC using the stochastic approximation framework of Andrieu and Thoms (2008)
The asymptotically optimal acceptance rate for Random Walk MH is α_opt = 0.234 (Rosenthal, 2011); tune the single parameter δ_i by

δ_{i+1} = δ_i + λ_i (α_i − α_opt)

for a non-increasing sequence λ_1, …, λ_T
Used Random Fourier Features (Rahimi and Recht, 2007) for efficient online updates of the emulators
Used weighted updates and Rao-Blackwellization for variance reduction in the estimated emulators
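The scale update is one line of code; a sketch (the positivity floor is our addition, since the raw update can cross zero):

```python
def adapt_scale(delta, accept_rate, lam, alpha_opt=0.234):
    """Stochastic-approximation update: delta + lam * (alpha - alpha_opt)."""
    return max(delta + lam * (accept_rate - alpha_opt), 1e-8)  # keep the scale positive
```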
Section 5
Evaluation
Synthetic nonlinear target (Banana)
Synthetic target: Banana distribution in 8 dimensions, i.e. a Gaussian with a twisted second dimension

Figure: two-dimensional marginal of the Banana target.
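A sketch of one common parameterisation of such a target, the twisted Gaussian of Haario et al.; the constants b and v below are illustrative, not necessarily those used in the experiment:

```python
import numpy as np

def log_banana(x, b=0.03, v=100.0):
    """Unnormalized log-density of a d-dimensional Gaussian whose second
    coordinate is twisted by the first; x has shape (d,) or (n, d)."""
    y = np.array(x, dtype=float)
    y[..., 1] = x[..., 1] - b * (x[..., 0] ** 2 - v)         # undo the twist
    log_p = -0.5 * y[..., 0] ** 2 / v                        # first coordinate: N(0, v)
    log_p = log_p - 0.5 * np.sum(y[..., 1:] ** 2, axis=-1)   # rest: standard normal
    return log_p
```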
Compare performance of Random-Walk rejuvenation with asymptotically optimal scaling (ν = 2.38/√d), ASMC, and KASMC with a Gaussian RBF kernel
Fixed learning rate of λ = 0.1 to adapt the scale parameter using stochastic approximation
Geometric bridge of length 20
30 Monte Carlo runs
Report Maximum Mean Discrepancy (MMD) using a polynomial kernel of order 3: the distance of moments up to order 3 between ground-truth samples and the samples produced by each method (a sketch of the estimator follows)
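A sketch of the (biased, V-statistic) MMD² estimate with the polynomial kernel k(x, y) = (x·y + 1)³:

```python
import numpy as np

def mmd2_poly3(X, Y, c=1.0):
    """Biased MMD^2 between sample sets X (n, d) and Y (m, d)."""
    k = lambda A, B: (A @ B.T + c) ** 3   # polynomial kernel of order 3
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```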
Figure: Improved convergence of all mixed moments up to order 3 for KASMC compared to ASMC and RW-SMC.
Evidence approximation for intractable likelihoods
In classification using Gaussian Processes (GPs), the logistic transformation renders the likelihood intractable
The likelihood can be unbiasedly estimated using Importance Sampling from the EP approximation
Estimate the model evidence when using an ARD kernel in the GP
Particularly hard because noisy likelihoods mean noisy importance weights
Ground truth obtained by averaging evidence estimates over 20 long-running SMC algorithms
Figure: Monte Carlo variance of the evidence estimate against the number of particles (log scale); KASMC in blue, ASMC in green.
Stochastic volatility model with intractable likelihood
Stochastic volatility models are a particularly challenging class of Bayesian inverse problems
The time series acts as a high-dimensional nuisance variable
Models have to capture the non-linearities in the data (Barndorff-Nielsen and Shephard, 2001)
We concentrate on the prediction of daily volatility of asset prices, reusing the model and dataset studied by Chopin et al. (2011) (nuisance of dimension d = 753)
Report the RMSE of the target covariance estimate
KGRIS with Stochastic volatility
Figure: RMSE of the target covariance estimate against population size (5 to 50); legend: AM, KGIS.
Section 6
Conclusion
Conclusion (1)
Developed the Kernel SMC framework
KSMC exploits kernel emulators of the target's structure
Combines these with the general SMC/PMC advantages for multimodal targets and evidence estimation
Especially attractive when likelihoods are intractable
Conclusion (2)
Evaluated on several challenging models, where it clearly improved statistical efficiency
KASMC exhibits better MMD on the Banana target
KASMC shows less MC variance than ASMC in evidence estimation for GP classification
KGRIS clearly improves covariance estimates in the stochastic volatility model
Thanks!
Literature I
Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373.
Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society, Series B, 63(2):167–241.
Cappé, O., Guillin, A., Marin, J. M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.
Literature II
Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC²: an efficient algorithm for sequential analysis of state-space models. arXiv preprint.
Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, 8(2):411–438.
Rahimi, A. and Recht, B. (2007). Random Features for Large-Scale Kernel Machines. In Neural Information Processing Systems.
Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.
Literature III
Schuster, I. (2015). Consistency of Importance Sampling estimates based on dependent sample sets and an application to models with factorizing likelihoods. arXiv preprint.
Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. arXiv preprint.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. (2015). Gradient-free Hamiltonian Monte Carlo with efficient Kernel Exponential Families. In Neural Information Processing Systems.
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint.