Kernel Sequential Monte Carlo
Ingmar Schuster∗ (Paris Dauphine), Heiko Strathmann∗ (University College London)
Brooks Paige (Oxford), Dino Sejdinovic (Oxford)
* equal contribution
April 25, 2016
Section 1
Outline
1 Introduction
  Importance Sampling, PMC and SMC
  Intractable likelihoods
  Kernel emulators
2 Kernel SMC
3 Implementation Details
4 Evaluation
5 Conclusion
Section 2
Introduction
Importance Sampling, PMC and SMC
Importance Sampling estimators
Importance Sampling identity:

H = ∫ π(x) h(x) dx = ∫ [π(x)/q(x)] h(x) q(x) dx ≈ (1/w_Σ) ∑_{i=1}^{N} w(X_i) h(X_i)

where X_i ∼ q iid, w(X) = π(X)/q(X) is called the unnormalized importance weight, and w_Σ = ∑_{i=1}^{N} w(X_i).
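For concreteness, a minimal self-normalized importance sampling estimator in Python/NumPy (not from the slides; the target, proposal, and integrand below are illustrative choices):

```python
import numpy as np
from scipy import stats

def snis_estimate(log_pi, q_sample, q_logpdf, h, N=10_000, rng=None):
    """Self-normalized IS estimate of E_pi[h(X)] using proposal q."""
    rng = np.random.default_rng(rng)
    X = q_sample(rng, N)                 # X_i ~ q, iid
    log_w = log_pi(X) - q_logpdf(X)      # unnormalized log-weights
    w = np.exp(log_w - log_w.max())      # stabilize before exponentiating
    return np.sum(w * h(X)) / np.sum(w)  # (1 / w_Sigma) * sum_i w_i h(X_i)

# Example: N(0,1) target, Student-t proposal (fatter tails); estimates E[X^2] = 1
est = snis_estimate(
    log_pi=lambda x: -0.5 * x**2,        # unnormalized log-density of N(0,1)
    q_sample=lambda rng, n: rng.standard_t(df=3, size=n),
    q_logpdf=lambda x: stats.t.logpdf(x, df=3),
    h=lambda x: x**2,
)
```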
PMC identity: for any law g over proposals,

H = ∫∫ [π(x)/q_t(x)] h(x) dq_t(x) dg(q_t) ≈ (1/w_Σ) ∑_{t=1}^{T} ∑_{i=1}^{N} w_t(X_i) h(X_i)
Proposal fatter than target
Figure: target π(x) and a proposal q(x) with fatter tails, together with the resulting weight function w(x) = π(x)/q(x), which stays bounded.
Proposal thinner than target
Figure: target π(x) and a proposal q(x) with thinner tails, together with the resulting weight function w(x) = π(x)/q(x), which blows up in the tails.
Population Monte Carlo (Cappé et al., 2004)
Input: initial proposal density q_0, unnormalized density π, population size N, sample size m
Output: lists P, W of m samples and weights
Initialize P = List()
Initialize W = List()
while len(P) ≤ m do
  construct proposal distribution q_t
  generate a set of N samples X_t from q_t and append it to P
  for all X ∈ X_t, append weight π(X)/q_t(X) to W
end while
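A minimal runnable sketch of this loop in Python/NumPy. The pseudocode leaves the construction of q_t open; recentring a fixed-scale Gaussian at the highest-weight particle is one deliberately simple stand-in used here:

```python
import numpy as np

def pmc(log_pi, N=100, m=5000, d=2, scale=1.0, rng=None):
    """Minimal PMC; log_pi maps an (n, d) array to (n,) log-densities."""
    rng = np.random.default_rng(rng)
    mu = np.zeros(d)                      # centre of the current proposal q_t
    P, log_W = [], []
    while len(P) <= m:
        X = mu + scale * rng.standard_normal((N, d))             # X_t ~ q_t
        log_q = -0.5 * np.sum((X - mu) ** 2, axis=1) / scale**2  # log q_t, up to a constant
        log_w = log_pi(X) - log_q         # log of w = pi(X) / q_t(X)
        P.extend(X)
        log_W.extend(log_w)
        mu = X[np.argmax(log_w)]          # adapt q_{t+1} (one simple choice)
    log_W = np.asarray(log_W)
    # the constant dropped from log_q is shared across iterations,
    # so it cancels once the weights are self-normalized
    return np.asarray(P), np.exp(log_W - log_W.max())
```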
Sequential Monte Carlo Samplers
Approximate integrals with respect to a target distribution π_T
Build upon Importance Sampling: approximate the integral of h wrt density π_T using samples following density q (under certain conditions):

∫ h(x) dπ_T(x) = ∫ h(x) [π_T(x)/q(x)] dq(x)

Given a prior π_0, build a sequence π_0, …, π_i, …, π_T such that
  π_{i+1} is closer to π_T than π_i (δ(π_{i+1}, π_T) < δ(π_i, π_T) for some divergence δ)
  a sample from π_i can approximate π_{i+1} well using the importance weight function w(·) = π_{i+1}(·)/π_i(·)
Sequential Monte Carlo Samplers
At i = 0:
  using proposal density q_0, generate particles {(w_{0,j}, X_{0,j})}_{j=1}^N where w_{0,j} = π_0(X_{0,j})/q_0(X_{0,j})
  importance resampling, resulting in N equally weighted particles {(1/N, X_{0,j})}_{j=1}^N
  rejuvenation move for each X_{0,j} by a Markov kernel leaving π_0 invariant

At i > 0:
  approximate π_i by {(π_i(X_{i−1,j})/π_{i−1}(X_{i−1,j}), X_{i−1,j})}_{j=1}^N
  resampling (a sketch follows below)
  rejuvenation leaving π_i invariant
  if π_i ≠ π_T, repeat
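The resampling step takes only a few lines; a sketch of multinomial resampling (systematic resampling is a common lower-variance alternative):

```python
import numpy as np

def resample_multinomial(particles, weights, rng=None):
    """Draw N equally weighted particles with probability proportional to weight."""
    rng = np.random.default_rng(rng)
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights / weights.sum())  # ancestor indices
    return particles[idx], np.full(N, 1.0 / N)
```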
A visual SMC iteration
Figure: one SMC iteration on a one-dimensional example: target π_i(x) and previous distribution π_{i−1}(x); weighted samples; resampling proportional to weights; MCMC rejuvenation.
Sequential Monte Carlo Samplers
Estimate the evidence Z_T of π_T (aka normalizing constant, marginal likelihood) by

Z_T ≈ Z_0 ∏_{i=1}^{T} (1/N) ∑_j w_{i,j}

Can be adaptive in rejuvenation steps, without the diminishing adaptation required in adaptive MCMC
Will construct rejuvenation using an RKHS embedding of particles
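In code, the evidence estimate is a running sum of log mean incremental weights, best accumulated in log space; a sketch assuming one array of N log-weights per iteration:

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence(log_weights_per_iter, log_Z0=0.0):
    """log Z_T ~ log Z_0 + sum_i log( (1/N) sum_j w_{i,j} )."""
    log_Z = log_Z0
    for log_w in log_weights_per_iter:   # iteration i contributes its log mean weight
        log_Z += logsumexp(log_w) - np.log(len(log_w))
    return log_Z
```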
Intractable likelihoods
Intractable Likelihoods and Evidence
Intractable likelihoods arise in many models (e.g. nonconjugate latent variable models)
For unbiased likelihood estimates, SMC/PMC remains valid
Simple case: estimate the likelihood using IS or SMC, leading to IS² (Tran et al., 2013) and SMC² (Chopin et al., 2011)
This results in noisy importance weights, but the approximation of the evidence (the probability of the data given the model) is still valid (Tran et al., 2013, Lemma 3)
Cannot easily use information on the geometry of π for efficient inference (e.g. gradients unavailable)
Kernel emulators
In the following: adapt RKHS-based emulators to PMC and SMC in intractable-likelihood settings, in order to adapt to the target geometry
Using a positive definite (pd) kernel k(·, ·) we can
  adapt to the local covariance (Sejdinovic et al., 2014)
  use gradient information of an infinite exponential family approximation to π (Strathmann et al., 2015)
Emulators are used for constructing proposals q_t, with
  importance correction in PMC
  Metropolis-Hastings correction within SMC rejuvenation moves
for samples X ∼ q_t
Kernel Emulators
Local covariance: let k′(y, x) = ∇_y k(y, x) and μ(y) = ∫ k′(y, x) dπ(x); then

K(y) = ∫ (k′(y, x) − μ(y))² dπ(x)

Gradient emulation: fit an infinite exponential family approximation

q(y) = exp(f(y) − A(f))

where f(y) = ⟨f, k(y, ·)⟩_H is the inner product between the natural parameter f and the sufficient statistic k(y, ·) in H, by minimizing

∫ (∇_y log π(y) − ∇_y f(y))² dπ(y)

Use gradient information of log q in proposals
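For concreteness, k′(y, x) has a closed form for the Gaussian RBF kernel k(y, x) = exp(−‖y − x‖²/(2σ²)), namely k′(y, x) = −(y − x) k(y, x)/σ²; a sketch with the empirical plug-in estimate of μ(y) from samples of π:

```python
import numpy as np

def rbf_grad(y, X, sigma=1.0):
    """Row i holds grad_y k(y, x_i) for the Gaussian RBF kernel."""
    diff = y - X                                     # (n, d)
    k = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
    return -(diff / sigma**2) * k[:, None]           # (n, d)

def mu_hat(y, X, sigma=1.0):
    """Empirical estimate of mu(y) = E_pi[grad_y k(y, X)] from samples X of pi."""
    return rbf_grad(y, X, sigma).mean(axis=0)
```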
Section 3
Kernel SMC
Kernel Gradient Importance Sampling
Use N(· | X + δ₁ ∇_X log q(X), δ₂ C) proposals with importance weighting in PMC
C is a fit to the global covariance of the target π
Resulting in Kernel Gradient Importance Sampling (KGRIS)
A variant of Gradient Importance Sampling (Schuster, 2015)
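A sketch of one KGRIS propose-and-weight step under these definitions; `grad_log_q` stands in for the gradient of the fitted kernel emulator and `C` for the global covariance estimate (both assumptions of this example):

```python
import numpy as np

def kgris_step(X, log_pi, grad_log_q, C, delta1=0.5, delta2=0.1, rng=None):
    """Propose Y ~ N(X + delta1 * grad_log_q(X), delta2 * C), importance-weight it."""
    rng = np.random.default_rng(rng)
    mean = X + delta1 * grad_log_q(X)
    L = np.linalg.cholesky(delta2 * C)               # delta2 * C = L @ L.T
    Y = mean + L @ rng.standard_normal(len(X))
    z = np.linalg.solve(L, Y - mean)
    # log q(Y | X) up to the 2*pi constant, which cancels under self-normalization
    log_q = -0.5 * z @ z - np.log(np.diag(L)).sum()
    return Y, log_pi(Y) - log_q                      # sample and log importance weight
```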
Kernel Adaptive SMC Sampler
Use an artificial sequence of distributions leading from the prior π_0 to the posterior π_T
Rejuvenation with MH moves using N(· | X, δK(X)) proposals
Resulting in Kernel Adaptive SMC (KASMC)
Similar to the Adaptive SMC sampler (Fearnhead and Taylor, 2013), which is recovered as a special case when using a linear kernel
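A sketch of one such rejuvenation move; `K_at` stands in for the kernel-informed local covariance K(·) (an assumption of this example), and the Hastings correction accounts for the state-dependent proposal:

```python
import numpy as np

def kasmc_mh_step(x, log_pi_i, K_at, delta=1.0, rng=None):
    """One MH move with proposal N(. | x, delta * K(x)) targeting pi_i."""
    rng = np.random.default_rng(rng)

    def log_gauss(v, mean, cov):
        diff = v - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + len(v) * np.log(2 * np.pi))

    Kx = delta * K_at(x)
    y = rng.multivariate_normal(x, Kx)
    Ky = delta * K_at(y)
    log_alpha = (log_pi_i(y) - log_pi_i(x)                     # target ratio
                 + log_gauss(x, y, Ky) - log_gauss(y, x, Kx))  # Hastings correction
    return y if np.log(rng.uniform()) < log_alpha else x
```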
KASMC versus ASMC
Figure: green: ASMC / KASMC with a linear kernel; red: KASMC with a Gaussian RBF kernel.
Section 4
Implementation Details
Construction of Target Sequence
For the artificial distribution sequence we used a geometric bridge

π_i ∝ π_0^{1−ρ_i} π_T^{ρ_i}

where (ρ_i)_{i=1}^T is an increasing sequence satisfying ρ_T = 1
Another standard choice in Bayesian inference is adding data points one after another,

π_i(X) = π(X | d_1, …, d_{⌊ρ_i D⌋})

resulting in Iterated Batch Importance Sampling (IBIS; Chopin, 2002)
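In log space the geometric bridge is simply a convex combination of log π_0 and log π_T; a minimal sketch:

```python
import numpy as np

def bridge_logpdfs(log_pi0, log_piT, T=20):
    """Geometric bridge: log pi_i = (1 - rho_i) log pi_0 + rho_i log pi_T."""
    rhos = np.linspace(1.0 / T, 1.0, T)   # increasing, with rho_T = 1
    return [lambda x, r=rho: (1.0 - r) * log_pi0(x) + r * log_piT(x)
            for rho in rhos]
```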
Stochastic approximation, variance reduction
Free scaling parameters can be tuned for optimal scaling of MCMC using the stochastic approximation framework of Andrieu and Thoms (2008)
The asymptotically optimal acceptance rate for Random Walk MH is α_opt = 0.234 (Rosenthal, 2011); tune the single parameter δ_i by

δ_{i+1} = δ_i + λ_i (α_i − α_opt)

for a non-increasing sequence λ_1, …, λ_T
Used Random Fourier Features (Rahimi and Recht, 2007) for efficient online updates of the emulators
Used weighted updates and Rao-Blackwellization for variance reduction in the estimated emulators
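The scale update is one line of code; a sketch (the positivity floor is our addition, since the raw update can cross zero):

```python
def adapt_scale(delta, accept_rate, lam, alpha_opt=0.234):
    """Stochastic-approximation update: delta + lam * (alpha - alpha_opt)."""
    return max(delta + lam * (accept_rate - alpha_opt), 1e-8)  # keep the scale positive
```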
Section 5
Evaluation
Synthetic nonlinear target (Banana)
Synthetic target: Banana distribution in 8 dimensions, i.e. a Gaussian with a twisted second dimension

Figure: two-dimensional marginal of the Banana target.
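A sketch of one common parameterisation of such a target, the twisted Gaussian of Haario et al.; the constants b and v below are illustrative, not necessarily those used in the experiment:

```python
import numpy as np

def log_banana(x, b=0.03, v=100.0):
    """Unnormalized log-density of a d-dimensional Gaussian whose second
    coordinate is twisted by the first; x has shape (d,) or (n, d)."""
    y = np.array(x, dtype=float)
    y[..., 1] = x[..., 1] - b * (x[..., 0] ** 2 - v)         # undo the twist
    log_p = -0.5 * y[..., 0] ** 2 / v                        # first coordinate: N(0, v)
    log_p = log_p - 0.5 * np.sum(y[..., 1:] ** 2, axis=-1)   # rest: standard normal
    return log_p
```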
Compare performance of Random-Walk rejuvenation with asymptotically optimal scaling (ν = 2.38/√d), ASMC, and KASMC with a Gaussian RBF kernel
Fixed learning rate of λ = 0.1 to adapt the scale parameter using stochastic approximation
Geometric bridge of length 20
30 Monte Carlo runs
Report Maximum Mean Discrepancy (MMD) using a polynomial kernel of order 3: the distance of moments up to order 3 between ground-truth samples and the samples produced by each method (a sketch of the estimator follows)
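A sketch of the (biased, V-statistic) MMD² estimate with the polynomial kernel k(x, y) = (x·y + 1)³:

```python
import numpy as np

def mmd2_poly3(X, Y, c=1.0):
    """Biased MMD^2 between sample sets X (n, d) and Y (m, d)."""
    k = lambda A, B: (A @ B.T + c) ** 3   # polynomial kernel of order 3
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```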
Figure: Improved convergence of all mixed moments up to order 3 for KASMC compared to ASMC and RW-SMC.
Evidence approximation for intractable likelihoods
In classification using Gaussian Processes (GPs), the logistic transformation renders the likelihood intractable
The likelihood can be unbiasedly estimated using Importance Sampling from the EP approximation
Estimate the model evidence when using an ARD kernel in the GP
Particularly hard because noisy likelihoods mean noisy importance weights
Ground truth obtained by averaging evidence estimates over 20 long-running SMC algorithms
Figure: Monte Carlo variance of the evidence estimate against the number of particles (log scale); KASMC in blue, ASMC in green.
Stochastic volatility model with intractable likelihood
Stochastic volatility models are a particularly challenging class of Bayesian inverse problems
The time series acts as a high-dimensional nuisance variable
Models have to capture the non-linearities in the data (Barndorff-Nielsen and Shephard, 2001)
We concentrate on the prediction of daily volatility of asset prices, reusing the model and dataset studied by Chopin et al. (2011) (nuisance of dimension d = 753)
Report the RMSE of the target covariance estimate
KGRIS with Stochastic volatility
Figure: RMSE of the target covariance estimate against population size (5 to 50); legend: AM, KGIS.
Section 6
Conclusion
Conclusion (1)
Developed the Kernel SMC framework
KSMC exploits kernel emulators of the target's structure
Combines these with the general SMC/PMC advantages for multimodal targets and evidence estimation
Especially attractive when likelihoods are intractable
Conclusion (2)
Evaluated on several challenging models, where it clearly improved statistical efficiency
KASMC exhibits better MMD on the Banana target
KASMC shows less MC variance than ASMC in evidence estimation for GP classification
KGRIS clearly improves covariance estimates in the stochastic volatility model
Thanks!
Literature I
Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373.
Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society, Series B, 63(2):167–241.
Cappé, O., Guillin, A., Marin, J. M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.
Literature II
Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC²: an efficient algorithm for sequential analysis of state-space models. arXiv preprint.
Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, 8(2):411–438.
Rahimi, A. and Recht, B. (2007). Random Features for Large-Scale Kernel Machines. In Neural Information Processing Systems.
Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.
Literature III
Schuster, I. (2015). Consistency of Importance Sampling estimates based on dependent sample sets and an application to models with factorizing likelihoods. arXiv preprint.
Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. arXiv preprint.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. (2015). Gradient-free Hamiltonian Monte Carlo with efficient Kernel Exponential Families. In Neural Information Processing Systems.
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint.