Kernel adaptive Sequential Monte Carlo · 11/3/2015 · Applied problem: infer locations of S = 3...


Kernel adaptive Sequential Monte Carlo

Ingmar Schuster (Paris Dauphine), Heiko Strathmann (University College London)

Brooks Paige (Oxford), Dino Sejdinovic (Oxford)

December 7, 2015


Section 1

Outline


1 Introduction

2 Kernel Adaptive SMC (KASS)

3 Implementation Details

4 Evaluation

5 Conclusion


Section 2

Introduction


Sequential Monte Carlo Samplers

Approximate integrals with respect to target distribution πT

Build upon Importance Sampling: approximate integral of h wrt density π_T using samples following density q (under certain conditions):

$$\int h(x)\, d\pi_T(x) = \int h(x)\, \frac{\pi_T(x)}{q(x)}\, dq(x)$$
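A minimal numerical sketch of the importance sampling identity above. The Gaussian target and proposal, the sample size, and the function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def importance_sampling_estimate(h, log_target, log_proposal, samples):
    """Estimate the integral of h under the target from proposal samples."""
    weights = np.exp(log_target(samples) - log_proposal(samples))
    return np.mean(weights * h(samples))

rng = np.random.default_rng(0)
# Hypothetical setup: target N(1, 1), wider proposal q = N(0, 2^2).
samples = rng.normal(0.0, 2.0, size=200_000)
estimate = importance_sampling_estimate(
    h=lambda x: x,
    log_target=lambda x: log_normal_pdf(x, 1.0, 1.0),
    log_proposal=lambda x: log_normal_pdf(x, 0.0, 2.0),
    samples=samples,
)
print(estimate)  # should be close to the target mean
```

The proposal is chosen wider than the target so that the weights stay bounded, which is one way to meet the "certain conditions" mentioned above.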

Given prior π_0, build sequence π_0, …, π_i, …, π_T such that

π_{i+1} is closer to π_T than π_i (δ(π_{i+1}, π_T) < δ(π_i, π_T) for some divergence δ)

sample from π_i can approximate π_{i+1} well using importance weight function w(·) = π_{i+1}(·)/π_i(·)

At i = 0

Using proposal density q_0, generate particles {(w_{0,j}, X_{0,j})}_{j=1}^N where w_{0,j} = π_0(X_{0,j})/q_0(X_{0,j})

importance resampling, resulting in N equally weighted particles {(1/N, X̄_{0,j})}_{j=1}^N

rejuvenation move for each X̄_{0,j} by Markov Kernel leaving π_0 invariant

At i > 0

approximate π_i by {(π_i(X_{i−1,j})/π_{i−1}(X_{i−1,j}), X_{i−1,j})}_{j=1}^N

resampling

rejuvenation leaving π_i invariant

if π_i ≠ π_T, repeat
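The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rejuvenation move is a plain random-walk Metropolis step (not the kernel-based proposal developed later), and the toy prior, target, tempering schedule and step size are hypothetical.

```python
import numpy as np

def smc_sampler(log_targets, sample_q0, log_q0, n_particles, step_size, rng):
    """Run the SMC steps over log densities [log pi_0, ..., log pi_T].

    Densities may be unnormalized; particles have shape (n_particles, dim).
    """
    # i = 0: weighted particles from the proposal q0, targeting pi_0
    x = sample_q0(n_particles, rng)
    log_w = log_targets[0](x) - log_q0(x)
    for i, log_pi in enumerate(log_targets):
        if i > 0:
            # i > 0: reweight particles from pi_{i-1} to approximate pi_i
            log_w = log_pi(x) - log_targets[i - 1](x)
        # importance resampling gives N equally weighted particles
        w = np.exp(log_w - np.max(log_w))
        x = x[rng.choice(n_particles, size=n_particles, p=w / np.sum(w))]
        # rejuvenation: one random-walk Metropolis step, leaving pi_i invariant
        proposal = x + step_size * rng.standard_normal(x.shape)
        log_ratio = log_pi(proposal) - log_pi(x)
        accept = np.log(rng.uniform(size=n_particles)) < log_ratio
        x = np.where(accept[:, None], proposal, x)
    return x

# Toy example (all numbers hypothetical): move from N(0, 3^2) to N(2, 0.5^2).
def log_prior(x):
    return -0.5 * np.sum(x ** 2, axis=1) / 3.0 ** 2

def log_posterior(x):
    return -0.5 * np.sum((x - 2.0) ** 2, axis=1) / 0.5 ** 2

def tempered(rho):
    return lambda x: (1.0 - rho) * log_prior(x) + rho * log_posterior(x)

rng = np.random.default_rng(1)
log_targets = [tempered(rho) for rho in np.linspace(0.0, 1.0, 11)]
particles = smc_sampler(
    log_targets,
    sample_q0=lambda n, generator: 3.0 * generator.standard_normal((n, 1)),
    log_q0=log_prior,
    n_particles=2000,
    step_size=0.5,
    rng=rng,
)
print(particles.mean(), particles.std())
```

Resampling at every iteration keeps the sketch close to the slide; practical samplers often resample only when the effective sample size drops.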


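estimate evidence Z_T of π_T (aka normalizing constant, marginal likelihood) by

$$Z_T \approx Z_0 \prod_{i=1}^{T} \frac{1}{N} \sum_{j=1}^{N} w_{i,j}, \qquad w_{i,j} = \frac{\pi_i(X_{i-1,j})}{\pi_{i-1}(X_{i-1,j})}$$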

Can be adaptive in rejuvenation steps without diminishing adaptation as required in adaptive MCMC

Will construct rejuvenation using RKHS-embedding of particles
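For the evidence estimate on this slide, a small sketch may help. It assumes the usual SMC estimator in which each factor of the product is the average incremental importance weight of equally weighted particles, computed from unnormalized densities; the function names and the one-step Gaussian example are hypothetical.

```python
import numpy as np

def log_mean_exp(log_w):
    """Numerically stable log of the average of exp(log_w)."""
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

def log_evidence(log_z0, incremental_log_weights):
    """log Z_T = log Z_0 + sum over iterations of log(mean incremental weight)."""
    return log_z0 + sum(log_mean_exp(lw) for lw in incremental_log_weights)

rng = np.random.default_rng(2)
# One-step check: pi_0 = N(0, 1) and pi_1 = N(0, 0.5^2), both unnormalized.
x = rng.standard_normal(100_000)                # exact draws from pi_0
log_w = (-0.5 * x ** 2 / 0.25) - (-0.5 * x ** 2)  # log pi_1(x) - log pi_0(x)
log_z1 = log_evidence(0.5 * np.log(2 * np.pi), [log_w])
print(np.exp(log_z1), 0.5 * np.sqrt(2 * np.pi))  # estimate vs. exact constant
```

Working in log space avoids underflow when the product runs over many bridge steps.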


Intractable Likelihoods and Evidence

in nonconjugate latent variable models, intractable likelihoods arise

when likelihood can be estimated unbiasedly, SMC still valid

simple case: estimate likelihood using IS or SMC, leads to IS² (Tran et al., 2013) and SMC² (Chopin et al., 2011)

results in noisy Importance Weights, but evidence approximation is still valid (Tran et al., 2013, Lemma 3)

Nonlinear proposals based on positive definite Kernels

Kernel Adaptive Metropolis Hastings (KAMH) was introduced in Sejdinovic et al. (2014)

Given previous samples from target distribution π, draw new ones more efficiently

Each sample mapped to functional in Reproducing Kernel Hilbert Space (RKHS) H_k using pd kernel k(·, ·)

Fit Gaussian q_k in H_k with mean

$$\mu = \int k(\cdot, x)\, d\pi(x) \approx \frac{1}{n} \sum_{i=1}^{n} k(\cdot, X_i)$$

and covariance

$$\int k(\cdot, x) \otimes k(\cdot, x)\, d\pi(x) - \mu \otimes \mu$$
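The empirical versions of these two RKHS quantities can be evaluated through kernel matrices. The sketch below is illustrative only: the function names and the bandwidth parameter are assumptions, and the centred Gram matrix is used as a finite-sample stand-in for the covariance operator (it holds the inner products of the centred feature maps and shares its non-zero spectrum with n times the empirical covariance operator).

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix exp(-||a_i - b_j||^2 / bandwidth)."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / bandwidth)

def mean_embedding(query, samples, kernel):
    """Evaluate the empirical mean embedding (1/n) sum_i k(., X_i) at query points."""
    return kernel(query, samples).mean(axis=1)

def centred_gram(samples, kernel):
    """Inner products of centred feature maps k(., X_i) - mu_hat."""
    n = samples.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ kernel(samples, samples) @ centering

rng = np.random.default_rng(5)
samples = rng.standard_normal((100, 3))
query = rng.standard_normal((4, 3))
embedding_at_query = mean_embedding(query, samples, rbf_kernel)
gram = centred_gram(samples, rbf_kernel)
```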


Nonlinear proposals based on positive definite Kernels

Draw sample from q_k and project back into original space, use as proposal in MH

KAMH set in adaptive MCMC, using vanishing adaptation (e.g. vanishing probability to use new samples for computing adaptive proposal)

Depending on used positive definite kernel, can adapt to nonlinear targets

Section 3

Kernel Adaptive SMC (KASS)


Adaptive SMC Sampler

SMC works on a sequence of targets, so we use an artificial sequence of distributions leading from prior π_0 to posterior π_T

parameters of rejuvenation kernel can be adapted before rejuvenation

Fearnhead and Taylor (2013) used global Gaussian approximation as proposal in Metropolis Hastings rejuvenation

resulting in adaptive SMC sampler (ASMC)

Kernel adaptive rejuvenation

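instead, we use RKHS-proposal projected into input space (in closed form)

given unweighted particles {X̄_i}_{i=1}^N, proposal at X̄_j is

$$q_{\mathrm{KAMH}}(\cdot \mid \bar{X}_j) = \mathcal{N}\left(\cdot \mid \bar{X}_j,\; \nu^2 M_{X,\bar{X}_j} C M_{X,\bar{X}_j}^\top + \gamma^2 I\right)$$

where $C = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is centering matrix and

$$M_{X,\bar{X}_j} = 2\left[\nabla_x k(x, \bar{X}_1)|_{x=\bar{X}_j}, \ldots, \nabla_x k(x, \bar{X}_N)|_{x=\bar{X}_j}\right]$$

results in ASMC using linear kernel

$$k(X, X') = X^\top X'$$

locally adaptive fit using Gaussian RBF

$$k(X, X') = \exp\left(-\|X - X'\|^2\right)$$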
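A sketch of how the proposal covariance on this slide could be assembled with NumPy. The function names are hypothetical, and the `bandwidth` argument of the RBF kernel is an added assumption (a value of 1 matches the kernel as written on the slide).

```python
import numpy as np

def centering_matrix(n):
    """C = I - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n

def grad_matrix_linear(particles):
    """M for k(x, x') = x^T x': the gradient wrt x is x', independent of the current point."""
    return 2.0 * particles.T                      # shape (dim, N)

def grad_matrix_rbf(particles, y, bandwidth=1.0):
    """M for k(x, x') = exp(-||x - x'||^2 / bandwidth), evaluated at x = y."""
    diff = particles - y                          # rows are X_i - y
    k = np.exp(-np.sum(diff ** 2, axis=1) / bandwidth)
    grads = (2.0 / bandwidth) * diff * k[:, None]  # gradient of k(x, X_i) at x = y
    return 2.0 * grads.T                          # shape (dim, N)

def kamh_covariance(M, nu2, gamma2):
    """Proposal covariance nu^2 M C M^T + gamma^2 I."""
    dim, n = M.shape
    return nu2 * M @ centering_matrix(n) @ M.T + gamma2 * np.eye(dim)

rng = np.random.default_rng(3)
particles = rng.standard_normal((50, 2))
y = particles[0]
cov_linear = kamh_covariance(grad_matrix_linear(particles), nu2=1.0, gamma2=0.0)
cov_rbf = kamh_covariance(grad_matrix_rbf(particles, y), nu2=1.0, gamma2=0.1)
proposal_draw = rng.multivariate_normal(y, cov_rbf)
```

With the linear kernel the matrix M C Mᵀ equals 4N times the empirical particle covariance, which is why that choice gives a global Gaussian proposal as in ASMC. With the RBF kernel, particles far from the current point contribute almost nothing, so the covariance adapts locally.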

KASS versus ASMC

green: ASMC / KASS with linear kernel

red: KASS with Gaussian RBF kernel

Related Work

Most direct relation to ASMC (which is a special case)

All SMC samplers related to Annealed Importance Sampling, which however does not use resampling (Neal, 1998)

Local Adaptive Importance Sampling (Givens and Raftery, 1996, LAIS) has similar locally adaptive effect

at each iteration compute pairwise distances between Importance Samples

use k nearest neighbors for fitting local Gaussian proposal

no resampling steps mean decrease in sampling efficiency which is exponential in dimensionality of problem

Section 4

Implementation Details


Construction of Target Sequence

For artificial distribution sequence we used geometric bridge

$$\pi_i \propto \pi_0^{1-\rho_i}\, \pi_T^{\rho_i}$$

where (ρ_i)_{i=1}^T is an increasing sequence satisfying ρ_T = 1

another standard choice in Bayesian Inference is adding datapoints one after another

$$\pi_i(X) = \pi(X \mid d_1, \ldots, d_{\lfloor \rho_i D \rfloor})$$

resulting in Iterated Batch Importance Sampling (Chopin, 2002, IBIS)
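In log space the geometric bridge is a convex combination of the two log densities. A minimal sketch, assuming a linearly spaced schedule (the slides only require an increasing sequence ending at 1) and hypothetical Gaussian prior and posterior:

```python
import numpy as np

def geometric_bridge(log_prior, log_target, rhos):
    """Return log densities of pi_i ∝ pi_0^(1 - rho_i) * pi_T^(rho_i), one per rho_i."""
    def make(rho):
        return lambda x: (1.0 - rho) * log_prior(x) + rho * log_target(x)
    return [make(rho) for rho in rhos]

# Hypothetical unnormalized log densities, for illustration only.
log_prior = lambda x: -0.5 * x ** 2 / 9.0
log_target = lambda x: -0.5 * (x - 2.0) ** 2

rhos = np.linspace(0.0, 1.0, 21)[1:]   # rho_1 < ... < rho_T = 1 with T = 20
bridge = geometric_bridge(log_prior, log_target, rhos)
grid = np.linspace(-3.0, 5.0, 9)
```

The inner `make` function binds each `rho` at creation time; building the lambdas directly inside the list comprehension would make every entry use the last value.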


Stochastic approximation tuning of ν²

KASS’ free scaling parameter ν² can be tuned for optimal scaling

Fearnhead and Taylor (2013) use auxiliary variable approach with ESJD criterion

We used stochastic approximation framework of Andrieu and Thoms (2008) instead

asymptotically optimal acceptance rate for Random Walk proposals is α_opt = 0.234 (Rosenthal, 2011)

after rejuvenation, Rao-Blackwellized estimator α̂_i available by averaging MH acceptance probabilities

tune ν² by

$$\nu^2_{i+1} = \nu^2_i + \lambda_i (\hat{\alpha}_i - \alpha_{\mathrm{opt}})$$

for non-increasing λ_1, …, λ_T
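The update rule above is a one-liner in code. The function name and the example numbers below are hypothetical; the acceptance probabilities would come from the MH rejuvenation step of the current iteration.

```python
import numpy as np

def update_scale(nu2, acceptance_probs, learning_rate, alpha_opt=0.234):
    """One stochastic approximation step for the proposal scale nu^2.

    alpha_hat averages the MH acceptance probabilities over all particles.
    """
    alpha_hat = float(np.mean(acceptance_probs))
    return nu2 + learning_rate * (alpha_hat - alpha_opt), alpha_hat

# Acceptance is above the optimum here, so the scale is increased.
new_nu2, alpha_hat = update_scale(1.0, [0.5, 0.5], learning_rate=0.1)
# Acceptance is below the optimum here, so the scale is decreased.
smaller_nu2, _ = update_scale(1.0, [0.1, 0.1], learning_rate=0.1)
```

The additive update does not by itself keep ν² positive; clipping or adapting on the log scale are possible safeguards, though neither is stated on the slides.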


Section 5

Evaluation


Synthetic nonlinear target (Banana)

Synthetic target: Banana distribution in 8 dimensions, i.e. Gaussian with twisted second dimension

Compare performance of Random-Walk rejuvenation with asymptotically optimal scaling (ν = 2.38/√d), ASMC and KASS with Gaussian RBF kernel

Fixed learning rate of λ = 0.1 to adapt scale parameter using stochastic approximation

Geometric bridge of length 20

30 Monte Carlo runs

Report Maximum Mean Discrepancy (MMD) using polynomial kernel of order 3: distance of moments up to order 3 between ground truth samples and samples produced by each method
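The MMD criterion can be sketched as follows. This is a simple biased (V-statistic) estimate of the squared MMD; the kernel offset of 1 and the choice of estimator are assumptions, since the slides only specify a polynomial kernel of order 3.

```python
import numpy as np

def polynomial_kernel(a, b, degree=3, offset=1.0):
    """Polynomial kernel matrix (a_i . b_j + offset)^degree."""
    return (a @ b.T + offset) ** degree

def mmd_squared(x, y, kernel=polynomial_kernel):
    """Biased estimate of the squared MMD between sample sets x and y."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

rng = np.random.default_rng(6)
reference = rng.standard_normal((1000, 2))          # stand-in for ground truth samples
same_law = rng.standard_normal((1000, 2))           # second sample from the same law
shifted = rng.standard_normal((1000, 2)) + 2.0      # sample with a different mean
mmd_same = mmd_squared(reference, same_law)
mmd_shifted = mmd_squared(reference, shifted)
print(mmd_same, mmd_shifted)
```

Because the order-3 polynomial kernel's feature map consists of monomials up to degree 3, this quantity compares exactly the mixed moments up to order 3.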


[Plot: x-axis: population size (0 to 600); legend: KASS, RW-SMC, ASMC]

Figure: Improved convergence of all mixed moments up to order 3 of KASS compared to ASMC and RW-SMC.

Sensor network localization

Applied problem: infer locations of S = 3 sensors in a sensor network measuring distance to each other

Known position for B = 2 base sensors

Measurements successful with probability decaying exponentially in squared distance (otherwise unobserved)

$$Z_{i,j} \sim \mathrm{Binom}\left(1, \exp\left(-\frac{\|x_i - x_j\|^2}{2 \cdot 0.3^2}\right)\right)$$

Measurements corrupted by Gaussian noise

$$Y_{i,j} \begin{cases} \sim \mathcal{N}(\|x_i - x_j\|, 0.02) & \text{if } Z_{i,j} = 1 \\ = 0 & \text{else} \end{cases}$$
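A sketch of how data could be simulated from this observation model. Two points are assumptions rather than statements from the slides: the value 0.02 is treated as the noise standard deviation, and one measurement is drawn per unordered sensor pair. The function name and the random sensor positions are hypothetical.

```python
import numpy as np

def simulate_measurements(positions, rng, radius=0.3, noise_std=0.02):
    """Simulate pairwise distance measurements for all unordered sensor pairs.

    Returns the pair indices, true distances, indicators Z and measurements Y.
    """
    i, j = np.triu_indices(positions.shape[0], k=1)
    dist = np.linalg.norm(positions[i] - positions[j], axis=1)
    # Z = 1 with probability decaying exponentially in the squared distance
    p_observed = np.exp(-dist ** 2 / (2.0 * radius ** 2))
    z = rng.binomial(1, p_observed)
    # Y is a noisy distance when observed, and 0 otherwise
    y = np.where(z == 1, rng.normal(dist, noise_std), 0.0)
    return (i, j), dist, z, y

rng = np.random.default_rng(4)
# Five sensors in the unit square: S = 3 unknown plus B = 2 base sensors.
positions = rng.uniform(0.0, 1.0, size=(5, 2))
pairs, dist, z, y = simulate_measurements(positions, rng)
```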


run KASS and ASMC with geometric bridge of length 50 and 10,000 particles, fixed learning rate λ_i = 1

run KAMH for 50 · 10,000 iterations, discard first half as burn-in, diminishing adaptation λ_i = 1/

initialize both algorithms with samples from prior

qualitative comparison of KASS and closest adaptive MCMC algorithm KAMH

Sensor network localization: KAMH adaptive MCMC

[Plot: MCMC (KAMH)]

Figure: Posterior samples of unknown sensor locations (in color) by KAMH. Set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in posterior.

Sensor network localization: KASS adaptive SMC

[Plot: SMC (KASS)]

Figure: Posterior samples of unknown sensor locations (in color) by KASS. Set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in posterior.

MCMC algorithm not able to traverse all the modes without special care (e.g. Wormhole HMC by Lan et al., 2014)

KASS and ASMC perform similarly in this setup

with S = 2 (higher uncertainty), 1000 particles MMD of

0.76 ± 0.4 for KASS

0.94 ± 0.7 for ASMC

Evidence approximation for intractable likelihoods

in classification using Gaussian Processes (GP), logistic transformation renders likelihood intractable

likelihood can be unbiasedly estimated using Importance Sampling from EP approximation

estimate model evidence when using ARD kernel in the GP

particularly hard because noisy likelihoods mean noisy importance weights

ground truth by averaging evidence estimate over 20 long running SMC algorithms

Evidence approximation for intractable likelihoods

Figure: Ground truth in red, KASS in blue, ASMC in green.


Section 6

Conclusion


Conclusion (1)

Developed Kernel Adaptive SMC sampler for static models

KASS exploits local covariance of target through RKHS-informed rejuvenation proposals

combines these with general SMC advantages for multimodal targets and evidence estimation

especially attractive when likelihoods are intractable

Conclusion (2)

evaluated on a strongly twisted Banana where it was clearly better than ASMC

KASS enables exploring multiple modes in nonlinear sensor network localization

KASS exhibits less variance than ASMC in evidence estimation for GP classification

evidence approximation even in case of intractable likelihoods

Thanks!


Literature I

Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18(November):343–373.

Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.

Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC²: an efficient algorithm for sequential analysis of state-space models. 0(1):1–27.

Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, (2):411–438.

Givens, G. H. and Raftery, A. E. (1996). Local Adaptive Importance Sampling for Multivariate Densities with Strong Nonlinear Relationships. Journal of the American Statistical Association, 91(433):132–141.

Literature II

Lan, S., Streets, J., and Shahbaba, B. (2014). Wormhole Hamiltonian Monte Carlo. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Neal, R. (1998). Annealed Importance Sampling. Technical report, University of Toronto.

Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.

Sejdinovic, D., Strathmann, H., Lomeli, M. G., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. In International Conference on Machine Learning (ICML), pages 1665–1673.

Literature III

Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. pages 1–39.
