
pSVGD: Projected Stein Variational Gradient Descent
A fast and scalable Bayesian inference method in high dimensions

Data-informed Intrinsic Low-dimensionality

Bayesian Inference and SVGD

References

P. Chen, O. Ghattas. Projected Stein variational gradient descent. Advances in Neural Information Processing Systems, 2020.

P. Chen, K. Wu, O. Ghattas. Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities. arXiv:2011.01058, 2020.

P. Chen, K. Wu, J. Chen, T. O'Leary-Roseberry, O. Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. Advances in Neural Information Processing Systems, 2019.

O. Zahm, T. Cui, K. Law, A. Spantini, Y. Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv:1807.03712, 2018.

Peng Chen and Omar Ghattas {peng, omar}@oden.utexas.edu

Abstract

The curse of dimensionality is a longstanding challenge for Bayesian inference in high dimensions; in particular, the Stein variational gradient descent (SVGD) method suffers from this curse. To overcome this challenge, we propose the projected SVGD (pSVGD) method, which exploits the intrinsic low dimensionality of the data-informed subspace stemming from the ill-posedness of inference problems. We adaptively construct the subspace using a gradient information matrix of the log-likelihood and apply pSVGD to the much lower-dimensional coefficients of the parameter projection. The method is demonstrated to be more accurate and efficient than SVGD, and scalable with respect to the number of parameters, samples, data points, and processor cores.

$$ \underbrace{p(x)}_{\text{posterior}} = \frac{1}{Z}\,\underbrace{f(x)}_{\text{likelihood}}\,\underbrace{p_0(x)}_{\text{prior}}. $$

Bayes’ rule for parameter $x \in \mathbb{R}^d$.
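Since the SVGD update below only needs the gradient of the log-posterior, the normalizing constant $Z$ never has to be evaluated. A minimal NumPy sketch of this point, with hypothetical callables `grad_log_likelihood` and `grad_log_prior` standing in for the model-specific gradients:

```python
import numpy as np

def grad_log_posterior(X, grad_log_likelihood, grad_log_prior):
    """Score of the unnormalized posterior: grad_x log p(x) = grad_x log f(x) + grad_x log p_0(x).
    The constant Z drops out under the gradient. X is an (N, d) array of particles;
    both callables (hypothetical names) return (N, d) arrays of gradients."""
    return grad_log_likelihood(X) + grad_log_prior(X)

# Example with a standard Gaussian prior and a toy Gaussian likelihood centered at data y:
# grad_log_prior = lambda X: -X
# grad_log_likelihood = lambda X: -(X - y) / sigma**2
```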

$$ x_m^{\ell+1} = x_m^{\ell} + \epsilon_\ell\, \hat\phi^*_\ell(x_m^{\ell}), \quad m = 1, \dots, N, \ \ \ell = 0, 1, \dots, $$

$$ \hat\phi^*_\ell(x_m^{\ell}) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{x_n^{\ell}} \log p(x_n^{\ell})\, k(x_n^{\ell}, x_m^{\ell}) + \nabla_{x_n^{\ell}} k(x_n^{\ell}, x_m^{\ell}), $$

SVGD update for prior samples $x_1^0, \dots, x_N^0$, where $\hat\phi^*_\ell(x_m^{\ell})$ is the approximate steepest direction and $k(\cdot, \cdot)$ is, e.g., a Gaussian kernel, which collapses in $\mathbb{R}^d$ for large $d$.
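A minimal NumPy sketch of one such update with a Gaussian (RBF) kernel and the common median-distance bandwidth heuristic; the names `svgd_step` and `eps` and the default step size are illustrative, not taken from the authors' implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def svgd_step(X, grad_log_p, eps=1e-2):
    """One SVGD update x_m <- x_m + eps * phi*(x_m) with a Gaussian (RBF) kernel.

    X          : (N, d) array of particles
    grad_log_p : callable returning the (N, d) array of gradients grad_x log p(x_n)
    eps        : step size (illustrative default)
    """
    N, _ = X.shape
    sq_dists = squareform(pdist(X, "sqeuclidean"))        # ||x_n - x_m||^2
    h = np.median(sq_dists) / np.log(N + 1) + 1e-12       # median-trick bandwidth heuristic
    K = np.exp(-sq_dists / h)                             # k(x_n, x_m)
    G = grad_log_p(X)                                     # rows: grad_x log p(x_n)
    # sum_n grad_{x_n} k(x_n, x_m) = (2/h) * (x_m * sum_n k_nm - sum_n k_nm x_n)
    grad_K = (X * K.sum(axis=0)[:, None] - K.T @ X) * (2.0 / h)
    phi = (K.T @ G + grad_K) / N                          # approximate steepest direction
    return X + eps * phi

# e.g. particles targeting a standard Gaussian: X = svgd_step(X, lambda X: -X)
```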

Experiment I: PDE-constrained Inference | Experiment II: Inference of COVID-19

[Figure (Experiment II): coupled compartmental model for two subpopulations $i = 1, 2$, each with susceptible (S), exposed (E), infectious (I), hospitalized (H), recovered (R), and deceased (D) compartments, linked through contacts (C); transition rates $\beta_i$, $\sigma_i$, $(1-\pi_i)\gamma_{I_i}$, $\pi_i\eta_i$, $\nu_i\mu_i$, $(1-\nu_i)\gamma_{H_i}$, and contact fractions $1-\alpha_i$.]

Gradient information matrix of the log-likelihood:

$$ H = \int_{\mathbb{R}^d} (\nabla_x \log f(x))(\nabla_x \log f(x))^T\, p(x)\, dx \approx \sum_{m=1}^{N} \nabla_x \log f(x_m)\,(\nabla_x \log f(x_m))^T, $$

where the samples $x_1, \dots, x_N$ are adaptively updated at suitable iterations.

Dimension Reduction by Projection

Given the prior covariance $\Gamma$, solve the generalized eigenvalue problem

$$ H \psi_i = \lambda_i \Gamma \psi_i, $$

with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r \ge \cdots \ge 0$.
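A sketch of how $H$ and the generalized eigenproblem might be assembled, assuming the per-sample gradients of the log-likelihood are already stacked in an (N, d) array; SciPy's symmetric generalized eigensolver is used here, which returns $\Gamma$-orthonormalized eigenvectors, one common normalization choice:

```python
import numpy as np
from scipy.linalg import eigh

def data_informed_basis(grad_log_f, Gamma, r):
    """Assemble the gradient information matrix and solve H psi = lambda * Gamma * psi.

    grad_log_f : (N, d) array, row m = grad_x log f(x_m) at the current particles
    Gamma      : (d, d) prior covariance (symmetric positive definite)
    r          : number of retained eigenpairs
    """
    H = grad_log_f.T @ grad_log_f            # sum_m grad log f(x_m) grad log f(x_m)^T
    lam, Psi = eigh(H, Gamma)                # generalized eigenproblem, ascending eigenvalues
    idx = np.argsort(lam)[::-1][:r]          # keep the r largest
    return lam[idx], Psi[:, idx]             # lambda_1 >= ... >= lambda_r, basis Psi_r
```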

We project the high-dimensional parameter to the subspace $X_r = \mathrm{span}(\psi_1, \dots, \psi_r)$,

$$ P_r x := \sum_{i=1}^{r} \psi_i \psi_i^T x = \Psi_r w, \quad \forall x \in \mathbb{R}^d, $$

and define a projected posterior with profile function $g(P_r x)$,

$$ p_r(x) := \frac{1}{Z_r}\, g(P_r x)\, p_0(x). $$

The optimal profile function, in Kullback-Leibler divergence, is bounded by the eigenvalues:

$$ g^*(P_r x) = \int_{X_\perp} f(P_r x + \xi)\, p_0^{\perp}(\xi \,|\, P_r x)\, d\xi \;\Longrightarrow\; D_{\mathrm{KL}}(p \,\|\, p_r) \ge D_{\mathrm{KL}}(p \,\|\, p_r^*) \le \frac{\gamma}{2} \sum_{i=r+1}^{d} \lambda_i, $$

where the conditional and marginal densities in the complement subspace $X_\perp$ are

$$ p_0^{\perp}(\xi \,|\, P_r x) = p_0(P_r x + \xi)\,/\,p_0^{r}(P_r x), \quad \text{with} \quad p_0^{r}(P_r x) = \int_{X_\perp} p_0(P_r x + \xi)\, d\xi. $$
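The eigenvalue tail in this bound suggests a simple truncation rule: keep the smallest $r$ whose discarded eigenvalues sum to less than a tolerance, and take the coefficients $w = \Psi_r^T x$ as in the projection formula above. A sketch, with `tol` an illustrative threshold:

```python
import numpy as np

def choose_rank(lam, tol=1e-2):
    """Smallest r such that sum_{i > r} lambda_i <= tol (lam sorted in descending order),
    mirroring the eigenvalue tail that controls the KL bound above."""
    lam = np.asarray(lam)
    tails = np.concatenate([np.cumsum(lam[::-1])[::-1], [0.0]])  # tails[r] = sum of lam[r:]
    return int(np.argmax(tails <= tol))

def project_coefficients(X, Psi_r):
    """Coefficients w_m = Psi_r^T x_m for a batch of particles X of shape (N, d)."""
    return X @ Psi_r  # shape (N, r)
```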

SVGD vs pSVGD

[Figure: schematic comparison of SVGD and pSVGD, drawn as layered diagrams from input layer to output layer. pSVGD alternates Projection, SVGD in the subspace $X_r^{\ell}$, and Prolongation, then repeats with the adaptively updated subspace $X_r^{\ell+1}$.]

$$ \underbrace{\pi(w)}_{\text{posterior}} = \frac{1}{Z_w}\, \underbrace{g(\Psi_r w)}_{\text{likelihood}}\, \underbrace{\pi_0(w)}_{\text{prior}}. $$

Bayes’ rule for the projection coefficient $w \in \mathbb{R}^r$ in the data-informed dominant subspace.

$$ w_m^{\ell+1} = w_m^{\ell} + \epsilon_\ell\, \hat\phi^*_\ell(w_m^{\ell}), \quad m = 1, \dots, N, \ \ \ell = 0, 1, \dots, $$

$$ \hat\phi^*_\ell(w_m^{\ell}) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{w_n^{\ell}} \log \pi(w_n^{\ell})\, k_r(w_n^{\ell}, w_m^{\ell}) + \nabla_{w_n^{\ell}} k_r(w_n^{\ell}, w_m^{\ell}), $$

SVGD update for prior samples $w_1^0, \dots, w_N^0$, where $\hat\phi^*_\ell(w_m^{\ell})$ is the approximate steepest direction, $k_r(\cdot, \cdot)$ is the projected kernel in $\mathbb{R}^r$, and

$$ \nabla_w \log \pi(w) = \Psi_r^T\, \nabla_x \log p_r(P_r x). $$
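A sketch of one pSVGD iteration in the coefficient space, reusing the `svgd_step` sketch above so that the kernel is evaluated in $\mathbb{R}^r$; it assumes an orthonormal basis `Psi_r` (so prolongation is $\Psi_r w$ plus the frozen complement component) and approximates $\nabla_x \log p_r$ by the full posterior gradient `grad_log_p`, a simplification rather than the authors' exact treatment of the profile function:

```python
import numpy as np
# assumes svgd_step from the SVGD sketch above is in scope

def psvgd_step(X, grad_log_p, Psi_r, eps=1e-2):
    """One pSVGD step: project to coefficients, run the SVGD update in R^r
    (so its kernel plays the role of the projected kernel k_r), then prolong
    back to R^d with the complement component held fixed.

    X          : (N, d) particles in the full space
    grad_log_p : callable, (N, d) -> (N, d) posterior gradients (stands in for grad log p_r)
    Psi_r      : (d, r) basis of the data-informed subspace, assumed orthonormal
    """
    W = X @ Psi_r                       # coefficients w_m = Psi_r^T x_m, shape (N, r)
    X_perp = X - W @ Psi_r.T            # frozen complement components
    # chain rule: grad_w log pi(w) = Psi_r^T grad_x log p_r(P_r x)
    grad_log_pi = lambda W_: grad_log_p(W_ @ Psi_r.T + X_perp) @ Psi_r
    W_new = svgd_step(W, grad_log_pi, eps)
    return W_new @ Psi_r.T + X_perp     # prolongation back to the full space
```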

PDE model with random field:

$$ -\nabla \cdot (e^{x}\, \nabla u) = 0 \ \ \text{in } (0,1)^2, \qquad x \sim \mathcal{N}(0, \mathcal{C}), \quad \mathcal{C} = (-0.1\,\Delta + I)^{-2}. $$

ODE model with stochastic process:

$$ \alpha_i(t) = \tfrac{1}{2}\left(\tanh(g_i(t)) + 1\right), \qquad g_i(t) \sim \mathcal{N}(g_i^*, \mathcal{C}_g), \quad \mathcal{C}_g = (-s_g\,\Delta_t + s_I\, I)^{-1}. $$
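A sketch of one prior draw of the time-varying rate $\alpha_i(t)$, discretizing the covariance $\mathcal{C}_g$ on a uniform time grid with a simple finite-difference Laplacian; the grid size, boundary treatment, and scalings `s_g`, `s_I` are illustrative assumptions, not the paper's calibrated values:

```python
import numpy as np

def sample_alpha(T=100, s_g=1.0, s_I=1.0, g_star=0.0, seed=0):
    """One prior draw of alpha(t) = (tanh(g(t)) + 1) / 2 with
    g ~ N(g*, (-s_g * Lap + s_I * I)^{-1}) on a uniform grid of T points."""
    rng = np.random.default_rng(seed)
    # 1D finite-difference Laplacian (grid spacing absorbed into s_g), natural boundary rows
    main = -2.0 * np.ones(T)
    main[0] = main[-1] = -1.0
    Lap = np.diag(main) + np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)
    C_g = np.linalg.inv(-s_g * Lap + s_I * np.eye(T))       # covariance of g on the grid
    g = rng.multivariate_normal(g_star * np.ones(T), C_g)   # Gaussian-process draw of g(t)
    return 0.5 * (np.tanh(g) + 1.0)                         # alpha(t), bounded in (0, 1)
```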