Numerical Integration

Cheng Soon Ong and Marc Peter Deisenroth

December 2020
Marc Deisenroth and Cheng Soon Ong's Tutorial at CM
Setting

[Figure: a function f evaluated at nodes x_1, x_2, x_3, ..., x_N]

- Approximate
  \[ \int_a^b f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n), \quad x \in \mathbb{R} \]
- Nodes x_n and corresponding function values f(x_n)
Numerical integration (quadrature)

[Figure: a function f evaluated at nodes x_1, x_2, x_3, ..., x_N]

Key idea: Approximate f using an interpolating function that is easy to integrate (e.g., a polynomial).
Quadrature approaches

[Figure: a function f evaluated at nodes x_1, x_2, x_3, ..., x_N]

Quadrature      Interpolant               Nodes
Newton–Cotes    low-degree polynomials    equidistant
Gaussian        orthogonal polynomials    roots of polynomial
Bayesian        Gaussian process          user defined
Newton–Cotes Quadrature

Overview

[Figure: f approximated piecewise on equidistant nodes a = x_0, x_1, x_2, ..., x_N = b]

- Equidistant nodes a = x_0, ..., x_N = b partition the interval [a, b]
- Approximate f in each partition with a low-degree polynomial
- Compute the integral for each partition analytically and sum them up
Trapezoidal rule

[Figure: linear interpolation of f between nodes x_{n-1}, x_n, x_{n+1}]

- Partition [a, b] into N segments with equidistant nodes x_n
- Locally linear approximation of f between nodes
Trapezoidal rule (2)

[Figure: trapezoid under f between nodes x_n and x_{n+1}]

- Area of a trapezoid with corners (x_n, x_{n+1}, f(x_{n+1}), f(x_n)):
  \[ \int_{x_n}^{x_{n+1}} f(x)\,dx \approx \frac{h}{2}\big(f(x_n) + f(x_{n+1})\big), \quad h := |x_{n+1} - x_n| \;\text{(distance between nodes)} \]
- Error: O(h^2)
- Full integral:
  \[ \int_a^b f(x)\,dx \approx \frac{h}{2}\big(f_0 + 2f_1 + \dots + 2f_{N-1} + f_N\big), \quad f_n := f(x_n) \]
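The composite trapezoidal rule above takes only a few lines of code; a minimal sketch (the test integrand is illustrative):

```python
import numpy as np

def trapezoid(f, a, b, N):
    """Composite trapezoidal rule with N equidistant segments on [a, b]."""
    x = np.linspace(a, b, N + 1)   # nodes x_0, ..., x_N
    fx = f(x)
    h = (b - a) / N                # distance between nodes
    # h/2 * (f_0 + 2 f_1 + ... + 2 f_{N-1} + f_N)
    return h / 2 * (fx[0] + 2 * fx[1:-1].sum() + fx[-1])

# Example: integrate sin on [0, pi]; the exact value is 2.
approx = trapezoid(np.sin, 0.0, np.pi, 100)
print(abs(approx - 2.0))  # O(h^2) error; halving h roughly quarters it
```

Doubling N (halving h) should reduce the error by about a factor of four, consistent with the O(h^2) error statement.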
Simpson’s rule

[Figure: quadratic interpolation of f through nodes x_{n-1}, x_n, x_{n+1}]

- Partition [a, b] into N segments with equidistant nodes x_n
- Locally quadratic approximation of f connecting triplets (f(x_{n-1}), f(x_n), f(x_{n+1}))
Simpson’s rule (2)

[Figure: quadratic segment of f between x_{n-1} and x_{n+1}]

- Area of a segment:
  \[ \int_{x_{n-1}}^{x_{n+1}} f(x)\,dx \approx \frac{h}{3}\big(f_{n-1} + 4f_n + f_{n+1}\big), \quad h := |x_{n+1} - x_n| \;\text{(distance between nodes)} \]
- Error: O(h^4)
- Full integral (N even):
  \[ \int_a^b f(x)\,dx \approx \frac{h}{3}\big(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \dots + 2f_{N-2} + 4f_{N-1} + f_N\big) \]
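Composite Simpson's rule is equally short; a sketch (the cubic test integrand is illustrative — Simpson's rule is exact for polynomials up to degree 3):

```python
import numpy as np

def simpson(f, a, b, N):
    """Composite Simpson's rule with N (even) equidistant segments on [a, b]."""
    if N % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of segments")
    x = np.linspace(a, b, N + 1)
    fx = f(x)
    h = (b - a) / N
    # h/3 * (f_0 + 4 f_1 + 2 f_2 + ... + 2 f_{N-2} + 4 f_{N-1} + f_N)
    return h / 3 * (fx[0] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum() + fx[-1])

# Exact for cubics: integral of x^3 on [0, 2] is 4
print(simpson(lambda x: x**3, 0.0, 2.0, 2))  # -> 4.0 (up to rounding)
```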
Example

\[ \int_0^1 \exp\!\big(-x^2 - \sin^2(3x)\big)\,dx \]

[Figure: observed function values, true function, and the Simpson's-rule and trapezoidal-rule interpolants]
[Figure: integration error vs. number of nodes (log scale) for the trapezoidal and Simpson's rules]

- Simpson's rule yields better approximations
- Very good approximations are obtained fairly quickly
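The error comparison on this example can be reproduced with plain NumPy; a sketch, using a very fine grid as the reference value (since no closed form is available):

```python
import numpy as np

f = lambda x: np.exp(-x**2 - np.sin(3 * x)**2)
a, b = 0.0, 1.0

def trapezoid(fx, h):
    return h / 2 * (fx[0] + 2 * fx[1:-1].sum() + fx[-1])

def simpson(fx, h):  # len(fx) must be odd (even number of segments)
    return h / 3 * (fx[0] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum() + fx[-1])

# Reference value from a very fine grid
M = 2**16
Z_ref = simpson(f(np.linspace(a, b, M + 1)), (b - a) / M)

for N in (4, 16, 64):
    x = np.linspace(a, b, N + 1)
    fx, h = f(x), (b - a) / N
    print(N, abs(trapezoid(fx, h) - Z_ref), abs(simpson(fx, h) - Z_ref))
# Simpson's error shrinks much faster (O(h^4) vs. O(h^2))
```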
Summary: Newton–Cotes quadrature

- Approximate the integrand between equidistant nodes with a low-degree polynomial (up to degree 6)
- Trapezoidal rule: linear interpolation
- Simpson's rule: quadratic interpolation ⇒ better approximation and smaller integration error

[Figure: observed function values, true function, and the Simpson's-rule and trapezoidal-rule interpolants]
Gaussian Quadrature

Gaussian quadrature

- Named after Carl Friedrich Gauß
- Quadrature scheme that no longer relies on equidistant nodes ⇒ higher accuracy
- Central approximation:
  \[ \int_a^b f(x)\,w(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \]
- Weight function w(x) ≥ 0 (plus some other integration-related properties, which are satisfied if w(x) is a pdf)
- Goal: find nodes x_n and weights w_n so that the approximation error is minimized
Central idea

- Quadrature nodes x_n are the roots of a family of orthogonal polynomials ⇒ nodes are no longer equidistant
- Exact if f is a polynomial of degree ≤ 2N − 1, i.e.,
  \[ \int_a^b f(x)\,w(x)\,dx = \sum_{n=1}^{N} w_n f(x_n) \]
  ⇒ The integral can be computed exactly by evaluating f only N times, at the optimal locations x_n (the roots of an orthogonal polynomial) with the corresponding optimal weights w_n
- More accurate than Newton–Cotes for the same number of evaluations (with some memory overhead)
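The exactness property can be checked directly with Gauss–Legendre quadrature (w(x) = 1 on [−1, 1]); NumPy ships the nodes and weights, and the degree-5 test polynomial below is illustrative:

```python
import numpy as np

N = 3
nodes, weights = np.polynomial.legendre.leggauss(N)  # roots of the degree-3 Legendre polynomial

# f has degree 5 = 2N - 1, so N = 3 evaluations integrate it exactly
f = lambda x: 7 * x**5 - x**4 + 3 * x**2 + 1
approx = np.dot(weights, f(nodes))
exact = -2 / 5 + 2 + 2   # odd term integrates to 0 on [-1, 1]
print(approx, exact)     # both 3.6 (up to rounding)
```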
Example: Gauß–Hermite quadrature

- Solve
  \[ \int f(x)\,\underbrace{\exp(-x^2)}_{w(x)}\,dx = \int f(x)\,\sqrt{\pi}\,\mathcal{N}\big(x \mid 0, \tfrac{1}{2}\big)\,dx = \mathbb{E}_{x \sim \mathcal{N}(0, 1/2)}\big[\sqrt{\pi}\,f(x)\big] \]
- With the change-of-variables trick ⇒ expectation w.r.t. a Gaussian measure:
  \[ \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}[f(x)] \approx \frac{1}{\sqrt{\pi}} \sum_{n=1}^{N} w_n f\big(\sqrt{2}\,\sigma x_n + \mu\big) \]
Example: Gauß–Hermite quadrature (2)

- Follow the general approximation scheme:
  \[ \int f(x)\,\underbrace{\exp(-x^2)}_{w(x)}\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \]
- Nodes x_1, ..., x_N are the roots of the Hermite polynomial
  \[ H_N(x) := (-1)^N \exp(x^2)\,\frac{d^N}{dx^N} \exp(-x^2) \]
- Weights:
  \[ w_n := \frac{2^{N-1} N!\,\sqrt{\pi}}{N^2\,H_{N-1}(x_n)^2} \]
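NumPy provides the Gauss–Hermite nodes and weights, so the Gaussian-expectation formula above is a one-liner; a sketch (µ, σ, and the test function are chosen for illustration):

```python
import numpy as np

def gauss_hermite_expectation(f, mu, sigma, N=20):
    """Approximate E_{x ~ N(mu, sigma^2)}[f(x)] with N-point Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(N)  # roots of H_N and the weights w_n
    # change of variables: E[f] ~ (1/sqrt(pi)) * sum_n w_n f(sqrt(2) sigma x_n + mu)
    return np.dot(w, f(np.sqrt(2) * sigma * x + mu)) / np.sqrt(np.pi)

# E[x^2] under N(1, 2^2) is mu^2 + sigma^2 = 5; exact here since deg(x^2) <= 2N - 1
print(gauss_hermite_expectation(lambda x: x**2, mu=1.0, sigma=2.0))  # -> 5.0 (up to rounding)
```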
Overview (Stoer & Bulirsch, 2002)

\[ \int_a^b w(x)\,f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \]

[a, b]        w(x)               Orthogonal polynomial
[−1, 1]       1                  Legendre polynomials
[−1, 1]       (1 − x²)^{−1/2}    Chebyshev polynomials
[0, ∞)        exp(−x)            Laguerre polynomials
(−∞, ∞)       exp(−x²)           Hermite polynomials
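Each row of this table has a ready-made generator in NumPy; for instance, the Gauss–Laguerre rule handles integrals of the form ∫₀^∞ e^(−x) f(x) dx (the monomial test function is illustrative):

```python
import numpy as np

x, w = np.polynomial.laguerre.laggauss(5)  # roots of the degree-5 Laguerre polynomial

# integral_0^inf e^{-x} x^3 dx = Gamma(4) = 3! = 6; exact since 3 <= 2*5 - 1
print(np.dot(w, x**3))  # -> 6.0 (up to rounding)
```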
Application areas

- Probabilities for rectangular bivariate/trivariate Gaussian and t distributions (Genz, 2004)
- Integrating out (marginalizing) a few hyperparameters in a latent-variable model (INLA; Rue et al., 2009)
- Predictions with a Gaussian process classifier (GPflow; Matthews et al., 2017)
Summary: Gaussian quadrature

- Orthogonal polynomials to approximate f
- Nodes are the roots of the polynomial
- Higher accuracy than Newton–Cotes
- Method of choice for low-dimensional problems (1–3 dimensions)
- Cannot naturally deal with noisy observations
- Only works in low dimensions
- Approaches that scale better with dimensionality:
  - Bayesian quadrature (up to ≈ 10 dimensions)
  - Monte Carlo estimation (high dimensions)
Bayesian Quadrature

Bayesian quadrature: Setting and key idea

\[ Z := \int f(x)\,p(x)\,dx = \mathbb{E}_{x \sim p}[f(x)] \]

- Function f is expensive to evaluate
- Integration in moderate (≤ 10) dimensions
- Deal with noisy function observations

Key idea
- Phrase quadrature as a statistical inference problem ⇒ probabilistic numerics (e.g., Hennig et al., 2015; Briol et al., 2015)
- Estimate a distribution on Z using a dataset D := {(x_1, f(x_1)), ..., (x_N, f(x_N))}
Bayesian quadrature: How it works

\[ Z := \int f(x)\,p(x)\,dx = \mathbb{E}_{x \sim p}[f(x)] \]

- Estimate a distribution on Z using a dataset D := {(x_1, f(x_1)), ..., (x_N, f(x_N))}
- Place a (Gaussian process) prior distribution on f and determine the posterior via Bayes' theorem (Diaconis, 1988; O'Hagan, 1991; Rasmussen & Ghahramani, 2003)
  ⇒ The distribution on f induces a distribution on Z
- Generalizes to noisy function observations y = f(x) + ε

[Figure: GP model (observations, integrand, and model with uncertainty)]
Bayesian quadrature: Details

\[ Z := \int f(x)\,p(x)\,dx, \quad f \sim GP(0, k) \]

- Exploit linearity of the integral (the integral of a GP is another GP):
  \[ p(Z) = p\Big(\int f(x)\,p(x)\,dx\Big) = \mathcal{N}\big(Z \mid \mu_Z, \sigma_Z^2\big) \]
  \[ \mu_Z = \int \mu_{\text{post}}(x)\,p(x)\,dx = \mathbb{E}_x[\mu_{\text{post}}(x)] \]
  \[ \sigma_Z^2 = \iint k_{\text{post}}(x, x')\,p(x)\,p(x')\,dx\,dx' = \mathbb{E}_{x, x'}[k_{\text{post}}(x, x')] \]
Bayesian quadrature: Mean

Setting: Z = ∫ f(x) p(x) dx, f ∼ GP(0, k), p(Z) = N(Z | µ_Z, σ_Z²), training data X, y.

\[ \mathbb{E}_f[Z] = \mu_Z = \underbrace{\mathbb{E}_{x \sim p}[\mu_{\text{post}}(x)]}_{\text{expected predictive mean}} \]
\[ \mu_{\text{post}}(x) = k(x, X)\,\underbrace{K^{-1} y}_{=: \alpha}, \quad K := k(X, X) \]
\[ \mathbb{E}_f[Z] = \underbrace{\int k(x, X)\,p(x)\,dx}_{=: z^\top}\,\alpha = z^\top \alpha, \quad z_n = \int k(x, x_n)\,p(x)\,dx = \mathbb{E}_{x \sim p}[k(x, x_n)] \]
Bayesian quadrature: Variance

\[ \mathbb{V}_f[Z] = \sigma_Z^2 = \underbrace{\mathbb{E}_{x, x' \sim p}[k_{\text{post}}(x, x')]}_{\text{expected posterior covariance}} \]
\[ = \iint \Big( \underbrace{k(x, x')}_{\text{prior covariance}} - \underbrace{k(x, X)\,K^{-1}\,k(X, x')}_{\text{information from training data}} \Big)\,p(x)\,p(x')\,dx\,dx' \]
\[ = \iint k(x, x')\,p(x)\,p(x')\,dx\,dx' - \underbrace{\int k(x, X)\,p(x)\,dx}_{= z^\top}\,K^{-1}\,\underbrace{\int k(X, x')\,p(x')\,dx'}_{= z'} \]
\[ = \mathbb{E}_{x, x'}[k(x, x')] - z^\top K^{-1} z' = \mathbb{E}_{x, x'}[k(x, x')] - \mathbb{E}_x[k(x, X)]\,K^{-1}\,\mathbb{E}_{x'}[k(X, x')] \]
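For an RBF kernel and a standard-normal p(x), the kernel expectations z_n and E[k(x, x′)] have closed forms, so µ_Z and σ_Z² can be computed exactly as above. A minimal 1-D sketch (lengthscale, nodes, and test function are all illustrative choices, not from the slides):

```python
import numpy as np

ell = 1.0                                   # RBF lengthscale
k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Closed-form kernel expectations for p = N(0, 1):
#   z_n = E_x[k(x, x_n)] = ell / sqrt(ell^2 + 1) * exp(-x_n^2 / (2 (ell^2 + 1)))
#   E_{x,x'}[k(x, x')]   = ell / sqrt(ell^2 + 2)
def z_vec(X):
    return ell / np.sqrt(ell**2 + 1) * np.exp(-X**2 / (2 * (ell**2 + 1)))

X = np.linspace(-4, 4, 12)                  # quadrature nodes
y = X**2                                    # noise-free observations of f(x) = x^2
K = k(X, X) + 1e-10 * np.eye(len(X))        # jitter for numerical stability

alpha = np.linalg.solve(K, y)
z = z_vec(X)
mu_Z = z @ alpha                                               # E_f[Z] = z^T K^{-1} y
var_Z = ell / np.sqrt(ell**2 + 2) - z @ np.linalg.solve(K, z)  # V_f[Z]

print(mu_Z)   # close to E_{N(0,1)}[x^2] = 1
print(var_Z)  # small: the GP is confident given 12 well-spread nodes
```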
Computing kernel expectations

\[ \mathbb{E}_{x \sim p}[k(x, X)], \quad \mathbb{E}_{x, x' \sim p}[k(x, x')] \]

- Solve a different (easier) integration problem

Kernel k \ Input distribution p     Gaussian                               non-Gaussian
RBF / polynomial / trigonometric    analytical                             analytical via importance-sampling trick
otherwise                           Monte Carlo (numerical integration)    Monte Carlo (numerical integration)
Kernel expectations in other areas

\[ \mathbb{E}_{x \sim p}[k(x, X)], \quad \mathbb{E}_{x, x' \sim p}[k(x, x')] \]

- Kernel MMD (e.g., Gretton et al., 2012)
- Time-series analysis with Gaussian processes (e.g., Girard et al., 2003)
- Deep Gaussian processes (e.g., Damianou & Lawrence, 2013)
- Model-based RL with Gaussian processes (e.g., Deisenroth & Rasmussen, 2011)

[Figures: illustrations from Gretton et al. (2012), Salimbeni et al. (2019), Girard et al. (2003), and Deisenroth & Rasmussen (2011)]
Iterative procedure: Where to measure f next?

- Define an acquisition function (similar to Bayesian optimization)
- Example: choose the next node x_{n+1} so that the variance of the estimator is reduced maximally (e.g., O'Hagan, 1991; Gunter et al., 2014):
  \[ x_{n+1} = \arg\max_{x_*}\; \underbrace{\mathbb{V}[Z \mid D]}_{\text{current variance}} - \underbrace{\mathbb{E}_{y_*}\big[\mathbb{V}[Z \mid D \cup \{(x_*, y_*)\}]\big]}_{\text{new variance}} \]
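For a noise-free GP the posterior covariance does not depend on the observed values, so the expectation over y_* drops out and this acquisition can be evaluated in closed form: score each candidate x_* by the variance of Z after adding x_* to the node set. A sketch using the RBF kernel and p = N(0, 1) closed-form expectations (all names and values are illustrative):

```python
import numpy as np

ell = 1.0
k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))
z_vec = lambda X: ell / np.sqrt(ell**2 + 1) * np.exp(-X**2 / (2 * (ell**2 + 1)))
prior_term = ell / np.sqrt(ell**2 + 2)      # E_{x,x'}[k(x, x')] for p = N(0, 1)

def var_Z(X):
    """Posterior variance of Z for nodes X (independent of y for noise-free GPs)."""
    K = k(X, X) + 1e-10 * np.eye(len(X))
    z = z_vec(X)
    return prior_term - z @ np.linalg.solve(K, z)

X = np.array([-2.0, 0.0, 2.0])              # current nodes
candidates = np.linspace(-3, 3, 61)
scores = np.array([var_Z(np.append(X, c)) for c in candidates])
x_next = candidates[np.argmin(scores)]      # max variance reduction = min new variance
print(x_next, var_Z(X), scores.min())
```

The chosen node lands in a gap between existing nodes where p(x) still has appreciable mass, which matches the intuition behind the acquisition function.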
Example with EmuKit (Paleyes et al., 2019)

Compute
\[ Z = \int_{-3}^{3} e^{-x^2 - \sin^2(3x)}\,dx \]

- Fit a Gaussian process to observations f(x_1), ..., f(x_n) at nodes x_1, ..., x_n
- Determine p(Z)
- Find and include a new measurement:
  1. Find the optimal node x_{n+1} by maximizing an acquisition function
  2. Evaluate the integrand at x_{n+1}
  3. Update the GP with (x_{n+1}, f(x_{n+1}))
- Compute the updated p(Z)
- Repeat

[Figures: the integrand; the GP model fit to the observations; the densities p(Z) before and after each new measurement, with E[Z] and the true Z]
Summary

- Central approximation:
  \[ \int f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \]
- Newton–Cotes: equidistant nodes x_n, low-degree polynomial approximation of f
- Gaussian quadrature: nodes x_n are the roots of orthogonal polynomials interpolating f
- Bayesian quadrature: integration as a statistical inference problem; global approximation of f using a Gaussian process; scales to moderate dimensions

[Figures: Newton–Cotes interpolation on equidistant nodes; GP model of an integrand]

Numerical integration is a really good idea in low dimensions.
References

Briol, F.-X., Oates, C., Girolami, M., and Osborne, M. A. (2015). Frank–Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems.

Cutler, M. and How, J. P. (2015). Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In Proceedings of the International Conference on Robotics and Automation.

Damianou, A. and Lawrence, N. D. (2013). Deep Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

Deisenroth, M. P., Fox, D., and Rasmussen, C. E. (2015). Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423.

Deisenroth, M. P. and Mohamed, S. (2012). Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, pages 2618–2626.

Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning.

Deisenroth, M. P., Turner, R., Huber, M., Hanebeck, U. D., and Rasmussen, C. E. (2012). Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871.

Diaconis, P. (1988). Bayesian Numerical Analysis. Statistical Decision Theory and Related Topics IV, 1:163–175.

Eleftheriadis, S., Nicholson, T. F. W., Deisenroth, M. P., and Hensman, J. (2017). Identification of Gaussian Process State Space Models. In Advances in Neural Information Processing Systems.

Genz, A. (2004). Numerical Computation of Rectangular Bivariate and Trivariate Normal and t Probabilities. Statistics and Computing, 14:251–260.

Girard, A., Rasmussen, C. E., Quiñonero Candela, J., and Murray-Smith, R. (2003). Gaussian Process Priors with Uncertain Inputs—Application to Multiple-Step Ahead Time Series Forecasting. In Advances in Neural Information Processing Systems.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25):723–773.

Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., and Roberts, S. J. (2014). Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Advances in Neural Information Processing Systems.

Hennig, P., Osborne, M. A., and Girolami, M. (2015). Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471:20150142.

Ko, J. and Fox, D. (2009). GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. Autonomous Robots, 27(1):75–90.

O'Hagan, A. (1991). Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260.

Paleyes, A., Pullin, M., Mahsereci, M., Lawrence, N., and González, J. (2019). Emulation of Physical Processes with Emukit. In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS.

Salimbeni, H. and Deisenroth, M. P. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. In Advances in Neural Information Processing Systems.

Salimbeni, H., Dutordoir, V., Hensman, J., and Deisenroth, M. P. (2019). Deep Gaussian Processes with Importance-Weighted Variational Inference. In Proceedings of the International Conference on Machine Learning.

Stoer, J. and Bulirsch, R. (2002). Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer-Verlag, 3rd edition.