Transport methods for nonlinear filtering and likelihood-free inference

Youssef Marzouk, joint work with Ricardo Baptista, Alessio Spantini, & Olivier Zahm

Department of Aeronautics and Astronautics
Center for Computational Science and Engineering
Statistics and Data Science Center
Massachusetts Institute of Technology

http://uqgroup.mit.edu

Sea Ice Modeling and Data Assimilation (SIMDA) Seminar
6 October 2020
Marzouk SIMDA seminar 1 / 43
Main ideas #1
▶ Online Bayesian inference in dynamical models with sequential data is central to our proposed work
  ▶ Data assimilation, in the broadest sense
▶ New nonlinear ensemble schemes, based on transport, can provide a consistent approach to inference in general non-Gaussian settings
  ▶ Generalizations of the ensemble Kalman filter (EnKF)
▶ These schemes are methods for likelihood-free inference (LFI) or approximate Bayesian computation (ABC), and have broader utility!
Main ideas #2
▶ Application to high-dimensional and disparate data requires detecting and exploiting low-dimensional structure
  ▶ Many varieties of structure, e.g.: sparsity and conditional independence (especially given disparate data sources), smoothness/low rank, piecewise smoothness, hierarchical low rank, multiscale structure
▶ Structure in parameters/state and in data!
  ▶ Relates also to data-driven discretizations of inverse problems
▶ These new inference methods can also facilitate optimal experimental design
Deterministic couplings of probability measures
[Figure: the exact map T pushes the target measure π(θ) to the standard Gaussian reference p(r), while an approximate map T̃ captures only some of the structure in π, producing an approximation p̃ to the reference Gaussian.]

Core idea

▶ Choose a reference distribution η (e.g., standard Gaussian)
▶ Seek a transport map T : R^n → R^n such that T♯η = π
▶ Equivalently, find S = T⁻¹ such that S♯π = η
▶ In principle, enables exact (independent, unweighted) sampling!
▶ Satisfying these conditions only approximately can still be useful!
Choice of transport map
Consider the triangular Knothe–Rosenblatt rearrangement on R^d:

  S(x) = ( S^1(x_1), S^2(x_1, x_2), …, S^d(x_1, x_2, …, x_d) )

1. A unique S such that S♯π = η exists under mild conditions on π and η
2. The map is easily invertible, and the Jacobian ∇S is simple to evaluate
3. Monotonicity is essentially one-dimensional: ∂x_k S^k > 0
4. Each component S^k characterizes one marginal conditional:

  π(x) = π(x_1) π(x_2|x_1) ⋯ π(x_d | x_1, …, x_{d−1})
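A minimal numerical sketch of these ideas (not code from the talk): in the Gaussian case the Knothe–Rosenblatt map is linear, and its matrix is the Cholesky factor of the target covariance, so both the direct map and its lower-triangular inverse can be checked directly. The covariance `Sigma` is an assumed toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # assumed target covariance (toy example)
L = np.linalg.cholesky(Sigma)           # lower-triangular "square root" of Sigma

# Direct map T: reference eta = N(0, I) -> target pi = N(0, Sigma)
z = rng.standard_normal((100_000, 2))   # samples from the reference
x = z @ L.T                             # T(z) = L z, applied row-wise

# Inverse map S = T^{-1}: target -> reference, also lower triangular
S = np.linalg.inv(L)
z_back = x @ S.T

print(np.round(np.cov(x.T), 2))         # ~ Sigma
print(np.round(np.cov(z_back.T), 2))    # ~ identity
```

Both maps are lower triangular and monotone (positive diagonal), so each component depends only on the first k variables, exactly as in the rearrangement above.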
Ubiquity of triangular maps

Many "flows" proposed in ML are special cases of triangular maps, e.g.,

▶ NICE: nonlinear independent component estimation [Dinh et al. 2015]

  S^k(x_1, …, x_k) = x_k + µ^k(x_1, …, x_{k−1})
How to construct triangular maps?

Some past work: "maps from densities," i.e., variational inference with the direct map T [Moselhy & M 2012]

  min_{T ∈ T^h_△} D_KL( T♯η ‖ π ) = min_{T ∈ T^h_△} D_KL( η ‖ T⁻¹♯ π )

▶ π is the "target" density on R^n; η is, e.g., N(0, I_n)
▶ T^h_△ is a set of monotone lower triangular maps
  ▶ T^{h→∞}_△ contains the Knothe–Rosenblatt rearrangement
▶ Expectation is with respect to the reference measure η
  ▶ Compute via, e.g., Monte Carlo, sparse quadrature
▶ Use unnormalized evaluations of π and its gradients
  ▶ No MCMC or importance sampling
▶ In general non-convex, unless π is log-concave
Illustrative example

  min_T E_η[ −log π ∘ T − Σ_k log ∂x_k T^k ]

▶ Parameterized map T ∈ T^h_△ ⊂ T_△
▶ Optimize over coefficients of the parameterization
▶ Use gradient-based optimization
▶ The posterior is in the tail of the reference
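A minimal "maps from densities" sketch (illustrative, not the talk's code): fit a monotone lower-triangular linear map T(z) = Az by minimizing a Monte Carlo estimate of the objective above, for an assumed Gaussian target N(0, Σ). In this case the minimizer among lower-triangular maps with positive diagonal is the Cholesky factor of Σ, which gives a check on the result.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])              # assumed target covariance
Sigma_inv = np.linalg.inv(Sigma)
z = rng.standard_normal((20_000, 2))        # fixed reference samples

def unpack(theta):
    # parameterize A's diagonal through exp() so that each dT^k/dz_k > 0
    a11, a21, a22 = theta
    return np.array([[np.exp(a11), 0.0],
                     [a21, np.exp(a22)]])

def objective(theta):
    # Monte Carlo estimate of E_eta[-log pi(T(z)) - sum_k log dT^k/dz_k] (+ const)
    A = unpack(theta)
    x = z @ A.T                             # T(z) = A z, row-wise
    neg_log_pi = 0.5 * np.einsum('ij,jk,ik->i', x, Sigma_inv, x)
    log_det = np.log(A[0, 0]) + np.log(A[1, 1])   # Jacobian is constant here
    return neg_log_pi.mean() - log_det

res = minimize(objective, x0=np.zeros(3), method='BFGS')
A_hat = unpack(res.x)
print(A_hat)                                # ~ Cholesky factor of Sigma
print(np.linalg.cholesky(Sigma))
```

For a non-Gaussian target one would replace the quadratic `neg_log_pi` with the actual unnormalized negative log-density and use a richer (nonlinear) triangular parameterization.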
How to construct triangular maps?

Alternative formulation: "maps from samples" [Parno PhD thesis, 2014]

▶ Given samples (x^i)_{i=1}^M ∼ π: find components via convex (w.r.t. S^k) constrained minimization:

  min_S D_KL( π ‖ S♯η )  ⇔  min_{S^k : ∂_k S^k > 0} E_π[ ½ S^k(x_{1:k})² − log ∂_k S^k(x_{1:k}) ]  for all k

▶ Approximate E_π given i.i.d. samples from π: KL minimization is equivalent to maximum likelihood estimation

  Ŝ^k ∈ argmin_{S^k ∈ S^h_{△,k}} (1/M) Σ_{i=1}^M ( ½ S^k(x^i_{1:k})² − log ∂_k S^k(x^i_{1:k}) )
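A minimal "maps from samples" sketch (illustrative): fit each component of a linear lower-triangular map S by the sample-average objective above, which is convex in S^k, using only samples from the target π. The data-generating factor `L_true` is an assumed toy example; monotonicity in x_k is enforced by parameterizing the diagonal coefficient through exp().

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
M = 50_000
L_true = np.array([[1.5, 0.0],
                   [0.6, 0.8]])
x = rng.standard_normal((M, 2)) @ L_true.T      # samples from pi = N(0, L L^T)

def fit_component(k):
    # S^k(x_{1:k}) = w . x_{1:k-1} + exp(log_a) * x_k + b   (monotone in x_k)
    def loss(theta):
        *w, log_a, b = theta
        s = x[:, :k] @ np.asarray(w) + np.exp(log_a) * x[:, k] + b
        return 0.5 * np.mean(s**2) - log_a
    return minimize(loss, np.zeros(k + 2), method='BFGS').x

S1 = fit_component(0)    # S^1(x1)
S2 = fit_component(1)    # S^2(x1, x2)

# Push the pi-samples through the fitted map: result should be ~ N(0, I)
z1 = np.exp(S1[-2]) * x[:, 0] + S1[-1]
z2 = S2[0] * x[:, 0] + np.exp(S2[-2]) * x[:, 1] + S2[-1]
z = np.column_stack([z1, z2])
print(np.round(np.cov(z.T), 2))    # ~ identity
```

Note that each component is fit independently, and no evaluations of the density π are needed — only the samples.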
Low-dimensional structure of transport maps

An underlying challenge: maps in high dimensions
▶ Major bottleneck: representation of the map, e.g., cardinality of the map basis
▶ How to make the construction/representation of high-dimensional transports tractable?

Main ideas:
1. Exploit Markov structure of the target distribution
   ▶ Leads to sparsity and/or decomposability of transport maps [Spantini, Bigoni, & M JMLR 2018]
2. Exploit low-rank structure
   ▶ Common in inverse problems. Near-identity or "lazy" maps [Brennan et al. NeurIPS 2020, Zahm et al. 2018]
Remainder of this talk: sequential and "likelihood-free"

Can transport help solve sequential inference problems where key density functions cannot be evaluated?

[image: NCAR]
Formalize as Bayesian filtering

▶ Nonlinear/non-Gaussian state-space model:
  ▶ Transition density π_{Z_k | Z_{k−1}}
  ▶ Observation density (likelihood) π_{Y_k | Z_k}

[Diagram: hidden Markov model Z_0 → Z_1 → Z_2 → ⋯ → Z_N, with observations Y_0, Y_1, Y_2, …, Y_N]

▶ Focus on recursively approximating the filtering distribution: π_{Z_k | y_{0:k}} → π_{Z_{k+1} | y_{0:k+1}} (marginals of the full Bayesian solution)
Problem setting

▶ Consider the filtering of state-space models with:
  1. High-dimensional states
  2. Challenging nonlinear dynamics
  3. Intractable transition kernels: can only obtain forecast samples, i.e., draws from π_{Z_{k+1} | z_k}
  4. Limited model evaluations, e.g., small ensemble sizes
  5. Sparse and local observations

▶ These constraints reflect challenges faced in the sea ice prediction problem…
Ensemble Kalman filter

▶ State-of-the-art results (in terms of tracking) are often obtained with the ensemble Kalman filter (EnKF)

[Diagram: the forecast step takes π_{Z_{k−1} | Y_{0:k−1}} to π_{Z_k | Y_{0:k−1}}; the analysis step (Bayesian inference) takes π_{Z_k | Y_{0:k−1}} to π_{Z_k | Y_{0:k}}]

▶ Move samples via an affine transformation; no weights or resampling!
▶ Yet ultimately inconsistent: does not converge to the true posterior

Can we improve and generalize the EnKF while preserving scalability?
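For reference, a sketch of one "perturbed observation" EnKF analysis step (the standard algorithm, not code from the talk), on an assumed linear-Gaussian toy problem where the exact Kalman posterior is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 200_000                                 # ensemble size (large, to check consistency)
P = np.array([[1.0, 0.5], [0.5, 2.0]])      # assumed forecast covariance
H = np.array([[1.0, 0.0]])                  # observe the first state component
R = np.array([[0.25]])                      # observation-noise covariance
y_star = np.array([1.2])                    # realized observation

X = rng.multivariate_normal(np.zeros(2), P, size=M)             # forecast ensemble
Y = X @ H.T + rng.multivariate_normal(np.zeros(1), R, size=M)   # simulated ("perturbed") data

# Ensemble Kalman gain from joint sample covariances
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
C_xy = Xc.T @ Yc / (M - 1)
C_yy = Yc.T @ Yc / (M - 1)
K = C_xy @ np.linalg.inv(C_yy)

Xa = X + (y_star - Y) @ K.T                 # affine analysis update

# Exact Kalman posterior for reference
K_exact = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
mean_exact = K_exact @ y_star
print(Xa.mean(0), mean_exact)               # should agree closely
```

In the linear-Gaussian case this update is consistent as M → ∞; for nonlinear, non-Gaussian problems it is the affine restriction that the transport-map construction below generalizes.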
Assimilation step

At any assimilation time k, we have a Bayesian inference problem:

[Diagram: samples x^i from the prior π_X are transformed into samples from the posterior π_{X | Y=y*}]

▶ π_X is the forecast distribution on R^n
▶ π_{Y|X} is the likelihood of the observations Y ∈ R^d
▶ π_{X | Y=y*} is the filtering distribution for a realization y* of the data

Goal: sample the posterior given only prior samples x^1, …, x^M and the ability to simulate data y^i | x^i
Inference as transportation of measure

▶ Seek a map T that pushes forward the prior to the posterior:

  (x^1, …, x^M) ∼ π_X  ⟹  (T(x^1), …, T(x^M)) ∼ π_{X | Y=y*}

▶ The map induces a coupling between prior and posterior measures

[Diagram: the transport map sends prior samples x^i to posterior samples T(x^i)]

How to construct a "good" coupling from very few prior samples?
Consider the joint distribution of state and observations

[Diagram: a map T(y, x) from the joint distribution π_{Y,X} to the posterior π_{X | Y=y*}]

▶ Construct a map T from the joint distribution π_{Y,X} to the posterior
▶ T can be computed via convex optimization given samples from π_{Y,X}
▶ Sample π_{Y,X} using the forecast ensemble and the likelihood: (y^i, x^i) with y^i ∼ π_{Y | X=x^i}
▶ Intuition: a generalization of the "perturbed observation" EnKF
Couple the joint distribution with a standard normal

[Diagram: T : R^{d+n} → R^n maps the joint π_{Y,X} to the posterior π_{X | Y=y*}]

We can find T by computing a Knothe–Rosenblatt (KR) rearrangement S : R^{d+n} → R^{d+n} between π_{Y,X} and N(0, I_{d+n})

▶ We will show how to derive T from S…
Triangular maps enable conditional simulation

  S(x_1, …, x_m) = ( S^1(x_1), S^2(x_1, x_2), …, S^m(x_1, x_2, …, x_m) )

▶ Each component S^k links marginal conditionals of π and η
▶ For instance, if η = N(0, I), then for all (x_1, …, x_{k−1}) ∈ R^{k−1},

  ξ ↦ S^k(x_1, …, x_{k−1}, ξ) pushes π_{X_k | X_{1:k−1}}(· | x_{1:k−1}) to N(0, 1)

▶ Simulate the conditional π_{X_k | X_{1:k−1}} by inverting the 1-D map ξ ↦ S^k(x_{1:k−1}, ξ) at Gaussian samples (this needs the triangular structure)
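A sketch of conditional simulation with a triangular map, in the Gaussian case where the KR map is available in closed form (S = L⁻¹, with L the Cholesky factor of an assumed target covariance):

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.7],
                  [0.7, 2.0]])                 # assumed target covariance
S = np.linalg.inv(np.linalg.cholesky(Sigma))   # lower-triangular KR map: S#pi = N(0, I)

x1 = 0.5                                       # condition on X1 = 0.5
xi = rng.standard_normal(200_000)              # N(0, 1) samples

# Invert the 1-D map  xi = S^2(x1, x2) = S[1,0]*x1 + S[1,1]*x2  for x2
x2 = (xi - S[1, 0] * x1) / S[1, 1]             # samples from pi(x2 | x1)

# Exact Gaussian conditional, for comparison
mean_exact = Sigma[0, 1] / Sigma[0, 0] * x1
var_exact = Sigma[1, 1] - Sigma[0, 1]**2 / Sigma[0, 0]
print(x2.mean(), mean_exact)                   # ~ 0.35
print(x2.var(), var_exact)                     # ~ 1.51
```

For a nonlinear S^k the same idea applies, with the 1-D inversion done numerically (e.g., by bisection), exploiting monotonicity in the last argument.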
Filtering: the analysis map

▶ We are interested in the KR map S that pushes π_{Y,X} to N(0, I_{d+n})
▶ The KR map immediately has a block structure

  S(y, x) = [ S^Y(y) ; S^X(y, x) ],

  which suggests two properties:

  1. S^X pushes π_{Y,X} to N(0, I_n)
  2. ξ ↦ S^X(y*, ξ) pushes π_{X | Y=y*} to N(0, I_n)

▶ The analysis map that pushes π_{Y,X} to π_{X | Y=y*} is then given by

  T(y, x) = S^X(y*, ·)⁻¹ ∘ S^X(y, x)
A likelihood-free inference algorithm with maps

[Diagram: the map T(y, x) takes joint samples (y^i, x^i) from π_{Y,X} to samples from the posterior π_{X | Y=y*}]

Transport map ensemble filter
1. Compute the forecast ensemble x^1, …, x^M
2. Generate samples (y^i, x^i) from π_{Y,X}, with y^i ∼ π_{Y | X=x^i}
3. Build an estimator T̂ of T
4. Compute the analysis ensemble as x^a_i = T̂(y^i, x^i) for i = 1, …, M
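A sketch of one analysis step of this filter with a linear parameterization of S^X (illustrative; on this assumed scalar linear-Gaussian toy problem a linear map is exact, and the update reduces to a perturbed-observation EnKF):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
M = 100_000
y_star = 1.0                          # realized observation
sigma_pr, sigma_obs = 1.0, 0.5        # assumed prior and observation-noise std

x = sigma_pr * rng.standard_normal(M)            # forecast ensemble ~ N(0, 1)
y = x + sigma_obs * rng.standard_normal(M)       # simulated data y^i ~ pi(y | x^i)

# Step 3: fit S^X(y, x) = w*y + exp(log_a)*x + b from joint samples (MLE objective)
def loss(theta):
    w, log_a, b = theta
    s = w * y + np.exp(log_a) * x + b
    return 0.5 * np.mean(s**2) - log_a

w, log_a, b = minimize(loss, np.zeros(3), method='BFGS').x
a = np.exp(log_a)

# Step 4: analysis map  T(y, x) = S^X(y*, .)^{-1} o S^X(y, x)
s = w * y + a * x + b
xa = (s - w * y_star - b) / a          # analysis ensemble

# Exact Gaussian posterior, for comparison
var_post = 1.0 / (1.0 / sigma_pr**2 + 1.0 / sigma_obs**2)
mean_post = var_post * y_star / sigma_obs**2
print(xa.mean(), mean_post)            # ~ 0.8
print(xa.var(), var_post)              # ~ 0.2
```

Replacing the linear parameterization of S^X with a nonlinear (e.g., RBF) one is what turns this into the nonlinear ensemble filter of the talk.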
Estimator for the analysis map

▶ Recall the form of S:

  S(y, x) = [ S^Y(y) ; S^X(y, x) ],   S♯ π_{Y,X} = N(0, I_{d+n})

▶ We propose a simple estimator T̂ of T:

  T̂(y, x) = Ŝ^X(y*, ·)⁻¹ ∘ Ŝ^X(y, x),

  where Ŝ is a maximum likelihood estimator of S
▶ This is simply the earlier "maps from samples" approach!
Map parameterizations

  Ŝ^k ∈ argmin_{S^k ∈ S^h_{△,k}} (1/M) Σ_{i=1}^M ( ½ S^k(x^i)² − log ∂_k S^k(x^i) )

▶ Optimization is not needed for nonlinear separable parameterizations of the form Ŝ^k(x_{1:k}) = α x_k + g(x_{1:k−1}) (just linear regression)
▶ Connection to the EnKF: a linear parameterization of Ŝ^k yields a particular form of EnKF with "perturbed observations"
▶ Choice of approximation space allows control of the bias and variance of Ŝ
  ▶ Richer parameterizations yield less bias, but potentially higher variance
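A sketch of why the separable case needs no nonlinear optimization (toy data and basis assumed): writing g = −αh, the objective becomes ½α²·E[(x_k − h)²] − log α, so h is an ordinary least-squares regression of x_k on features of x_{1:k−1}, and then α = 1/std(residual).

```python
import numpy as np

rng = np.random.default_rng(7)
M = 100_000
x1 = rng.standard_normal(M)
x2 = x1**2 - 1.0 + 0.5 * rng.standard_normal(M)   # assumed non-Gaussian joint target

# (i) least-squares regression of x2 on features of x1 gives h(x1)
Phi = np.column_stack([np.ones(M), x1, x1**2])    # assumed regression basis
coef, *_ = np.linalg.lstsq(Phi, x2, rcond=None)
resid = x2 - Phi @ coef

# (ii) the monotone coefficient has a closed form
alpha = 1.0 / resid.std()

s2 = alpha * resid                   # = S^2(x1, x2) evaluated at the samples
print(alpha)                         # ~ 2 (residual std is 0.5 by construction)
print(s2.mean(), s2.var())           # ~ 0, ~ 1: component pushes pi(x2|x1) to N(0,1)
```

The same two-step recipe applies with richer feature sets (e.g., RBFs), which is what makes the separable nonlinear filter cheap.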
Example: Lorenz-63

Simple example: the three-dimensional Lorenz-63 system

  dX_1/dt = σ(X_2 − X_1)
  dX_2/dt = X_1(ρ − X_3) − X_2
  dX_3/dt = X_1 X_2 − β X_3

▶ Chaotic setting: ρ = 28, σ = 10, β = 8/3
▶ Fully observed, with additive Gaussian observation noise E_j ∼ N(0, 2²)
▶ Assimilation interval ∆t = 0.1
▶ Results computed over 2000 assimilation cycles, following spin-up
▶ Map parameterizations: S^k(x_{1:k}) = Σ_{i≤k} Ψ_i(x_i), with Ψ_i = linear + {RBFs or sigmoids}
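A minimal sketch of the forecast model: integrating the Lorenz-63 system with a fixed-step RK4 scheme (illustrative; the talk does not specify the integrator, and any standard ODE solver works).

```python
import numpy as np

rho, sigma, beta = 28.0, 10.0, 8.0 / 3.0        # chaotic parameter setting

def f(z):
    x1, x2, x3 = z
    return np.array([sigma * (x2 - x1),
                     x1 * (rho - x3) - x2,
                     x1 * x2 - beta * x3])

def rk4_step(z, dt):
    k1 = f(z)
    k2 = f(z + 0.5 * dt * k1)
    k3 = f(z + 0.5 * dt * k2)
    k4 = f(z + dt * k3)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n_steps = 0.01, 10_000    # assimilation interval 0.1 = 10 inner steps
z = np.array([1.0, 1.0, 1.0])
traj = np.empty((n_steps, 3))
for i in range(n_steps):
    z = rk4_step(z, dt)
    traj[i] = z
print(traj.min(0), traj.max(0))   # the orbit stays on the bounded attractor
```

Every 10th state of `traj` would play the role of Z_k in the filtering experiment, with observations Y_k = Z_k + E_k.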
Example: Lorenz-63
Mean “tracking” error vs. ensemble size and choice of map
Example: Lorenz-63
What about comparison to the true Bayesian solution?
“Localize” the map in high dimensions

▶ Regularize the estimator Ŝ of S by imposing sparsity, e.g.,

  Ŝ(x_1, …, x_4) = ( Ŝ^1(x_1), Ŝ^2(x_1, x_2), Ŝ^3(x_1, x_2, x_3), Ŝ^4(x_2, x_3, x_4) ),

  where the dependence of Ŝ^4 on x_1 has been dropped

▶ The sparsity of the kth component of S depends on the sparsity of the marginal conditional function π_{X_k | X_{1:k−1}}(x_k | x_{1:k−1})
▶ Localization heuristic: let each Ŝ^k depend on variables (x_j)_j
Lorenz-96 in chaotic regime (40-dimensional state)

▶ A hard test-case configuration [Bengtsson et al. 2003]:

  dX_j/dt = (X_{j+1} − X_{j−2}) X_{j−1} − X_j + F,   j = 1, …, 40
  Y_j = X_j + E_j,   j = 1, 3, 5, …, 39

▶ F = 8 (chaotic) and E_j ∼ N(0, 0.5) (small noise for the PF)
▶ Time between observations: ∆obs = 0.4 (large)
▶ Results computed over 2000 assimilation cycles, following spin-up
Lorenz-96: “hard” case

▶ The nonlinear filter is ≈ 25% more accurate in RMSE than the EnKF
Lorenz-96: “hard” case
Lorenz-96: non-Gaussian noise

▶ A heavy-tailed noise configuration:

  dX_j/dt = (X_{j+1} − X_{j−2}) X_{j−1} − X_j + F,   j = 1, …, 40
  Y_j = X_j + E_j,   j = 1, 5, 9, 13, …, 37

▶ F = 8 (chaotic) and E_j ∼ Laplace(λ = 1)
▶ Time between observations: ∆obs = 0.1
▶ Results computed over 2000 assimilation cycles, following spin-up
Lorenz-96: non-Gaussian noise
Lorenz-96: details on the filtering approximation

▶ Impose sparsity of the map with a 5-way interaction model (above)
▶ Separable and nonlinear parameterization of each component:

  Ŝ^k(x_{j_1}, …, x_{j_p}, x_k) = ψ(x_{j_1}) + … + ψ(x_{j_p}) + ψ̃(x_k),

  where ψ(x) = a_0 + a_1·x + Σ_{i>1} a_i exp(−(x − c_i)²/σ)
▶ More general parameterizations are of course possible
Lorenz-96: tracking performance of the filter

▶ Simple and localized nonlinearities have a significant impact
Remarks and questions

▶ Nonlinear generalization of the EnKF: move the ensemble members via local nonlinear transport maps; no weights or degeneracy
▶ Learn non-Gaussian features via nonlinear continuous transport and convex optimization
▶ Choice of map basis and sparsity provide regularization (e.g., localization)
▶ In principle, the filter is consistent as S^h_△ is enriched and M → ∞. But what is a good choice of S^h_△ for any fixed ensemble size M?
▶ Are there better estimators than maximum likelihood? What are the finite-sample statistical properties of candidate estimators, and properties of the associated optimization problems?
▶ How to relate map structure/parameterization to the underlying dynamics, observation operators, and data?
Sparsity of triangular maps

Theorem [Spantini et al. 2018]
Conditional independence properties of π (encoded by a graph G) define a lower bound on the sparsity of S such that S♯η = π

[Figure: graph G of π and the corresponding sparsity of S; a state-space model and the corresponding sparsity of S]

Main idea: discover sparse structure using the ATM algorithm
Adaptive transport map (ATM) algorithm (Baptista et al. 2020)
Goal: Approximate map given n i.i.d. samples from π
Greedy enrichment procedureI Look for sparse expansion f (x) =
∑α∈Λ cαψα(x)
I Use tensor-product Hermite functions ψα(x) = Pαj (x) exp(−‖x‖2/2)I Add one element to set of active multi-indices Λt at a timeI Restrict Λt to be downward closedI Search for new features in the reduced margin of Λt
Adaptive transport map (ATM) algorithm (Baptista et al. 2020)
Goal: Approximate map given n i.i.d. samples from π
Initialize Λ0 = ∅ (i.e., f0 = 0)
For t = 0, . . . , m:
1 Find the reduced margin ΛRMt of Λt
2 Add a new feature: Λt+1 = Λt ∪ {α∗t+1} for some α∗t+1 ∈ ΛRMt
3 Update the approximation: ft+1 = argmin over f ∈ span(ψΛt+1) of Lk(f)
In practice, we can choose m (# of features) via cross-validation
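The greedy loop can be sketched in a few lines. This is a schematic, not the ATM implementation: as a stand-in for the map-estimation objective Lk it uses a plain least-squares objective on a scalar regression target, and Hermite polynomials rather than Hermite functions; all helper names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(1)

def psi(alpha, X):
    # Tensor-product probabilists' Hermite polynomial He_alpha at samples X.
    out = np.ones(X.shape[0])
    for j, deg in enumerate(alpha):
        coeffs = np.zeros(deg + 1)
        coeffs[deg] = 1.0
        out *= hermeval(X[:, j], coeffs)
    return out

def fit(active, X, y):
    # Least-squares fit on the active features; returns training MSE.
    Phi = np.column_stack([psi(a, X) for a in active])
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ coef - y) ** 2)

def reduced_margin(active, dim):
    # Multi-indices alpha not in `active` whose backward neighbors
    # (alpha - e_i for every i with alpha_i > 0) all lie in `active`,
    # so that `active | {alpha}` stays downward closed.
    margin = set()
    for a in active:
        for j in range(dim):
            c = tuple(a[k] + (k == j) for k in range(dim))
            if c in active:
                continue
            back = [tuple(c[k] - (k == i) for k in range(dim))
                    for i in range(dim) if c[i] > 0]
            if all(b in active for b in back):
                margin.add(c)
    return margin

# Toy regression target in dimension 2: f(x) = x_0 + 0.3 * x_0^2.
X = rng.standard_normal((2000, 2))
y = X[:, 0] + 0.3 * X[:, 0] ** 2

active = {(0, 0)}  # start from the constant feature; m = 2 greedy steps
for t in range(2):
    best = min(reduced_margin(active, 2), key=lambda a: fit(active | {a}, X, y))
    active.add(best)

err = fit(active, X, y)  # ~0: {1, He_1(x_0), He_2(x_0)} spans the target
```

The greedy criterion here (pick the margin element that most reduces training error) is one simple choice; the stopping index m would be chosen by cross-validation as on the slide.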
Numerical example: Lorenz-96 data
I Distribution of the state at a fixed time, starting from a Gaussian initial condition
[Figures: log-likelihood and number of features m vs. sample size n; sparsity pattern of S at n = 316, alongside the conditional independence structure of the state]
Takeaways:
I ATM discovers conditional independence structure in the state
I A natural semi-parametric method that gradually increases m with n
Back to conditional maps: link to ABC
“Analysis” step of the ensemble filtering scheme is an instance of likelihood-free inference or approximate Bayesian computation (ABC):

I Central idea: we only need to simulate from πY,X in order to construct T̂ and thus draw samples from πX|Y=y∗
I The map T̂ has some remarkable properties. . .
Compare two approaches for posterior sampling
X|y∗ ∼ ŜX(y∗, ·)−1♯ η     vs.     X|y∗ ∼ T̂♯ πY,X, for T̂ = ŜX(y∗, ·)−1 ◦ ŜX(·, ·)
I Propagating the joint prior through composed maps has lower error!
Composed maps for ABC
I Simple and incomplete examples:
I Remove dependence of the mean on y: ŜX(y, x) = x − E[X|Y = y]
I If E[X|Y = y] ≈ c + βy, we have the EnKF
I Remove dependence of the mean and variance on y: ŜX(y, x) = Var(X|Y = y)−1/2 (x − E[X|Y = y])
I The transport map framework offers a much more general set ofpossibilities, which in principle includes the exact transformation
I Current work: alternative optimization objectives (hence map estimators T̂) and tailored parameterizations
I Also useful for optimal Bayesian experimental design!
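The first example above can be made concrete in a scalar linear-Gaussian setting: with ŜX(y, x) = x − (c + βy) for the conditional mean, the composed map becomes xa = x + β(y∗ − y), a stochastic-EnKF-style analysis update. A minimal sketch, with hypothetical variable names (the intercept c cancels in the composition, so only the slope β is needed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint prior samples (Y, X): X ~ N(0, 1) and Y = X + eps, eps ~ N(0, 0.5^2).
M = 200_000
x = rng.normal(0.0, 1.0, M)
y = x + rng.normal(0.0, 0.5, M)
y_star = 1.2  # the realized observation y*

# S_X(y, x) = x - (c + beta*y) removes the (linear) dependence of the mean
# on y; the least-squares slope beta = Cov(X, Y)/Var(Y) is the Kalman gain.
beta = np.cov(x, y)[0, 1] / np.var(y)

# Composed map T = S_X(y*, .)^{-1} o S_X(., .): pushing the joint samples
# through it gives x_a = x + beta*(y* - y), the EnKF-like analysis update.
x_post = x + beta * (y_star - y)

# Linear-Gaussian case, so this should match the exact posterior
# N(0.96, 0.2) up to Monte Carlo error.
print(x_post.mean(), x_post.var())
```

Replacing the linear regression for E[X|Y = y] with a nonlinear (e.g., triangular monotone) map is what generalizes this update beyond the EnKF.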
SIMDA plans, connections, and discussion
I Nonlinear ensemble algorithms for filtering, smoothing, and joint state-parameter inference (quite general)

I Generative modeling with transport maps: will we need to build priors—spatiotemporal statistical models for sea ice thickness—from data?

I Likelihood-free inference problems:
I Are there sea ice problems where we will need to extract conditional relationships from large data sets? Conditional density estimation, conditional simulation.
I High-dimensional and disparate sources of data
I Are there strong/essential non-Gaussianities in the sea ice problem? Tail behaviors?

I Explore cost-accuracy tradeoffs: balance posterior fidelity with computational effort
I Use conditional density estimates in optimal experimental design
Conclusions
Thanks for your attention!
References
I R. Baptista, O. Zahm, Y. Marzouk. “An adaptive transport framework for joint and conditional density estimation.” arXiv:2009.10303, 2020.

I J. Zech, Y. Marzouk. “Sparse approximation of triangular transports on bounded domains.” arXiv:2006.06994, 2020.

I N. Kovachki, R. Baptista, B. Hosseini, Y. Marzouk. “Conditional sampling with monotone GANs.” arXiv:2006.06755, 2020.

I A. Spantini, R. Baptista, Y. Marzouk. “Coupling techniques for nonlinear ensemble filtering.” arXiv:1907.00389, 2020.

I M. Brennan, D. Bigoni, O. Zahm, A. Spantini, Y. Marzouk. “Greedy inference with structure-exploiting lazy maps.” NeurIPS 2020, arXiv:1906.00031.

I O. Zahm, T. Cui, K. Law, A. Spantini, Y. Marzouk. “Certified dimension reduction in nonlinear Bayesian inverse problems.” arXiv:1807.03712, 2019.

I A. Spantini, D. Bigoni, Y. Marzouk. “Inference via low-dimensional couplings.” JMLR 19(66): 1–71, 2018.

I M. Parno, Y. Marzouk. “Transport map accelerated Markov chain Monte Carlo.” SIAM JUQ 6: 645–682, 2018.

I R. Morrison, R. Baptista, Y. Marzouk. “Beyond normality: learning sparse probabilistic graphical models in the non-Gaussian setting.” NeurIPS 2017.