
  • Transport methods for nonlinear filtering and likelihood-free inference

    Youssef Marzouk, joint work with Ricardo Baptista, Alessio Spantini, & Olivier Zahm

    Department of Aeronautics and Astronautics
    Center for Computational Science and Engineering

    Statistics and Data Science Center
    Massachusetts Institute of Technology

    http://uqgroup.mit.edu

    Sea Ice Modeling and Data Assimilation (SIMDA) Seminar

    6 October 2020

    Marzouk SIMDA seminar 1 / 43


  • Main ideas #1

    I Online Bayesian inference in dynamical models with sequential data is central to our proposed work
      I Data assimilation, in the broadest sense

    I New nonlinear ensemble schemes, based on transport, can provide a consistent approach to inference in general non-Gaussian settings
      I Generalizations of the ensemble Kalman filter (EnKF)

    I These schemes are methods for likelihood-free inference (LFI) or approximate Bayesian computation (ABC), and have broader utility!

    Marzouk SIMDA seminar 2 / 43

  • Main ideas #2

    I Application to high-dimensional and disparate data requires detecting and exploiting low-dimensional structure
      I Many varieties of structure, e.g.: sparsity and conditional independence (especially given disparate data sources), smoothness/low rank, piecewise smoothness, hierarchical low rank, multiscale structure

    I Structure in parameters/state and in data!
    I Relate also to data-driven discretizations of inverse problems

    I These new inference methods can also facilitate optimal experimental design

    Marzouk SIMDA seminar 3 / 43

  • Deterministic couplings of probability measures

    [Diagram: a transport map T pushes the reference density η forward to the target density π; the inverse map S = T−1 pushes π back to η]

    Core idea

    I Choose a reference distribution η (e.g., standard Gaussian)
    I Seek a transport map T : Rn → Rn such that T♯η = π
    I Equivalently, find S = T−1 such that S♯π = η
    I In principle, enables exact (independent, unweighted) sampling!
    I Satisfying these conditions only approximately can still be useful!

    Marzouk SIMDA seminar 4 / 43




  • Choice of transport map

    Consider the triangular Knothe–Rosenblatt rearrangement on Rd

    $$ S(x) = \begin{bmatrix} S^1(x_1) \\ S^2(x_1, x_2) \\ \vdots \\ S^d(x_1, x_2, \ldots, x_d) \end{bmatrix} $$

    1 A unique S such that S♯π = η exists under mild conditions on π and η
    2 The map is easily invertible, and the Jacobian ∇S is simple to evaluate
    3 Monotonicity is essentially one-dimensional: ∂xk S^k > 0
    4 Each component S^k characterizes one marginal conditional:

    π(x) = π(x1) π(x2|x1) · · · π(xd|x1, . . . , xd−1)

    Marzouk SIMDA seminar 5 / 43
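    To make property 2 above concrete, here is a minimal sketch, not from the talk, using a hand-picked 2-D monotone triangular map (the components S1 and S2 below are arbitrary illustrations). Inversion proceeds one scalar equation at a time, and the log-determinant of the Jacobian is a sum of diagonal partial derivatives.

```python
# Minimal sketch (not from the talk): a hand-picked 2-D monotone triangular map,
# illustrating why such maps are easy to invert and have cheap Jacobians.
import numpy as np
from scipy.optimize import brentq

# Hypothetical map components: S1 depends only on x1, S2 on (x1, x2),
# and each is strictly increasing in its last argument.
def S1(x1):
    return x1 + 0.5 * np.tanh(x1)

def S2(x1, x2):
    return np.exp(0.3 * x1) * x2 + 0.1 * x1**2

def S(x):
    return np.array([S1(x[0]), S2(x[0], x[1])])

def S_inverse(z, bracket=(-50.0, 50.0)):
    # Triangular structure => invert one scalar equation at a time.
    x1 = brentq(lambda t: S1(t) - z[0], *bracket)
    x2 = brentq(lambda t: S2(x1, t) - z[1], *bracket)
    return np.array([x1, x2])

def log_det_jacobian(x, eps=1e-6):
    # det(∇S) is the product of the diagonal entries ∂x1 S1 and ∂x2 S2
    # (approximated here by central finite differences).
    d1 = (S1(x[0] + eps) - S1(x[0] - eps)) / (2 * eps)
    d2 = (S2(x[0], x[1] + eps) - S2(x[0], x[1] - eps)) / (2 * eps)
    return np.log(d1) + np.log(d2)

x = np.array([0.7, -1.2])
z = S(x)
print(np.allclose(S_inverse(z), x))   # True: sequential 1-D inversion recovers x
print(log_det_jacobian(x))
```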

  • Ubiquity of triangular maps

    Many “flows” proposed in ML are special cases of triangular maps, e.g.,

    I NICE: Nonlinear independent component estimation [Dinh et al. 2015]

    Sk(x1, . . . , xk) = µk(xx

  • Ubiquity of triangular maps

    Many “flows” proposed in ML are special cases of triangular maps, e.g.,

    I NICE: Nonlinear independent component estimation [Dinh et al. 2015]

    Sk(x1, . . . , xk) = µk(xx

  • How to construct triangular maps?

    Some past work: “maps from densities,” i.e., variational inference with the direct map T [Moselhy & M 2012]

    $$ \min_{T \in \mathcal{T}^h_\triangle} D_{KL}(T_\sharp\, \eta \,\|\, \pi) \;=\; \min_{T \in \mathcal{T}^h_\triangle} D_{KL}(\eta \,\|\, T^{-1}_\sharp\, \pi) $$

    I π is the “target” density on Rn; η is, e.g., N(0, In)
    I T^h_△ is a set of monotone lower triangular maps
      I T^{h→∞}_△ contains the Knothe–Rosenblatt rearrangement
    I Expectation is with respect to the reference measure η
      I Compute via, e.g., Monte Carlo, sparse quadrature
    I Use unnormalized evaluations of π and its gradients
      I No MCMC or importance sampling
      I In general non-convex, unless π is log-concave

    Marzouk SIMDA seminar 7 / 43
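    As a bridge to the next slide: expanding the KL objective above with the change-of-variables formula, and using the monotone triangular structure of T (so that det ∇T = ∏_k ∂_{x_k} T^k), gives the sample-ready form used below; the constant collects terms that do not depend on T.

$$ D_{KL}(T_\sharp \eta \,\|\, \pi) = \mathbb{E}_\eta\big[\log \eta - \log|\det \nabla T| - \log \pi \circ T\big] = \mathbb{E}_\eta\Big[-\log \pi \circ T - \sum_k \log \partial_{x_k} T^k\Big] + \text{const.} $$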


  • Illustrative example

    $$ \min_T \; \mathbb{E}_\eta\Big[ -\log \pi \circ T \;-\; \sum_k \log \partial_{x_k} T^k \Big] $$

    I Parameterized map T ∈ T^h_△ ⊂ T△
    I Optimize over coefficients of the parameterization
    I Use gradient-based optimization
    I The posterior is in the tail of the reference

    Marzouk SIMDA seminar 7 / 43
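    A minimal runnable sketch of this "maps from densities" optimization, under stated assumptions: a 1-D skew-normal target known only up to a constant, a cubic monotone map, and plain Monte Carlo for the expectation. None of these choices come from the talk.

```python
# Minimal "maps from densities" sketch (assumptions: 1-D, a cubic monotone map,
# plain Monte Carlo for the expectation). Not the talk's parameterization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)          # samples from the reference eta = N(0, 1)

def log_pi_unnorm(z, alpha=4.0):
    # Unnormalized skew-normal target: pi(z) proportional to phi(z) * Phi(alpha z)
    return -0.5 * z**2 + norm.logcdf(alpha * z)

def T_and_dT(theta, x):
    # Monotone map T(x) = a + exp(b)*x + exp(c)*x^3, so T'(x) > 0 for all x
    a, b, c = theta
    T = a + np.exp(b) * x + np.exp(c) * x**3
    dT = np.exp(b) + 3.0 * np.exp(c) * x**2
    return T, dT

def objective(theta):
    # Monte Carlo estimate of E_eta[ -log pi(T(x)) - log T'(x) ]
    T, dT = T_and_dT(theta, x)
    return np.mean(-log_pi_unnorm(T) - np.log(dT))

res = minimize(objective, x0=np.zeros(3), method="BFGS")
T_opt, _ = T_and_dT(res.x, x)
print(res.x, T_opt.mean(), T_opt.std())   # pushed-forward samples approximate pi
```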


  • How to construct triangular maps?

    Alternative formulation: “maps from samples” [Parno PhD thesis, 2014]

    I Given samples (x^i)_{i=1}^M ∼ π: find components via convex (w.r.t. S^k) constrained minimization:

    $$ \min_S D_{KL}(\pi \,\|\, S^\sharp \eta) \;\Longleftrightarrow\; \min_{S^k:\, \partial_k S^k > 0} \; \mathbb{E}_\pi\Big[ \tfrac{1}{2} S^k(x_{1:k})^2 - \log \partial_k S^k(x_{1:k}) \Big] \quad \forall k $$

    I Approximate Eπ given i.i.d. samples from π: KL minimization is equivalent to maximum likelihood estimation (see the sketch after this slide)

    $$ \hat{S}^k \in \operatorname*{arg\,min}_{S^k \in \mathcal{S}^h_{\triangle,k}} \; \frac{1}{M} \sum_{i=1}^M \Big( \tfrac{1}{2} S^k(x^i_{1:k})^2 - \log \partial_k S^k(x^i_{1:k}) \Big) $$

    Marzouk SIMDA seminar 8 / 43
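    A minimal sketch of the sample-based objective for a single component, under assumptions not in the talk: 2-D "banana"-shaped samples and a separable component S2(x1, x2) = g(x1) + exp(a)·x2 with a quadratic g, so monotonicity in x2 holds by construction.

```python
# Minimal "maps from samples" sketch: fit one lower-triangular component by
# minimizing the sample-average objective on the slide. The basis is an assumption.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = 5000
x1 = rng.standard_normal(M)
x2 = x1**2 + 0.5 * rng.standard_normal(M)          # samples from a non-Gaussian pi
X = np.column_stack([x1, x2])

def S2_and_d2(theta, X):
    # g(x1) is quadratic in x1; monotonicity in x2 is enforced via exp(a) > 0
    c0, c1, c2, a = theta
    S2 = c0 + c1 * X[:, 0] + c2 * X[:, 0]**2 + np.exp(a) * X[:, 1]
    d2 = np.full(len(X), np.exp(a))                 # d S2 / d x2
    return S2, d2

def objective(theta):
    S2, d2 = S2_and_d2(theta, X)
    return np.mean(0.5 * S2**2 - np.log(d2))

res = minimize(objective, x0=np.zeros(4), method="BFGS")
S2, _ = S2_and_d2(res.x, X)
print(S2.mean(), S2.std())   # close to (0, 1): S2 pushes pi(x2 | x1) toward N(0, 1)
```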

  • Low-dimensional structure of transport maps

    An underlying challenge: maps in high dimensions
    I Major bottleneck: representation of the map, e.g., cardinality of the map basis
    I How to make the construction/representation of high-dimensional transports tractable?

    Main ideas:
    1 Exploit Markov structure of the target distribution
      I Leads to sparsity and/or decomposability of transport maps [Spantini, Bigoni, & M JMLR 2018]
    2 Exploit low-rank structure
      I Common in inverse problems; near-identity or “lazy” maps [Brennan et al. NeurIPS 2020, Zahm et al. 2018]

    Marzouk SIMDA seminar 9 / 43


  • Remainder of this talk: sequential and “likelihood-free”

    Can transport help solve sequential inference problems where key density functions cannot be evaluated?

    [image: NCAR]

    Marzouk SIMDA seminar 10 / 43

  • Formalize as Bayesian filtering

    I Nonlinear/non-Gaussian state-space model:
      I Transition density πZk|Zk−1
      I Observation density (likelihood) πYk|Zk

    [Diagram: hidden Markov model with states Z0, Z1, Z2, Z3, . . . , ZN and observations Y0, Y1, Y2, Y3, . . . , YN]

    I Focus on recursively approximating the filtering distribution: πZk|y0:k → πZk+1|y0:k+1 (marginals of the full Bayesian solution)

    Marzouk SIMDA seminar 11 / 43

  • Problem setting

    I Consider the filtering of state-space models with:
      1 High-dimensional states
      2 Challenging nonlinear dynamics
      3 Intractable transition kernels: can only obtain forecast samples, i.e., draws from πZk+1|zk
      4 Limited model evaluations, e.g., small ensemble sizes
      5 Sparse and local observations

    I These constraints reflect challenges faced in the sea ice prediction problem. . .

    Marzouk SIMDA seminar 12 / 43

  • Ensemble Kalman filter

    I State-of-the-art results (in terms of tracking) are often obtained with the ensemble Kalman filter (EnKF)

    [Diagram: πZk−1|Y0:k−1 → (forecast step) → πZk|Y0:k−1 → (analysis step: Bayesian inference) → πZk|Y0:k]

    I Move samples via an affine transformation; no weights or resampling!
    I Yet ultimately inconsistent: does not converge to the true posterior

    Can we improve and generalize the EnKF while preserving scalability?

    Marzouk SIMDA seminar 13 / 43


  • Assimilation step

    At any assimilation time k, we have a Bayesian inference problem:

    [Diagram: prior (forecast) πX with samples xi, and posterior πX|Y=y∗]

    I πX is the forecast distribution on Rn
    I πY|X is the likelihood of the observations Y ∈ Rd
    I πX|Y=y∗ is the filtering distribution for a realization y∗ of the data

    Goal: sample the posterior given only prior samples x1, . . . , xM and the ability to simulate data yi | xi

    Marzouk SIMDA seminar 14 / 43

  • Inference as transportation of measure

    I Seek a map T that pushes forward prior to posterior

    (x1, . . . , xM) ∼ πX =⇒ (T(x1), . . . , T(xM)) ∼ πX|Y=y∗

    I The map induces a coupling between prior and posterior measures

    [Diagram: a transport map sends prior samples xi ∼ πX to posterior samples T(xi) ∼ πX|Y=y∗]

    How to construct a “good” coupling from very few prior samples?

    Marzouk SIMDA seminar 15 / 43

  • Consider the joint distribution of state and observations

    [Diagram: a map T(y, x) sends the joint distribution πY,X to the posterior πX|Y=y∗]

    I Construct a map T from the joint distribution πY,X to the posterior
    I T can be computed via convex optimization given samples from πY,X
    I Sample πY,X using the forecast ensemble and the likelihood: (yi, xi) with yi ∼ πY|X=xi
    I Intuition: a generalization of the “perturbed observation” EnKF

    Marzouk SIMDA seminar 16 / 43

  • Couple the joint distribution with a standard normal

    [Diagram: T : Rd+n → Rn maps the joint πY,X to the posterior πX|Y=y∗]

    We can find T by computing a Knothe–Rosenblatt (KR) rearrangement S between πY,X and N(0, Id+n)

    [Diagram: S : Rd+n → Rd+n maps the joint πY,X to N(0, Id+n)]

    I We will show how to derive T from S . . .

    Marzouk SIMDA seminar 17 / 43

  • Triangular maps enable conditional simulation

    $$ S(x_1, \ldots, x_m) = \begin{bmatrix} S^1(x_1) \\ S^2(x_1, x_2) \\ \vdots \\ S^m(x_1, x_2, \ldots, x_m) \end{bmatrix} $$

    I Each component S^k links marginal conditionals of π and η
    I For instance, if η = N(0, I), then for all x1, . . . , xk−1 ∈ Rk−1,

    ξ ↦ S^k(x1, . . . , xk−1, ξ) pushes πXk|X1:k−1(ξ|x1:k−1) to N(0, 1)

    I Simulate the conditional πXk|X1:k−1 by inverting a 1-D map ξ ↦ S^k(x1:k−1, ξ) at Gaussian samples (need triangular structure)

    Marzouk SIMDA seminar 18 / 43
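    A minimal sketch of this conditional-simulation mechanism, assuming a known (not fitted) triangular map for a 2-D banana-shaped density; each conditional draw solves one scalar equation.

```python
# Minimal conditional-simulation sketch (assumption: a known 2-D triangular map
# for a banana-shaped density, rather than a map fit from data).
import numpy as np
from scipy.optimize import brentq

# For pi(x1, x2) with x1 ~ N(0,1) and x2 | x1 ~ N(x1^2, 0.5^2), one KR map is:
def S1(x1):
    return x1

def S2(x1, x2):
    return (x2 - x1**2) / 0.5          # monotone increasing in x2

def sample_conditional(x1_star, n_samples, rng):
    # Draw xi ~ pi(x2 | x1 = x1_star) by inverting the 1-D map xi -> S2(x1_star, xi)
    # at standard normal samples z.
    zs = rng.standard_normal(n_samples)
    return np.array([brentq(lambda xi: S2(x1_star, xi) - z, -100.0, 100.0) for z in zs])

rng = np.random.default_rng(2)
samples = sample_conditional(x1_star=1.5, n_samples=2000, rng=rng)
print(samples.mean(), samples.std())   # close to (1.5**2, 0.5) = (2.25, 0.5)
```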

  • Filtering: the analysis map

    I We are interested in the KR map S that pushes πY,X to N(0, Id+n)
    I The KR map immediately has a block structure

    $$ S(y, x) = \begin{bmatrix} S^{\mathcal{Y}}(y) \\ S^{\mathcal{X}}(y, x) \end{bmatrix}, $$

    which suggests two properties:

    S^X pushes πY,X to N(0, In)

    ξ ↦ S^X(y∗, ξ) pushes πX|Y=y∗ to N(0, In)

    I The analysis map that pushes πY,X to πX|Y=y∗ is then given by

    T(y, x) = S^X(y∗, ·)−1 ◦ S^X(y, x)

    Marzouk SIMDA seminar 19 / 43

  • A likelihood-free inference algorithm with maps

    [Diagram: T(y, x) maps the joint πY,X to the posterior πX|Y=y∗; forecast samples xi, simulated observations yi ∼ πY|X=xi]

    Transport map ensemble filter (see the code sketch below)
    1 Compute forecast ensemble x1, . . . , xM
    2 Generate samples (yi, xi) from πY,X with yi ∼ πY|X=xi
    3 Build an estimator T̂ of T
    4 Compute analysis ensemble as x^a_i = T̂(yi, xi) for i = 1, . . . , M

    Marzouk SIMDA seminar 20 / 43
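    A minimal sketch of steps 1–4 in the special case of a linear map estimator, which (as noted on a later slide) reduces to a perturbed-observation EnKF. The observation operator h, noise covariance, and toy ensemble below are illustrative assumptions, not the talk's configuration.

```python
# Minimal analysis-step sketch with a *linear* map estimator: fitting an
# un-normalized S_X(y, x) = x - K y by regression, the composition
# T(y, x) = S_X(y*, .)^{-1} o S_X(y, x) becomes x + K (y* - y).
import numpy as np

def analysis_step(X, h, obs_cov, y_star, rng):
    M = X.shape[0]
    # steps 1-2: simulate data for each forecast member, y_i ~ pi(y | x_i)
    Y = np.array([h(x) for x in X]) + rng.multivariate_normal(
        np.zeros(obs_cov.shape[0]), obs_cov, size=M)
    # step 3: estimate the linear gain K = Cov(x, y) Cov(y)^{-1} from the joint samples
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    K = np.linalg.solve(Yc.T @ Yc / (M - 1), Yc.T @ Xc / (M - 1)).T   # n x d
    # step 4: analysis ensemble x_i^a = x_i + K (y* - y_i)
    return X + (y_star - Y) @ K.T

# toy usage: 2-D state, observe the first component with noise std 0.5
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
h = lambda x: x[:1]
Xa = analysis_step(X, h, obs_cov=np.array([[0.25]]), y_star=np.array([1.0]), rng=rng)
print(Xa.mean(0))   # shifted toward the observation; the correlated component moves too
```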

  • Estimator for the analysis map

    I Recall the form of S :

    $$ S(y, x) = \begin{bmatrix} S^{\mathcal{Y}}(y) \\ S^{\mathcal{X}}(y, x) \end{bmatrix}, \qquad S_\sharp\, \pi_{Y,X} = N(0, I_{d+n}). $$

    I We propose a simple estimator T̂ of T:

    T̂(y, x) = Ŝ^X(y∗, ·)−1 ◦ Ŝ^X(y, x),

    where Ŝ is a maximum likelihood estimator of S
    I This is simply the earlier “maps from samples” approach!

    Marzouk SIMDA seminar 21 / 43

  • Map parameterizations

    $$ \hat{S}^k \in \operatorname*{arg\,min}_{S^k \in \mathcal{S}^h_{\triangle,k}} \; \frac{1}{M} \sum_{i=1}^M \Big( \tfrac{1}{2} S^k(x^i)^2 - \log \partial_k S^k(x^i) \Big) $$

    I Optimization is not needed for nonlinear separable parameterizations of the form Ŝ^k(x1:k) = αxk + g(x1:k−1) (just linear regression; see the sketch below)
    I Connection to EnKF: a linear parameterization of Ŝ^k yields a particular form of EnKF with “perturbed observations”
    I Choice of approximation space allows control of the bias and variance of Ŝ
      I Richer parameterizations yield less bias, but potentially higher variance

    Marzouk SIMDA seminar 22 / 43
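    A minimal sketch of the separable case above, under the assumption of polynomial features (not the talk's basis): the nonlinear part g comes from a linear regression of x_k on features of x_{1:k−1}, and the scale α then has a closed form.

```python
# Separable parameterization S_k(x) = alpha*x_k + g(x_{1:k-1}): the sample
# objective is minimized by (i) regressing x_k on features of x_{1:k-1} and
# (ii) setting alpha from the residual; no iterative optimization needed.
import numpy as np

def fit_separable_component(X_prev, x_k, feature_fn):
    Phi = feature_fn(X_prev)                               # M x p design matrix
    coeffs, *_ = np.linalg.lstsq(Phi, x_k, rcond=None)     # regression of x_k on features
    resid = x_k - Phi @ coeffs
    alpha = 1.0 / np.sqrt(np.mean(resid**2))               # minimizes (alpha^2/2) E[r^2] - log(alpha)
    # Fitted component: S_k(x) = alpha * (x_k - feature_fn(x_{1:k-1}) @ coeffs)
    return alpha, coeffs

# toy usage on banana-shaped samples, with assumed polynomial features of x1
rng = np.random.default_rng(4)
x1 = rng.standard_normal(5000)
x2 = x1**2 + 0.5 * rng.standard_normal(5000)
features = lambda x1: np.column_stack([np.ones_like(x1), x1, x1**2])
alpha, coeffs = fit_separable_component(x1, x2, features)
print(alpha, coeffs)    # roughly alpha ~ 2.0, coeffs ~ [0, 0, 1]
```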

  • Example: Lorenz-63

    Simple example: three-dimensional Lorenz-63 system

    $$ \frac{dX_1}{dt} = \sigma(X_2 - X_1), \qquad \frac{dX_2}{dt} = X_1(\rho - X_3) - X_2, \qquad \frac{dX_3}{dt} = X_1 X_2 - \beta X_3 $$

    I Chaotic setting: ρ = 28, σ = 10, β = 8/3
    I Fully observed, with additive Gaussian observation noise Ej ∼ N(0, 2²)
    I Assimilation interval ∆t = 0.1
    I Results computed over 2000 assimilation cycles, following spin-up
    I Map parameterizations: S^k(x1:k) = Σ_{i≤k} Ψi(xi), with Ψi = linear + {RBFs or sigmoids}

    Marzouk SIMDA seminar 23 / 43
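    A minimal sketch of this experimental setup (true trajectory plus noisy, fully-observed measurements); the initial condition and integrator tolerances are assumptions.

```python
# Integrate Lorenz-63 and record noisy observations every dt = 0.1, E_j ~ N(0, 2^2).
import numpy as np
from scipy.integrate import solve_ivp

def lorenz63(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x1, x2, x3 = x
    return [sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3]

rng = np.random.default_rng(5)
dt, n_cycles = 0.1, 2000
t_eval = dt * np.arange(n_cycles + 1)
sol = solve_ivp(lorenz63, (0.0, t_eval[-1]), y0=[1.0, 1.0, 1.0],   # assumed initial state
                t_eval=t_eval, rtol=1e-8, atol=1e-8)
truth = sol.y.T                                          # (n_cycles + 1) x 3 true states
obs = truth + rng.normal(scale=2.0, size=truth.shape)    # all components observed, noise std 2
print(truth.shape, obs.shape)
```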

  • Example: Lorenz-63

    Mean “tracking” error vs. ensemble size and choice of map

    Marzouk SIMDA seminar 24 / 43

  • Example: Lorenz-63

    What about comparison to the true Bayesian solution?

    Marzouk SIMDA seminar 25 / 43

  • “Localize” the map in high dimensions

    I Regularize the estimator Ŝ of S by imposing sparsity, e.g.,

    $$ \hat{S}(x_1, \ldots, x_4) = \begin{bmatrix} \hat{S}^1(x_1) \\ \hat{S}^2(x_1, x_2) \\ \hat{S}^3(x_1, x_2, x_3) \\ \hat{S}^4(x_1, x_2, x_3, x_4) \end{bmatrix} $$

    I The sparsity of the kth component of S depends on the sparsity of the marginal conditional function πXk|X1:k−1(xk|x1:k−1)

    I Localization heuristic: let each Ŝk depend on variables (xj)j

  • Lorenz-96 in chaotic regime (40-dimensional state)

    I A hard test-case configuration [Bengtsson et al. 2003]:

    $$ \frac{dX_j}{dt} = (X_{j+1} - X_{j-2})X_{j-1} - X_j + F, \quad j = 1, \ldots, 40 $$

    $$ Y_j = X_j + E_j, \quad j = 1, 3, 5, \ldots, 39 $$

    I F = 8 (chaotic) and Ej ∼ N(0, 0.5) (small noise for PF)
    I Time between observations: ∆obs = 0.4 (large)
    I Results computed over 2000 assimilation cycles, following spin-up
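    A minimal sketch of this Lorenz-96 configuration, treating 0.5 as the observation-noise variance and using an assumed initial condition; the odd-numbered components (1-based) are observed every ∆obs = 0.4. The shorter run length is just to keep the sketch quick.

```python
# Lorenz-96 dynamics (40-dimensional, F = 8) with partial, noisy observations.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96(t, x, F=8.0):
    # cyclic indexing: dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

rng = np.random.default_rng(6)
d, dt_obs = 40, 0.4
x0 = 8.0 + 0.01 * rng.standard_normal(d)          # small perturbation of the x_j = F fixed point
t_eval = dt_obs * np.arange(501)                  # fewer cycles than the talk's 2000
sol = solve_ivp(lorenz96, (0.0, t_eval[-1]), x0, t_eval=t_eval, rtol=1e-8, atol=1e-8)
truth = sol.y.T
obs_idx = np.arange(0, d, 2)                      # components 1, 3, 5, ..., 39 (1-based)
obs = truth[:, obs_idx] + rng.normal(scale=np.sqrt(0.5), size=(len(t_eval), len(obs_idx)))
print(truth.shape, obs.shape)
```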

  • Lorenz-96: “hard” case

    I The nonlinear filter is ≈ 25% more accurate in RMSE than EnKF

    Marzouk SIMDA seminar 28 / 43


  • Lorenz-96: “hard” case

    Marzouk SIMDA seminar 29 / 43

  • Lorenz-96: non-Gaussian noise

    I A heavy-tailed noise configuration:

    $$ \frac{dX_j}{dt} = (X_{j+1} - X_{j-2})X_{j-1} - X_j + F, \quad j = 1, \ldots, 40 $$

    $$ Y_j = X_j + E_j, \quad j = 1, 5, 9, 13, \ldots, 37 $$

    I F = 8 (chaotic) and Ej ∼ Laplace(λ = 1)
    I Time between observations: ∆obs = 0.1
    I Results computed over 2000 assimilation cycles, following spin-up

    Marzouk SIMDA seminar 30 / 43

  • Lorenz-96: non-Gaussian noise

    Marzouk SIMDA seminar 31 / 43

  • Lorenz-96: details on the filtering approximation

    [Figure: the local interaction structure referenced below]

    I Impose sparsity of the map with a 5-way interaction model (above)
    I Separable and nonlinear parameterization of each component:

    $$ \hat{S}^k(x_{j_1}, \ldots, x_{j_p}, x_k) = \psi(x_{j_1}) + \cdots + \psi(x_{j_p}) + \tilde{\psi}(x_k), \qquad \psi(x) = a_0 + a_1 x + \sum_{i>1} a_i \exp\!\big(-(x - c_i)^2/\sigma\big) $$

    I More general parameterizations are of course possible

    Marzouk SIMDA seminar 32 / 43
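    A minimal sketch of assembling the features implied by this parameterization; the RBF centers and width below are assumptions, and fitting (with a monotonicity constraint in x_k) would proceed along the lines of the earlier maximum-likelihood estimator.

```python
# Build the design matrix for psi(x) = a0 + a1*x + sum_i a_i exp(-(x - c_i)^2 / sigma),
# so each map component is linear in its coefficients (monotonicity in x_k still
# needs a separate constraint when fitting).
import numpy as np

def psi_features(x, centers=np.linspace(-3.0, 3.0, 7), sigma=1.0):
    # Returns [1, x, exp(-(x - c_1)^2/sigma), ..., exp(-(x - c_K)^2/sigma)] per sample
    x = np.asarray(x)[:, None]
    rbf = np.exp(-(x - centers[None, :])**2 / sigma)
    return np.hstack([np.ones_like(x), x, rbf])

def component_features(X_neighbors, x_k):
    # Separable structure: concatenate psi-features of each neighbor and of x_k
    blocks = [psi_features(X_neighbors[:, j]) for j in range(X_neighbors.shape[1])]
    blocks.append(psi_features(x_k))
    return np.hstack(blocks)

# toy usage: features for a component depending on two neighbors and x_k
rng = np.random.default_rng(7)
Xn, xk = rng.standard_normal((100, 2)), rng.standard_normal(100)
print(component_features(Xn, xk).shape)   # (100, 3 * 9)
```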

  • Lorenz-96: tracking performance of the filter

    I Simple and localized nonlinearities have a significant impact

    Marzouk SIMDA seminar 33 / 43

  • Remarks and questions

    I Nonlinear generalization of the EnKF: move the ensemble members via local nonlinear transport maps, no weights or degeneracy

    I Learn non-Gaussian features via nonlinear continuous transport and convex optimization

    I Choice of map basis and sparsity provide regularization (e.g., localization)

    I In principle, the filter is consistent as S^h_△ is enriched and M → ∞. But what is a good choice of S^h_△ for any fixed ensemble size M?

    I Are there better estimators than maximum likelihood? What are the finite-sample statistical properties of candidate estimators, and properties of the associated optimization problems?

    I How to relate map structure/parameterization to the underlying dynamics, observation operators, and data?

    Marzouk SIMDA seminar 34 / 43


  • Sparsity of triangular maps

    Theorem [Spantini et al. 2018]

    Conditional independence properties of π (encoded by a graph G) define a lower bound on the sparsity of S such that S♯η = π

    [Figure: graph G of π and the corresponding sparsity pattern of S; a state-space model graph and its sparsity pattern of S]

    Main idea: Discover sparse structure using the ATM algorithm

    Marzouk SIMDA seminar 35 / 43

  • Adaptive transport map (ATM) algorithm (Baptista et al. 2020)

    Goal: Approximate map given n i.i.d. samples from π

    Greedy enrichment procedure
    I Look for a sparse expansion f(x) = Σ_{α∈Λ} cα ψα(x)
    I Use tensor-product Hermite functions ψα(x) = ∏_j Pαj(xj) exp(−‖x‖²/2)
    I Add one element to the set of active multi-indices Λt at a time
    I Restrict Λt to be downward closed
    I Search for new features in the reduced margin of Λt

    Marzouk SIMDA seminar 36 / 43

  • Adaptive transport map (ATM) algorithm (Baptista et al. 2020)

    Goal: Approximate map given n i.i.d. samples from π

    Initialize Λ0 = ∅ (i.e., f0 = 0)
    For t = 0, . . . , m:
      1 Find the reduced margin ΛRM_t of Λt
      2 Add a new feature: Λt+1 = Λt ∪ {α∗t+1} for some α∗t+1 ∈ ΛRM_t
      3 Update the approximation: ft+1 = argmin_{f ∈ span(ψΛt+1)} Lk(f)

    In practice, we can choose m (# of features) via cross-validation (a code sketch follows below)

    Marzouk SIMDA seminar 36 / 43
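    A minimal sketch of the greedy loop under simplifying assumptions: a plain least-squares objective in place of the map objective Lk, probabilists' Hermite polynomial features without the Gaussian weight, and a brute-force search over the reduced margin.

```python
# Greedy, downward-closed feature enrichment over a reduced margin (toy version).
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def reduced_margin(active):
    # Multi-indices not in `active` whose backward neighbors are all in `active`
    d = len(next(iter(active)))
    cands = set()
    for a in active:
        for j in range(d):
            b = tuple(a[k] + (k == j) for k in range(d))
            if b not in active:
                cands.add(b)
    def admissible(a):
        return all(a[j] == 0 or tuple(a[k] - (k == j) for k in range(d)) in active
                   for j in range(d))
    return [a for a in cands if admissible(a)]

def feature(X, alpha):
    # Tensor-product Hermite feature: prod_j He_{alpha_j}(x_j)
    cols = [hermeval(X[:, j], [0] * a + [1]) if a > 0 else np.ones(len(X))
            for j, a in enumerate(alpha)]
    return np.prod(cols, axis=0)

def greedy_fit(X, y, n_features):
    active = [(0, 0)]                      # start from the constant feature
    for _ in range(n_features - 1):
        best = None
        for a in reduced_margin(set(active)):
            Phi = np.column_stack([feature(X, b) for b in active + [a]])
            c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.mean((y - Phi @ c) ** 2)
            if best is None or err < best[0]:
                best = (err, a)
        active.append(best[1])             # add the best reduced-margin feature
    return active

rng = np.random.default_rng(8)
X = rng.standard_normal((1000, 2))
y = X[:, 0] + 0.5 * X[:, 1] + 0.2 * X[:, 0] * X[:, 1]   # a sparse polynomial to recover
print(greedy_fit(X, y, n_features=4))      # [(0, 0), (1, 0), (0, 1), (1, 1)]
```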

  • Numerical example: Lorenz-96 data

    I Distribution of the state at a fixed time, starting from a Gaussian initial condition

    [Plots: log-likelihood and number of features m vs. sample size n; sparsity pattern of S for n = 316; conditional independence structure]

    Takeaways:
    I ATM discovers conditional independence structure in the state
    I Natural semi-parametric method that gradually increases m with n

    Marzouk SIMDA seminar 37 / 43

  • Back to conditional maps: link to ABC

    The “analysis” step of the ensemble filtering scheme is an instance of likelihood-free inference or approximate Bayesian computation (ABC):

    I Central idea: only need to simulate from πY,X in order to construct T̂ and thus draw samples from πX|Y=y∗

    I The map T̂ has some remarkable properties. . .

    Marzouk SIMDA seminar 38 / 43


  • Compare two approaches for posterior sampling

    Marzouk SIMDA seminar 39 / 43

  • Compare two approaches for posterior sampling

    $$ X|y^* \sim \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1}_\sharp\, \eta \qquad \text{vs.} \qquad X|y^* \sim \hat{T}_\sharp\, \pi_{Y,X}, \quad \text{for } \hat{T} = \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1} \circ \hat{S}^{\mathcal{X}}(\cdot, \cdot) $$

    I Propagating the joint prior through composed maps has lower error!

    Marzouk SIMDA seminar 39 / 43

  • Composed maps for ABC

    I Simple and incomplete examples:

    I Remove dependence of the mean on y: Ŝ^X(y, x) = x − E[X|Y = y]
      I If E[X|Y = y] ≈ c + βy, we have the EnKF (see the derivation below)

    I Remove dependence of the mean and variance on y: Ŝ^X(y, x) = Var(X|Y = y)−1/2 (x − E[X|Y = y])

    I The transport map framework offers a much more general set of possibilities, which in principle includes the exact transformation

    I Current work: alternative optimization objectives (hence map estimators T̂) and tailored parameterizations

    I Also useful for optimal Bayesian experimental design!

    Marzouk SIMDA seminar 40 / 43
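    To make the first example explicit: with the mean-removal map, the composition T̂ = Ŝ^X(y∗, ·)−1 ◦ Ŝ^X simply swaps conditional means, and a linear regression model for the conditional mean recovers the perturbed-observation EnKF update,

$$ \hat{T}(y, x) = \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1}\big(\hat{S}^{\mathcal{X}}(y, x)\big) = x - \mathbb{E}[X|Y = y] + \mathbb{E}[X|Y = y^*] \;\approx\; x + \beta\,(y^* - y) \quad \text{if } \mathbb{E}[X|Y = y] \approx c + \beta y, $$

    with β playing the role of the Kalman gain estimated from the joint samples (yi, xi).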


  • SIMDA plans, connections, and discussion

    I Nonlinear ensemble algorithms for filtering, smoothing, and joint state-parameter inference (quite general)

    I Generative modeling with transport maps: will we need to build priors (spatiotemporal statistical models for sea ice thickness) from data?

    I Likelihood-free inference problems:
      I Are there sea ice problems where we will need to extract conditional relationships from large data sets? Conditional density estimation, conditional simulation.

    I High-dimensional and disparate sources of data

    I Are there strong/essential non-Gaussianities in the sea ice problem? Tail behaviors?

    I Explore cost-accuracy tradeoffs: balance posterior fidelity with computational effort

    I Use conditional density estimates in optimal experimental design

    Marzouk SIMDA seminar 41 / 43

  • Conclusions

    Thanks for your attention!

    Marzouk SIMDA seminar 42 / 43

  • References

    I R. Baptista, O. Zahm, Y. Marzouk. “An adaptive transport framework for joint and conditional density estimation.” arXiv:2009.10303, 2020.

    I J. Zech, Y. Marzouk. “Sparse approximation of triangular transports on bounded domains.” arXiv:2006.06994, 2020.

    I N. Kovachki, R. Baptista, B. Hosseini, Y. Marzouk. “Conditional sampling with monotone GANs.” arXiv:2006.06755, 2020.

    I A. Spantini, R. Baptista, Y. Marzouk. “Coupling techniques for nonlinear ensemble filtering.” arXiv:1907.00389, 2020.

    I M. Brennan, D. Bigoni, O. Zahm, A. Spantini, Y. Marzouk. “Greedy inference with structure-exploiting lazy maps.” NeurIPS 2020, arXiv:1906.00031.

    I O. Zahm, T. Cui, K. Law, A. Spantini, Y. Marzouk. “Certified dimension reduction in nonlinear Bayesian inverse problems.” arXiv:1807.03712, 2019.

    I A. Spantini, D. Bigoni, Y. Marzouk. “Inference via low-dimensional couplings.” JMLR 19(66): 1–71, 2018.

    I M. Parno, Y. Marzouk. “Transport map accelerated Markov chain Monte Carlo.” SIAM JUQ 6: 645–682, 2018.

    I R. Morrison, R. Baptista, Y. Marzouk. “Beyond normality: learning sparse probabilistic graphical models in the non-Gaussian setting.” NeurIPS 2017.

    Marzouk SIMDA seminar 43 / 43