
  • Transport methods for nonlinear filtering and likelihood-free inference

    Youssef Marzouk, joint work with Ricardo Baptista, Alessio Spantini, & Olivier Zahm

    Department of Aeronautics and Astronautics
    Center for Computational Science and Engineering

    Statistics and Data Science Center
    Massachusetts Institute of Technology

    http://uqgroup.mit.edu

    Sea Ice Modeling and Data Assimilation (SIMDA) Seminar

    6 October 2020

    Marzouk SIMDA seminar 1 / 43


  • Main ideas #1

    I Online Bayesian inference in dynamical models with sequential data is central to our proposed work
      I Data assimilation, in the broadest sense

    I New nonlinear ensemble schemes, based on transport, can provide a consistent approach to inference in general non-Gaussian settings
      I Generalizations of the ensemble Kalman filter (EnKF)

    I These schemes are methods for likelihood-free inference (LFI) or approximate Bayesian computation (ABC), and have broader utility!

    Marzouk SIMDA seminar 2 / 43

  • Main ideas #2

    I Application to high-dimensional and disparate data requires detecting and exploiting low-dimensional structure
      I Many varieties of structure, e.g.: sparsity and conditional independence (especially given disparate data sources), smoothness/low rank, piecewise smoothness, hierarchical low rank, multiscale structure

    I Structure in parameters/state and in data!
    I Relate also to data-driven discretizations of inverse problems

    I These new inference methods can also facilitate optimal experimental design

    Marzouk SIMDA seminar 3 / 43

  • Deterministic couplings of probability measures

    [Diagram: a transport map T pushes the reference density η forward to the target density π; the inverse map S = T−1 pushes π back to η]

    Core idea

    I Choose a reference distribution η (e.g., standard Gaussian)
    I Seek a transport map T : Rn → Rn such that T♯η = π
    I Equivalently, find S = T−1 such that S♯π = η
    I In principle, enables exact (independent, unweighted) sampling!
    I Satisfying these conditions only approximately can still be useful!

    Marzouk SIMDA seminar 4 / 43




  • Choice of transport map

    Consider the triangular Knothe–Rosenblatt rearrangement on Rd

    $$ S(x) = \begin{bmatrix} S^1(x_1) \\ S^2(x_1, x_2) \\ \vdots \\ S^d(x_1, x_2, \ldots, x_d) \end{bmatrix} $$

    1 A unique S such that S♯π = η exists under mild conditions on π and η
    2 The map is easily invertible, and the Jacobian ∇S is simple to evaluate
    3 Monotonicity is essentially one-dimensional: ∂xk S^k > 0
    4 Each component S^k characterizes one marginal conditional:

    π(x) = π(x1) π(x2|x1) · · · π(xd|x1, . . . , xd−1)

    Marzouk SIMDA seminar 5 / 43
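    To make property 2 above concrete, here is a minimal sketch, not from the talk, using a hand-picked 2-D monotone triangular map (the components S1 and S2 below are arbitrary illustrations). Inversion proceeds one scalar equation at a time, and the log-determinant of the Jacobian is a sum of diagonal partial derivatives.

```python
# Minimal sketch (not from the talk): a hand-picked 2-D monotone triangular map,
# illustrating why such maps are easy to invert and have cheap Jacobians.
import numpy as np
from scipy.optimize import brentq

# Hypothetical map components: S1 depends only on x1, S2 on (x1, x2),
# and each is strictly increasing in its last argument.
def S1(x1):
    return x1 + 0.5 * np.tanh(x1)

def S2(x1, x2):
    return np.exp(0.3 * x1) * x2 + 0.1 * x1**2

def S(x):
    return np.array([S1(x[0]), S2(x[0], x[1])])

def S_inverse(z, bracket=(-50.0, 50.0)):
    # Triangular structure => invert one scalar equation at a time.
    x1 = brentq(lambda t: S1(t) - z[0], *bracket)
    x2 = brentq(lambda t: S2(x1, t) - z[1], *bracket)
    return np.array([x1, x2])

def log_det_jacobian(x, eps=1e-6):
    # det(∇S) is the product of the diagonal entries ∂x1 S1 and ∂x2 S2
    # (approximated here by central finite differences).
    d1 = (S1(x[0] + eps) - S1(x[0] - eps)) / (2 * eps)
    d2 = (S2(x[0], x[1] + eps) - S2(x[0], x[1] - eps)) / (2 * eps)
    return np.log(d1) + np.log(d2)

x = np.array([0.7, -1.2])
z = S(x)
print(np.allclose(S_inverse(z), x))   # True: sequential 1-D inversion recovers x
print(log_det_jacobian(x))
```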

  • Ubiquity of triangular maps

    Many “flows” proposed in ML are special cases of triangular maps, e.g.,

    I NICE: Nonlinear independent component estimation [Dinh et al. 2015]

    Sk(x1, . . . , xk) = µk(xx

  • Ubiquity of triangular maps

    Many “flows” proposed in ML are special cases of triangular maps, e.g.,

    I NICE: Nonlinear independent component estimation [Dinh et al. 2015]

    Sk(x1, . . . , xk) = µk(xx

  • How to construct triangular maps?

    Some past work: “maps from densities,” i.e., variational inference with the direct map T [Moselhy & M 2012]

    $$ \min_{T \in \mathcal{T}^h_\triangle} D_{KL}(T_\sharp\, \eta \,\|\, \pi) \;=\; \min_{T \in \mathcal{T}^h_\triangle} D_{KL}(\eta \,\|\, T^{-1}_\sharp\, \pi) $$

    I π is the “target” density on Rn; η is, e.g., N(0, In)
    I T^h_△ is a set of monotone lower triangular maps
      I T^{h→∞}_△ contains the Knothe–Rosenblatt rearrangement
    I Expectation is with respect to the reference measure η
      I Compute via, e.g., Monte Carlo, sparse quadrature
    I Use unnormalized evaluations of π and its gradients
      I No MCMC or importance sampling
      I In general non-convex, unless π is log-concave

    Marzouk SIMDA seminar 7 / 43
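    As a bridge to the next slide: expanding the KL objective above with the change-of-variables formula, and using the monotone triangular structure of T (so that det ∇T = ∏_k ∂_{x_k} T^k), gives the sample-ready form used below; the constant collects terms that do not depend on T.

$$ D_{KL}(T_\sharp \eta \,\|\, \pi) = \mathbb{E}_\eta\big[\log \eta - \log|\det \nabla T| - \log \pi \circ T\big] = \mathbb{E}_\eta\Big[-\log \pi \circ T - \sum_k \log \partial_{x_k} T^k\Big] + \text{const.} $$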


  • Illustrative example

    $$ \min_T \; \mathbb{E}_\eta\Big[ -\log \pi \circ T \;-\; \sum_k \log \partial_{x_k} T^k \Big] $$

    I Parameterized map T ∈ T^h_△ ⊂ T△
    I Optimize over coefficients of the parameterization
    I Use gradient-based optimization
    I The posterior is in the tail of the reference

    Marzouk SIMDA seminar 7 / 43
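    A minimal runnable sketch of this "maps from densities" optimization, under stated assumptions: a 1-D skew-normal target known only up to a constant, a cubic monotone map, and plain Monte Carlo for the expectation. None of these choices come from the talk.

```python
# Minimal "maps from densities" sketch (assumptions: 1-D, a cubic monotone map,
# plain Monte Carlo for the expectation). Not the talk's parameterization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)          # samples from the reference eta = N(0, 1)

def log_pi_unnorm(z, alpha=4.0):
    # Unnormalized skew-normal target: pi(z) proportional to phi(z) * Phi(alpha z)
    return -0.5 * z**2 + norm.logcdf(alpha * z)

def T_and_dT(theta, x):
    # Monotone map T(x) = a + exp(b)*x + exp(c)*x^3, so T'(x) > 0 for all x
    a, b, c = theta
    T = a + np.exp(b) * x + np.exp(c) * x**3
    dT = np.exp(b) + 3.0 * np.exp(c) * x**2
    return T, dT

def objective(theta):
    # Monte Carlo estimate of E_eta[ -log pi(T(x)) - log T'(x) ]
    T, dT = T_and_dT(theta, x)
    return np.mean(-log_pi_unnorm(T) - np.log(dT))

res = minimize(objective, x0=np.zeros(3), method="BFGS")
T_opt, _ = T_and_dT(res.x, x)
print(res.x, T_opt.mean(), T_opt.std())   # pushed-forward samples approximate pi
```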


  • How to construct triangular maps?

    Alternative formulation: “maps from samples” [Parno PhD thesis, 2014]

    I Given samples (x^i)_{i=1}^M ∼ π: find components via convex (w.r.t. S^k) constrained minimization:

    $$ \min_S D_{KL}(\pi \,\|\, S^\sharp \eta) \;\Longleftrightarrow\; \min_{S^k:\, \partial_k S^k > 0} \; \mathbb{E}_\pi\Big[ \tfrac{1}{2} S^k(x_{1:k})^2 - \log \partial_k S^k(x_{1:k}) \Big] \quad \forall k $$

    I Approximate Eπ given i.i.d. samples from π: KL minimization is equivalent to maximum likelihood estimation (see the sketch after this slide)

    $$ \hat{S}^k \in \operatorname*{arg\,min}_{S^k \in \mathcal{S}^h_{\triangle,k}} \; \frac{1}{M} \sum_{i=1}^M \Big( \tfrac{1}{2} S^k(x^i_{1:k})^2 - \log \partial_k S^k(x^i_{1:k}) \Big) $$

    Marzouk SIMDA seminar 8 / 43
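    A minimal sketch of the sample-based objective for a single component, under assumptions not in the talk: 2-D "banana"-shaped samples and a separable component S2(x1, x2) = g(x1) + exp(a)·x2 with a quadratic g, so monotonicity in x2 holds by construction.

```python
# Minimal "maps from samples" sketch: fit one lower-triangular component by
# minimizing the sample-average objective on the slide. The basis is an assumption.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = 5000
x1 = rng.standard_normal(M)
x2 = x1**2 + 0.5 * rng.standard_normal(M)          # samples from a non-Gaussian pi
X = np.column_stack([x1, x2])

def S2_and_d2(theta, X):
    # g(x1) is quadratic in x1; monotonicity in x2 is enforced via exp(a) > 0
    c0, c1, c2, a = theta
    S2 = c0 + c1 * X[:, 0] + c2 * X[:, 0]**2 + np.exp(a) * X[:, 1]
    d2 = np.full(len(X), np.exp(a))                 # d S2 / d x2
    return S2, d2

def objective(theta):
    S2, d2 = S2_and_d2(theta, X)
    return np.mean(0.5 * S2**2 - np.log(d2))

res = minimize(objective, x0=np.zeros(4), method="BFGS")
S2, _ = S2_and_d2(res.x, X)
print(S2.mean(), S2.std())   # close to (0, 1): S2 pushes pi(x2 | x1) toward N(0, 1)
```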

  • Low-dimensional structure of transport maps

    An underlying challenge: maps in high dimensions
    I Major bottleneck: representation of the map, e.g., cardinality of the map basis
    I How to make the construction/representation of high-dimensional transports tractable?

    Main ideas:
    1 Exploit Markov structure of the target distribution
      I Leads to sparsity and/or decomposability of transport maps [Spantini, Bigoni, & M JMLR 2018]
    2 Exploit low-rank structure
      I Common in inverse problems; near-identity or “lazy” maps [Brennan et al. NeurIPS 2020, Zahm et al. 2018]

    Marzouk SIMDA seminar 9 / 43


  • Remainder of this talk: sequential and “likelihood-free”

    Can transport help solve sequential inference problems where key density functions cannot be evaluated?

    [image: NCAR]

    Marzouk SIMDA seminar 10 / 43

  • Formalize as Bayesian filtering

    I Nonlinear/non-Gaussian state-space model:
      I Transition density πZk|Zk−1
      I Observation density (likelihood) πYk|Zk

    [Diagram: hidden Markov model with states Z0, Z1, Z2, Z3, . . . , ZN and observations Y0, Y1, Y2, Y3, . . . , YN]

    I Focus on recursively approximating the filtering distribution: πZk|y0:k → πZk+1|y0:k+1 (marginals of the full Bayesian solution)

    Marzouk SIMDA seminar 11 / 43

  • Problem setting

    I Consider the filtering of state-space models with:
      1 High-dimensional states
      2 Challenging nonlinear dynamics
      3 Intractable transition kernels: can only obtain forecast samples, i.e., draws from πZk+1|zk
      4 Limited model evaluations, e.g., small ensemble sizes
      5 Sparse and local observations

    I These constraints reflect challenges faced in the sea ice prediction problem. . .

    Marzouk SIMDA seminar 12 / 43

  • Ensemble Kalman filter

    I State-of-the-art results (in terms of tracking) are often obtained with the ensemble Kalman filter (EnKF)

    [Diagram: πZk−1|Y0:k−1 → (forecast step) → πZk|Y0:k−1 → (analysis step: Bayesian inference) → πZk|Y0:k]

    I Move samples via an affine transformation; no weights or resampling!
    I Yet ultimately inconsistent: does not converge to the true posterior

    Can we improve and generalize the EnKF while preserving scalability?

    Marzouk SIMDA seminar 13 / 43


  • Assimilation step

    At any assimilation time k, we have a Bayesian inference problem:

    [Diagram: prior (forecast) πX with samples xi, and posterior πX|Y=y∗]

    I πX is the forecast distribution on Rn
    I πY|X is the likelihood of the observations Y ∈ Rd
    I πX|Y=y∗ is the filtering distribution for a realization y∗ of the data

    Goal: sample the posterior given only prior samples x1, . . . , xM and the ability to simulate data yi | xi

    Marzouk SIMDA seminar 14 / 43

  • Inference as transportation of measure

    I Seek a map T that pushes forward prior to posterior

    (x1, . . . , xM) ∼ πX =⇒ (T(x1), . . . , T(xM)) ∼ πX|Y=y∗

    I The map induces a coupling between prior and posterior measures

    [Diagram: a transport map sends prior samples xi ∼ πX to posterior samples T(xi) ∼ πX|Y=y∗]

    How to construct a “good” coupling from very few prior samples?

    Marzouk SIMDA seminar 15 / 43

  • Consider the joint distribution of state and observations

    [Diagram: a map T(y, x) sends the joint distribution πY,X to the posterior πX|Y=y∗]

    I Construct a map T from the joint distribution πY,X to the posterior
    I T can be computed via convex optimization given samples from πY,X
    I Sample πY,X using the forecast ensemble and the likelihood: (yi, xi) with yi ∼ πY|X=xi
    I Intuition: a generalization of the “perturbed observation” EnKF

    Marzouk SIMDA seminar 16 / 43

  • Couple the joint distribution with a standard normal

    [Diagram: T : Rd+n → Rn maps the joint πY,X to the posterior πX|Y=y∗]

    We can find T by computing a Knothe–Rosenblatt (KR) rearrangement S between πY,X and N(0, Id+n)

    [Diagram: S : Rd+n → Rd+n maps the joint πY,X to N(0, Id+n)]

    I We will show how to derive T from S . . .

    Marzouk SIMDA seminar 17 / 43

  • Triangular maps enable conditional simulation

    $$ S(x_1, \ldots, x_m) = \begin{bmatrix} S^1(x_1) \\ S^2(x_1, x_2) \\ \vdots \\ S^m(x_1, x_2, \ldots, x_m) \end{bmatrix} $$

    I Each component S^k links marginal conditionals of π and η
    I For instance, if η = N(0, I), then for all x1, . . . , xk−1 ∈ Rk−1,

    ξ ↦ S^k(x1, . . . , xk−1, ξ) pushes πXk|X1:k−1(ξ|x1:k−1) to N(0, 1)

    I Simulate the conditional πXk|X1:k−1 by inverting a 1-D map ξ ↦ S^k(x1:k−1, ξ) at Gaussian samples (need triangular structure)

    Marzouk SIMDA seminar 18 / 43
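    A minimal sketch of this conditional-simulation mechanism, assuming a known (not fitted) triangular map for a 2-D banana-shaped density; each conditional draw solves one scalar equation.

```python
# Minimal conditional-simulation sketch (assumption: a known 2-D triangular map
# for a banana-shaped density, rather than a map fit from data).
import numpy as np
from scipy.optimize import brentq

# For pi(x1, x2) with x1 ~ N(0,1) and x2 | x1 ~ N(x1^2, 0.5^2), one KR map is:
def S1(x1):
    return x1

def S2(x1, x2):
    return (x2 - x1**2) / 0.5          # monotone increasing in x2

def sample_conditional(x1_star, n_samples, rng):
    # Draw xi ~ pi(x2 | x1 = x1_star) by inverting the 1-D map xi -> S2(x1_star, xi)
    # at standard normal samples z.
    zs = rng.standard_normal(n_samples)
    return np.array([brentq(lambda xi: S2(x1_star, xi) - z, -100.0, 100.0) for z in zs])

rng = np.random.default_rng(2)
samples = sample_conditional(x1_star=1.5, n_samples=2000, rng=rng)
print(samples.mean(), samples.std())   # close to (1.5**2, 0.5) = (2.25, 0.5)
```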

  • Filtering: the analysis map

    I We are interested in the KR map S that pushes πY,X to N(0, Id+n)
    I The KR map immediately has a block structure

    $$ S(y, x) = \begin{bmatrix} S^{\mathcal{Y}}(y) \\ S^{\mathcal{X}}(y, x) \end{bmatrix}, $$

    which suggests two properties:

    S^X pushes πY,X to N(0, In)

    ξ ↦ S^X(y∗, ξ) pushes πX|Y=y∗ to N(0, In)

    I The analysis map that pushes πY,X to πX|Y=y∗ is then given by

    T(y, x) = S^X(y∗, ·)−1 ◦ S^X(y, x)

    Marzouk SIMDA seminar 19 / 43

  • A likelihood-free inference algorithm with maps

    [Diagram: T(y, x) maps the joint πY,X to the posterior πX|Y=y∗; forecast samples xi, simulated observations yi ∼ πY|X=xi]

    Transport map ensemble filter (see the code sketch below)
    1 Compute forecast ensemble x1, . . . , xM
    2 Generate samples (yi, xi) from πY,X with yi ∼ πY|X=xi
    3 Build an estimator T̂ of T
    4 Compute analysis ensemble as x^a_i = T̂(yi, xi) for i = 1, . . . , M

    Marzouk SIMDA seminar 20 / 43
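    A minimal sketch of steps 1–4 in the special case of a linear map estimator, which (as noted on a later slide) reduces to a perturbed-observation EnKF. The observation operator h, noise covariance, and toy ensemble below are illustrative assumptions, not the talk's configuration.

```python
# Minimal analysis-step sketch with a *linear* map estimator: fitting an
# un-normalized S_X(y, x) = x - K y by regression, the composition
# T(y, x) = S_X(y*, .)^{-1} o S_X(y, x) becomes x + K (y* - y).
import numpy as np

def analysis_step(X, h, obs_cov, y_star, rng):
    M = X.shape[0]
    # steps 1-2: simulate data for each forecast member, y_i ~ pi(y | x_i)
    Y = np.array([h(x) for x in X]) + rng.multivariate_normal(
        np.zeros(obs_cov.shape[0]), obs_cov, size=M)
    # step 3: estimate the linear gain K = Cov(x, y) Cov(y)^{-1} from the joint samples
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    K = np.linalg.solve(Yc.T @ Yc / (M - 1), Yc.T @ Xc / (M - 1)).T   # n x d
    # step 4: analysis ensemble x_i^a = x_i + K (y* - y_i)
    return X + (y_star - Y) @ K.T

# toy usage: 2-D state, observe the first component with noise std 0.5
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
h = lambda x: x[:1]
Xa = analysis_step(X, h, obs_cov=np.array([[0.25]]), y_star=np.array([1.0]), rng=rng)
print(Xa.mean(0))   # shifted toward the observation; the correlated component moves too
```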

  • Estimator for the analysis map

    I Recall the form of S :

    $$ S(y, x) = \begin{bmatrix} S^{\mathcal{Y}}(y) \\ S^{\mathcal{X}}(y, x) \end{bmatrix}, \qquad S_\sharp\, \pi_{Y,X} = N(0, I_{d+n}). $$

    I We propose a simple estimator T̂ of T:

    T̂(y, x) = Ŝ^X(y∗, ·)−1 ◦ Ŝ^X(y, x),

    where Ŝ is a maximum likelihood estimator of S
    I This is simply the earlier “maps from samples” approach!

    Marzouk SIMDA seminar 21 / 43

  • Map parameterizations

    $$ \hat{S}^k \in \operatorname*{arg\,min}_{S^k \in \mathcal{S}^h_{\triangle,k}} \; \frac{1}{M} \sum_{i=1}^M \Big( \tfrac{1}{2} S^k(x^i)^2 - \log \partial_k S^k(x^i) \Big) $$

    I Optimization is not needed for nonlinear separable parameterizations of the form Ŝ^k(x1:k) = αxk + g(x1:k−1) (just linear regression; see the sketch below)
    I Connection to EnKF: a linear parameterization of Ŝ^k yields a particular form of EnKF with “perturbed observations”
    I Choice of approximation space allows control of the bias and variance of Ŝ
      I Richer parameterizations yield less bias, but potentially higher variance

    Marzouk SIMDA seminar 22 / 43
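    A minimal sketch of the separable case above, under the assumption of polynomial features (not the talk's basis): the nonlinear part g comes from a linear regression of x_k on features of x_{1:k−1}, and the scale α then has a closed form.

```python
# Separable parameterization S_k(x) = alpha*x_k + g(x_{1:k-1}): the sample
# objective is minimized by (i) regressing x_k on features of x_{1:k-1} and
# (ii) setting alpha from the residual; no iterative optimization needed.
import numpy as np

def fit_separable_component(X_prev, x_k, feature_fn):
    Phi = feature_fn(X_prev)                               # M x p design matrix
    coeffs, *_ = np.linalg.lstsq(Phi, x_k, rcond=None)     # regression of x_k on features
    resid = x_k - Phi @ coeffs
    alpha = 1.0 / np.sqrt(np.mean(resid**2))               # minimizes (alpha^2/2) E[r^2] - log(alpha)
    # Fitted component: S_k(x) = alpha * (x_k - feature_fn(x_{1:k-1}) @ coeffs)
    return alpha, coeffs

# toy usage on banana-shaped samples, with assumed polynomial features of x1
rng = np.random.default_rng(4)
x1 = rng.standard_normal(5000)
x2 = x1**2 + 0.5 * rng.standard_normal(5000)
features = lambda x1: np.column_stack([np.ones_like(x1), x1, x1**2])
alpha, coeffs = fit_separable_component(x1, x2, features)
print(alpha, coeffs)    # roughly alpha ~ 2.0, coeffs ~ [0, 0, 1]
```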

  • Example: Lorenz-63

    Simple example: three-dimensional Lorenz-63 system

    $$ \frac{dX_1}{dt} = \sigma(X_2 - X_1), \qquad \frac{dX_2}{dt} = X_1(\rho - X_3) - X_2, \qquad \frac{dX_3}{dt} = X_1 X_2 - \beta X_3 $$

    I Chaotic setting: ρ = 28, σ = 10, β = 8/3
    I Fully observed, with additive Gaussian observation noise Ej ∼ N(0, 2²)
    I Assimilation interval ∆t = 0.1
    I Results computed over 2000 assimilation cycles, following spin-up
    I Map parameterizations: S^k(x1:k) = Σ_{i≤k} Ψi(xi), with Ψi = linear + {RBFs or sigmoids}

    Marzouk SIMDA seminar 23 / 43
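    A minimal sketch of this experimental setup (true trajectory plus noisy, fully-observed measurements); the initial condition and integrator tolerances are assumptions.

```python
# Integrate Lorenz-63 and record noisy observations every dt = 0.1, E_j ~ N(0, 2^2).
import numpy as np
from scipy.integrate import solve_ivp

def lorenz63(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x1, x2, x3 = x
    return [sigma * (x2 - x1), x1 * (rho - x3) - x2, x1 * x2 - beta * x3]

rng = np.random.default_rng(5)
dt, n_cycles = 0.1, 2000
t_eval = dt * np.arange(n_cycles + 1)
sol = solve_ivp(lorenz63, (0.0, t_eval[-1]), y0=[1.0, 1.0, 1.0],   # assumed initial state
                t_eval=t_eval, rtol=1e-8, atol=1e-8)
truth = sol.y.T                                          # (n_cycles + 1) x 3 true states
obs = truth + rng.normal(scale=2.0, size=truth.shape)    # all components observed, noise std 2
print(truth.shape, obs.shape)
```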

  • Example: Lorenz-63

    Mean “tracking” error vs. ensemble size and choice of map

    Marzouk SIMDA seminar 24 / 43

  • Example: Lorenz-63

    What about comparison to the true Bayesian solution?

    Marzouk SIMDA seminar 25 / 43

  • “Localize” the map in high dimensions

    I Regularize the estimator Ŝ of S by imposing sparsity, e.g.,

    $$ \hat{S}(x_1, \ldots, x_4) = \begin{bmatrix} \hat{S}^1(x_1) \\ \hat{S}^2(x_1, x_2) \\ \hat{S}^3(x_1, x_2, x_3) \\ \hat{S}^4(x_1, x_2, x_3, x_4) \end{bmatrix} $$

    I The sparsity of the kth component of S depends on the sparsity of the marginal conditional function πXk|X1:k−1(xk|x1:k−1)

    I Localization heuristic: let each Ŝk depend on variables (xj)j

  • Lorenz-96 in chaotic regime (40-dimensional state)

    I A hard test-case configuration [Bengtsson et al. 2003]:

    $$ \frac{dX_j}{dt} = (X_{j+1} - X_{j-2})X_{j-1} - X_j + F, \quad j = 1, \ldots, 40 $$

    $$ Y_j = X_j + E_j, \quad j = 1, 3, 5, \ldots, 39 $$

    I F = 8 (chaotic) and Ej ∼ N(0, 0.5) (small noise for PF)
    I Time between observations: ∆obs = 0.4 (large)
    I Results computed over 2000 assimilation cycles, following spin-up
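    A minimal sketch of this Lorenz-96 configuration, treating 0.5 as the observation-noise variance and using an assumed initial condition; the odd-numbered components (1-based) are observed every ∆obs = 0.4. The shorter run length is just to keep the sketch quick.

```python
# Lorenz-96 dynamics (40-dimensional, F = 8) with partial, noisy observations.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz96(t, x, F=8.0):
    # cyclic indexing: dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

rng = np.random.default_rng(6)
d, dt_obs = 40, 0.4
x0 = 8.0 + 0.01 * rng.standard_normal(d)          # small perturbation of the x_j = F fixed point
t_eval = dt_obs * np.arange(501)                  # fewer cycles than the talk's 2000
sol = solve_ivp(lorenz96, (0.0, t_eval[-1]), x0, t_eval=t_eval, rtol=1e-8, atol=1e-8)
truth = sol.y.T
obs_idx = np.arange(0, d, 2)                      # components 1, 3, 5, ..., 39 (1-based)
obs = truth[:, obs_idx] + rng.normal(scale=np.sqrt(0.5), size=(len(t_eval), len(obs_idx)))
print(truth.shape, obs.shape)
```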

  • Lorenz-96: “hard” case

    I The nonlinear filter is ≈ 25% more accurate in RMSE than EnKF

    Marzouk SIMDA seminar 28 / 43


  • Lorenz-96: “hard” case

    Marzouk SIMDA seminar 29 / 43

  • Lorenz-96: non-Gaussian noise

    I A heavy-tailed noise configuration:

    $$ \frac{dX_j}{dt} = (X_{j+1} - X_{j-2})X_{j-1} - X_j + F, \quad j = 1, \ldots, 40 $$

    $$ Y_j = X_j + E_j, \quad j = 1, 5, 9, 13, \ldots, 37 $$

    I F = 8 (chaotic) and Ej ∼ Laplace(λ = 1)
    I Time between observations: ∆obs = 0.1
    I Results computed over 2000 assimilation cycles, following spin-up

    Marzouk SIMDA seminar 30 / 43

  • Lorenz-96: non-Gaussian noise

    Marzouk SIMDA seminar 31 / 43

  • Lorenz-96: details on the filtering approximation

    [Figure: the local interaction structure referenced below]

    I Impose sparsity of the map with a 5-way interaction model (above)
    I Separable and nonlinear parameterization of each component:

    $$ \hat{S}^k(x_{j_1}, \ldots, x_{j_p}, x_k) = \psi(x_{j_1}) + \cdots + \psi(x_{j_p}) + \tilde{\psi}(x_k), \qquad \psi(x) = a_0 + a_1 x + \sum_{i>1} a_i \exp\!\big(-(x - c_i)^2/\sigma\big) $$

    I More general parameterizations are of course possible

    Marzouk SIMDA seminar 32 / 43
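    A minimal sketch of assembling the features implied by this parameterization; the RBF centers and width below are assumptions, and fitting (with a monotonicity constraint in x_k) would proceed along the lines of the earlier maximum-likelihood estimator.

```python
# Build the design matrix for psi(x) = a0 + a1*x + sum_i a_i exp(-(x - c_i)^2 / sigma),
# so each map component is linear in its coefficients (monotonicity in x_k still
# needs a separate constraint when fitting).
import numpy as np

def psi_features(x, centers=np.linspace(-3.0, 3.0, 7), sigma=1.0):
    # Returns [1, x, exp(-(x - c_1)^2/sigma), ..., exp(-(x - c_K)^2/sigma)] per sample
    x = np.asarray(x)[:, None]
    rbf = np.exp(-(x - centers[None, :])**2 / sigma)
    return np.hstack([np.ones_like(x), x, rbf])

def component_features(X_neighbors, x_k):
    # Separable structure: concatenate psi-features of each neighbor and of x_k
    blocks = [psi_features(X_neighbors[:, j]) for j in range(X_neighbors.shape[1])]
    blocks.append(psi_features(x_k))
    return np.hstack(blocks)

# toy usage: features for a component depending on two neighbors and x_k
rng = np.random.default_rng(7)
Xn, xk = rng.standard_normal((100, 2)), rng.standard_normal(100)
print(component_features(Xn, xk).shape)   # (100, 3 * 9)
```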

  • Lorenz-96: tracking performance of the filter

    I Simple and localized nonlinearities have a significant impact

    Marzouk SIMDA seminar 33 / 43

  • Remarks and questions

    I Nonlinear generalization of the EnKF: move the ensemble members via local nonlinear transport maps, no weights or degeneracy

    I Learn non-Gaussian features via nonlinear continuous transport and convex optimization

    I Choice of map basis and sparsity provide regularization (e.g., localization)

    I In principle, the filter is consistent as S^h_△ is enriched and M → ∞. But what is a good choice of S^h_△ for any fixed ensemble size M?

    I Are there better estimators than maximum likelihood? What are the finite-sample statistical properties of candidate estimators, and properties of the associated optimization problems?

    I How to relate map structure/parameterization to the underlying dynamics, observation operators, and data?

    Marzouk SIMDA seminar 34 / 43


  • Sparsity of triangular maps

    Theorem [Spantini et al. 2018]

    Conditional independence properties of π (encoded by a graph G) define a lower bound on the sparsity of S such that S♯η = π

    [Figure: graph G of π and the corresponding sparsity pattern of S; a state-space model graph and its sparsity pattern of S]

    Main idea: Discover sparse structure using the ATM algorithm

    Marzouk SIMDA seminar 35 / 43

  • Adaptive transport map (ATM) algorithm (Baptista et al. 2020)

    Goal: Approximate map given n i.i.d. samples from π

    Greedy enrichment procedure
    I Look for a sparse expansion f(x) = Σ_{α∈Λ} cα ψα(x)
    I Use tensor-product Hermite functions ψα(x) = ∏_j Pαj(xj) exp(−‖x‖²/2)
    I Add one element to the set of active multi-indices Λt at a time
    I Restrict Λt to be downward closed
    I Search for new features in the reduced margin of Λt

    Marzouk SIMDA seminar 36 / 43

  • Adaptive transport map (ATM) algorithm (Baptista et al. 2020)

    Goal: Approximate map given n i.i.d. samples from π

    Initialize Λ0 = ∅ (i.e., f0 = 0)
    For t = 0, . . . , m:
      1 Find the reduced margin ΛRM_t of Λt
      2 Add a new feature: Λt+1 = Λt ∪ {α∗t+1} for some α∗t+1 ∈ ΛRM_t
      3 Update the approximation: ft+1 = argmin_{f ∈ span(ψΛt+1)} Lk(f)

    In practice, we can choose m (# of features) via cross-validation (a code sketch follows below)

    Marzouk SIMDA seminar 36 / 43
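    A minimal sketch of the greedy loop under simplifying assumptions: a plain least-squares objective in place of the map objective Lk, probabilists' Hermite polynomial features without the Gaussian weight, and a brute-force search over the reduced margin.

```python
# Greedy, downward-closed feature enrichment over a reduced margin (toy version).
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def reduced_margin(active):
    # Multi-indices not in `active` whose backward neighbors are all in `active`
    d = len(next(iter(active)))
    cands = set()
    for a in active:
        for j in range(d):
            b = tuple(a[k] + (k == j) for k in range(d))
            if b not in active:
                cands.add(b)
    def admissible(a):
        return all(a[j] == 0 or tuple(a[k] - (k == j) for k in range(d)) in active
                   for j in range(d))
    return [a for a in cands if admissible(a)]

def feature(X, alpha):
    # Tensor-product Hermite feature: prod_j He_{alpha_j}(x_j)
    cols = [hermeval(X[:, j], [0] * a + [1]) if a > 0 else np.ones(len(X))
            for j, a in enumerate(alpha)]
    return np.prod(cols, axis=0)

def greedy_fit(X, y, n_features):
    active = [(0, 0)]                      # start from the constant feature
    for _ in range(n_features - 1):
        best = None
        for a in reduced_margin(set(active)):
            Phi = np.column_stack([feature(X, b) for b in active + [a]])
            c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.mean((y - Phi @ c) ** 2)
            if best is None or err < best[0]:
                best = (err, a)
        active.append(best[1])             # add the best reduced-margin feature
    return active

rng = np.random.default_rng(8)
X = rng.standard_normal((1000, 2))
y = X[:, 0] + 0.5 * X[:, 1] + 0.2 * X[:, 0] * X[:, 1]   # a sparse polynomial to recover
print(greedy_fit(X, y, n_features=4))      # [(0, 0), (1, 0), (0, 1), (1, 1)]
```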

  • Numerical example: Lorenz-96 data

    I Distribution of the state at a fixed time, starting from a Gaussian initial condition

    [Plots: log-likelihood and number of features m vs. sample size n; sparsity pattern of S for n = 316; conditional independence structure]

    Takeaways:
    I ATM discovers conditional independence structure in the state
    I Natural semi-parametric method that gradually increases m with n

    Marzouk SIMDA seminar 37 / 43

  • Back to conditional maps: link to ABC

    The “analysis” step of the ensemble filtering scheme is an instance of likelihood-free inference or approximate Bayesian computation (ABC):

    I Central idea: only need to simulate from πY,X in order to construct T̂ and thus draw samples from πX|Y=y∗

    I The map T̂ has some remarkable properties. . .

    Marzouk SIMDA seminar 38 / 43


  • Compare two approaches for posterior sampling

    Marzouk SIMDA seminar 39 / 43

  • Compare two approaches for posterior sampling

    $$ X|y^* \sim \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1}_\sharp\, \eta \qquad \text{vs.} \qquad X|y^* \sim \hat{T}_\sharp\, \pi_{Y,X}, \quad \text{for } \hat{T} = \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1} \circ \hat{S}^{\mathcal{X}}(\cdot, \cdot) $$

    I Propagating the joint prior through composed maps has lower error!

    Marzouk SIMDA seminar 39 / 43

  • Composed maps for ABC

    I Simple and incomplete examples:

    I Remove dependence of the mean on y: Ŝ^X(y, x) = x − E[X|Y = y]
      I If E[X|Y = y] ≈ c + βy, we have the EnKF (see the derivation below)

    I Remove dependence of the mean and variance on y: Ŝ^X(y, x) = Var(X|Y = y)−1/2 (x − E[X|Y = y])

    I The transport map framework offers a much more general set of possibilities, which in principle includes the exact transformation

    I Current work: alternative optimization objectives (hence map estimators T̂) and tailored parameterizations

    I Also useful for optimal Bayesian experimental design!

    Marzouk SIMDA seminar 40 / 43
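    To make the first example explicit: with the mean-removal map, the composition T̂ = Ŝ^X(y∗, ·)−1 ◦ Ŝ^X simply swaps conditional means, and a linear regression model for the conditional mean recovers the perturbed-observation EnKF update,

$$ \hat{T}(y, x) = \hat{S}^{\mathcal{X}}(y^*, \cdot)^{-1}\big(\hat{S}^{\mathcal{X}}(y, x)\big) = x - \mathbb{E}[X|Y = y] + \mathbb{E}[X|Y = y^*] \;\approx\; x + \beta\,(y^* - y) \quad \text{if } \mathbb{E}[X|Y = y] \approx c + \beta y, $$

    with β playing the role of the Kalman gain estimated from the joint samples (yi, xi).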


  • SIMDA plans, connections, and discussion

    I Nonlinear ensemble algorithms for filtering, smoothing, and joint state-parameter inference (quite general)

    I Generative modeling with transport maps: will we need to build priors (spatiotemporal statistical models for sea ice thickness) from data?

    I Likelihood-free inference problems:
      I Are there sea ice problems where we will need to extract conditional relationships from large data sets? Conditional density estimation, conditional simulation.

    I High-dimensional and disparate sources of data

    I Are there strong/essential non-Gaussianities in the sea ice problem? Tail behaviors?

    I Explore cost-accuracy tradeoffs: balance posterior fidelity with computational effort

    I Use conditional density estimates in optimal experimental design

    Marzouk SIMDA seminar 41 / 43

  • Conclusions

    Thanks for your attention!

    Marzouk SIMDA seminar 42 / 43

  • References

    I R. Baptista, O. Zahm, Y. Marzouk. “An adaptive transport framework for joint and conditional density estimation.” arXiv:2009.10303, 2020.

    I J. Zech, Y. Marzouk. “Sparse approximation of triangular transports on bounded domains.” arXiv:2006.06994, 2020.

    I N. Kovachki, R. Baptista, B. Hosseini, Y. Marzouk. “Conditional sampling with monotone GANs.” arXiv:2006.06755, 2020.

    I A. Spantini, R. Baptista, Y. Marzouk. “Coupling techniques for nonlinear ensemble filtering.” arXiv:1907.00389, 2020.

    I M. Brennan, D. Bigoni, O. Zahm, A. Spantini, Y. Marzouk. “Greedy inference with structure-exploiting lazy maps.” NeurIPS 2020, arXiv:1906.00031.

    I O. Zahm, T. Cui, K. Law, A. Spantini, Y. Marzouk. “Certified dimension reduction in nonlinear Bayesian inverse problems.” arXiv:1807.03712, 2019.

    I A. Spantini, D. Bigoni, Y. Marzouk. “Inference via low-dimensional couplings.” JMLR 19(66): 1–71, 2018.

    I M. Parno, Y. Marzouk. “Transport map accelerated Markov chain Monte Carlo.” SIAM JUQ 6: 645–682, 2018.

    I R. Morrison, R. Baptista, Y. Marzouk. “Beyond normality: learning sparse probabilistic graphical models in the non-Gaussian setting.” NeurIPS 2017.

    Marzouk SIMDA seminar 43 / 43