
2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17–20, 2018, AALBORG, DENMARK

LEARNING STOCHASTIC DIFFERENTIAL EQUATIONS WITH GAUSSIAN PROCESSES WITHOUT GRADIENT MATCHING

Cagatay Yildiz⋆ Markus Heinonen⋆† Jukka Intosalmi⋆ Henrik Mannerström⋆ Harri Lähdesmäki⋆

⋆Dept. of CS, Aalto University, Finland †Helsinki Inst. of Information Technology HIIT, Finland

ABSTRACT

We introduce a novel paradigm for learning non-parametric drift and diffusion functions for stochastic differential equations (SDEs). The proposed model learns to simulate path distributions that match observations with non-uniform time increments and arbitrary sparseness, in contrast to gradient matching, which does not optimise simulated responses. We formulate sensitivity equations for learning and demonstrate that our general stochastic distribution optimisation leads to robust and efficient learning of SDE systems.

Index Terms— Stochastic differential equations, Gaussian processes

1. INTRODUCTION

Dynamical systems modeling is a cornerstone of experimental sciences. Modelers attempt to capture the dynamical behavior of a stochastic system or phenomenon in order to improve its understanding and make predictions about its future state. Stochastic differential equations (SDEs) are an effective formalism for modelling systems with underlying stochastic dynamics, with a wide range of applications [1]. The key problem in SDEs is estimation of the underlying deterministic driving function and the stochastic diffusion component.

We consider the dynamics of a multivariate system governed by a Markov process x_t described by the SDE

$$dx_t = f(x_t)\,dt + \sigma(x_t)\,dW_t, \qquad (1)$$

where x_t ∈ R^D is the state vector of a D-dimensional dynamical system at continuous time t ∈ R, f(x) ∈ R^D is a deterministic state evolution, and σ(x) ∈ R is a scalar magnitude of the stochastic multivariate Wiener process W_t ∈ R^D. The Wiener process has zero initial state W_0 = 0, and the independent increments W_{t+s} − W_t ∼ N(0, sI) follow a Gaussian with standard deviation √s.

The SDE system (1) transforms states x_t forward in continuous time by the deterministic drift component f, while σ is the magnitude of the random Brownian diffusion W_t that scatters the state x_t with random fluctuations. The state solutions of the SDE are given by the Itô integral [2]

$$x_t = x_0 + \int_0^t f(x_\tau)\,d\tau + \int_0^t \sigma(x_\tau)\,dW_\tau, \qquad (2)$$

where we integrate the system state forward from an initial state x_0 for time t, and where τ is an auxiliary time variable. The only non-deterministic part of the solution (2) is the Brownian motion W_τ, whose random realisations generate path realisations x_{0...t} that induce state distributions p(x|t; f, σ) at time t given the drift f and diffusion σ. SDEs produce continuous but non-smooth trajectories x_{0...t} over time due to the non-differentiable Brownian motion.

We assume that both f(·) and σ(·) are completely unknown and we only observe one or several multivariate time series Y = (y_1, ..., y_N)^T ∈ R^{N×D} obtained from noisy observations at observation times T = (t_1, ..., t_N) ∈ R^N,

$$y_t = x_t + \varepsilon_t, \qquad (3)$$

where ε_t ∼ N(0, Ω) follows a stationary time-invariant zero-mean multivariate Gaussian distribution with diagonal noise variances Ω = diag(ω_1^2, ..., ω_D^2), and the latent states x_t ∼ p(x|t; f, σ) follow the state distribution. The goal of SDE modelling is to learn the drift f and diffusion σ functions such that the process x_t matches the data y_t.

There is a considerable amount of literature on inferring SDEs that have pre-defined parametric drift or diffusion functions for specific applications [1]. There has also been interest in estimating non-parametric SDE drift and diffusion functions from data using the general Bayesian formalism [3, 4]. With linear drift approximations the state distribution turns out to be a Gaussian, which can be solved with a variational smoothing algorithm [5] or by variational mean field approximation [6]. Non-linear drifts and diffusions are predominantly modelled with Gaussian processes [3, 7, 4], which are a family of Bayesian kernel methods [8]. For such models the state distributions are intractable, and hence these methods resort to a family of gradient matching approximations [9, 10, 11], where the drift is estimated to match the empirical gradients of the data, f(y_i) ≈ (y_{i+1} − y_i)/Δt_i, and the diffusion relates to the residual of the approximation [3, 7, 4].


Gradient matching is only applicable to dense observations over time, while additional linearisation [3, 7] is necessary to model sparse observations. Non-parametric estimation of diffusion only applies to dense data [7, 4].

In this paper we propose to infer non-parametric drift and diffusion functions with Gaussian processes for arbitrarily sparse or dense data. We learn the underlying system to induce state distributions with high expected likelihood,

$$p(Y \,|\, f, \sigma, \Omega) = \prod_{i=1}^{N} \mathbb{E}_{p(x|t_i; f, \sigma)}\left[\mathcal{N}(y_i \,|\, x, \Omega)\right]. \qquad (4)$$

The expected likelihood is generally intractable. In contrast to earlier works we do not use gradient matching or other approximate models; instead we directly tackle and optimise the SDE system against the true likelihood (4) by performing full forward simulation. We propose an unbiased stochastic Monte Carlo approximation for the likelihood, for which we derive efficient, tractable gradients. Our approach places no restrictions on the spacing or sparsity of the observations. Our model is denoted NPSDE, and the implementation is publicly available at http://www.github.com/cagatayyildiz/npde.

2. INDUCING GAUSSIAN PROCESS SDE MODEL

In this section we model both the drift f(x) and the diffusion σ(x) as Gaussian processes with a general inducing point parameterisation. The drift function defines a vector field f : R^D → R^D, that is, an assignment of a D-dimensional gradient vector f(x) ∈ R^D to every D-dimensional state x ∈ R^D. We assume that the drift does not depend on time. The diffusion function σ(x) ∈ R is a standard scalar function. We model both functions as Gaussian processes (GPs), which are flexible Bayesian non-linear and non-parametric models [8].

2.1. Drift Gaussian process

The inducing point parameterisation for the drift GP was originally proposed in the context of ordinary differential equation systems [12], which we review here. We assume a zero-mean vector-valued GP prior on the drift function

$$f(x) \sim \mathcal{GP}(0, K_f(x, x')), \qquad (5)$$

which defines a prior distribution over drift values f(x) whose mean and covariance are

$$\mathbb{E}[f(x)] = 0 \qquad (6)$$
$$\mathrm{cov}[f(x), f(x')] = K_f(x, x') \in \mathbb{R}^{D \times D}, \qquad (7)$$

where the kernel K_f(x, x') is matrix-valued [13]. A GP prior defines that for any collection of states X = (x_1, ..., x_N)^T ∈ R^{N×D}, the drift values F = (f(x_1), ..., f(x_N))^T ∈ R^{N×D} follow a matrix-valued normal [13],

$$p(F) = \mathcal{N}(\mathrm{vec}\,F \,|\, 0, K_f(X, X)), \qquad (8)$$

where K_f(X, X) = (K_f(x_i, x_j))_{i,j=1}^N ∈ R^{ND×ND} is a block matrix of matrix-valued kernels K_f(x_i, x_j). The key property of Gaussian processes is that they encode functions where similar states x, x' induce similar drifts f(x), f(x'), and where the state similarity is defined by the kernels K_f(x, x'). Several families of rich matrix-valued kernels exist [14, 15, 16]. In this work we opt for the family of decomposable kernels K(x, x') = k(x, x') · A, where k(x, x') is a Gaussian base kernel

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x_d')^2}{\ell_{fd}^2}\right) \qquad (9)$$

with drift variance σ_f^2 and dimension-specific lengthscales ℓ_{f1}, ..., ℓ_{fD} that determine the smoothness of the drift field, and A ∈ R^{D×D} is a PSD dependency matrix between dimensions. In practice global dependency structures are often unavailable, and the diagonal structure A = I_D is then chosen.

In standard GP regression we would obtain the posterior of the drift by conditioning the GP prior on data [8]. In SDE models the conditional f(x)|Y is intractable due to the integral mapping (2) between observations y_i and drifts f(x). Instead, we augment the Gaussian process with a set of M inducing vectors u_f ∈ R^D at locations z ∈ R^D, such that f(z) = u_f [17]. We interpolate the drift from the inducing points as

$$f(x) \triangleq K(x, Z)\, K(Z, Z)^{-1}\, \mathbf{u}_f, \qquad (10)$$

which supports the drift with inducing locations Z = (z_1, ..., z_M) and inducing vectors U_f = (u_{f1}, ..., u_{fM}), where u_f = vec U_f. This corresponds to a vector-valued kernel function [13], or to a multi-task Gaussian process posterior mean [8]. Due to the universality of the Gaussian kernel [18], we can represent arbitrary drifts with sufficient inducing points.
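To make the interpolation concrete, below is a minimal NumPy sketch of the base kernel (9) and the inducing-point interpolation (10), assuming the diagonal dependency structure A = I_D so that each output dimension uses the scalar base kernel independently; the function names and the jitter term are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gauss_kernel(X1, X2, var=1.0, ell=1.0):
    """Gaussian base kernel k(x, x') of eq. (9); `ell` may be a scalar
    or a length-D vector of dimension-specific lengthscales."""
    d = (X1[:, None, :] - X2[None, :, :]) / ell   # pairwise scaled differences
    return var * np.exp(-0.5 * np.sum(d**2, axis=-1))

def interpolate(x, Z, U, var=1.0, ell=1.0, jitter=1e-6):
    """Inducing-point interpolation K(x,Z) K(Z,Z)^{-1} U of eq. (10),
    assuming A = I_D; Z is (M, D), U is (M, D) for the drift."""
    Kxz = gauss_kernel(np.atleast_2d(x), Z, var, ell)             # (1, M)
    Kzz = gauss_kernel(Z, Z, var, ell) + jitter * np.eye(len(Z))  # (M, M)
    return np.squeeze(Kxz @ np.linalg.solve(Kzz, U), axis=0)      # (D,)
```

The diffusion interpolation (13) below can reuse the same routine with a length-M vector of scalar inducing values in place of U.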

2.2. Diffusion Gaussian process

We represent the diffusion σ(x) as another inducing point GP, similarly to the drift. The diffusion is a scalar function and uses a scalar kernel. The diffusion has a zero-mean GP prior

$$\sigma(x) \sim \mathcal{GP}(0, k_\sigma(x, x')) \qquad (11)$$

that defines the covariance cov[σ(x), σ(x')] = k_σ(x, x') ∈ R with a Gaussian kernel of the form (9), but with diffusion variance σ_σ^2 and lengthscales ℓ_{σd}. Diffusion values σ = (σ(x_1), ..., σ(x_N))^T ∈ R^N at states X then follow the prior

$$p(\sigma) = \mathcal{N}(\sigma \,|\, 0, K_\sigma(X, X)), \qquad (12)$$

where K_σ(X, X) = (k_σ(x_i, x_j))_{i,j=1}^N ∈ R^{N×N}. We interpolate the diffusion from the M inducing locations Z with inducing values u_σ = (u_{σ1}, ..., u_{σM})^T ∈ R^M,

$$\sigma(x) \triangleq K_\sigma(x, Z)\, K_\sigma(Z, Z)^{-1}\, \mathbf{u}_\sigma. \qquad (13)$$

The drift and diffusion naturally share their inducing locations Z.


Fig. 1. (a) A Van der Pol oscillator with local diffusion and drift, (b-c) path samples and distribution, (d) the estimated GP system, (e) three noisy input trajectories for training, (f-g) the estimated path samples that match the true samples.

2.3. Stochastic Monte Carlo inference

The inducing SDE model is determined by the inducing locations Z, the inducing values u_f and u_σ, the observation noise variance σ_n^2, and the kernel parameters σ_σ, ℓ_{σd} and σ_f, ℓ_{fd} of the drift and diffusion kernels. The posterior of the model combines the likelihood p(Y | f, σ, Ω) of (4) and the independent priors p(u_f) and p(u_σ) using Bayes' theorem as

$$p(\mathbf{u}_f, \mathbf{u}_\sigma \,|\, Y) \propto p(\mathbf{u}_f, \mathbf{u}_\sigma)\, p(Y \,|\, \mathbf{u}_f, \mathbf{u}_\sigma) \qquad (14)$$
$$= p(\mathbf{u}_f)\, p(\mathbf{u}_\sigma) \prod_{i=1}^{N} \mathbb{E}_{p(x|t_i; f, \sigma)}\left[\mathcal{N}(y_i \,|\, x, \Omega)\right]$$
$$\approx \mathcal{N}(\mathbf{u}_f \,|\, 0, K_f(Z,Z))\, \mathcal{N}(\mathbf{u}_\sigma \,|\, 0, K_\sigma(Z,Z)) \prod_{i=1}^{N} \frac{1}{N_s} \sum_{s=1}^{N_s} \mathcal{N}(y_i \,|\, x_i^{(s)}, \Omega), \quad x^{(s)} \sim p(x_{0...t} \,|\, \mathbf{u}_f, \mathbf{u}_\sigma, Z), \qquad (15)$$

where the time-dependent state distribution p(x|t; f, σ) ≡ p(x|t; u_f, u_σ, Z) now depends on the inducing parameters. We propose to approximate the true expected likelihood with an unbiased stochastic Monte Carlo average, since we can draw path samples x_t^{(s)} from the state distribution p(x_{0...t} | u_f, u_σ, Z) by sampling the Brownian motion path W_t^{(s)}. The stochastic likelihood estimate with N_s samples turns out to be a kernel density estimator with Gaussian bases.

Fig. 2. (a) True state distribution from the model in Figure 1(a-b), (b,d) state distribution approximations with different numbers of path samples, (c) approximation errors.
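As a sketch of how this estimator can be evaluated, the hypothetical helper below (not from the paper) computes the log of the Monte Carlo likelihood term in (15) for diagonal Ω, given pre-simulated sample paths; a log-mean-exp trick is used for numerical stability.

```python
import numpy as np

def mc_log_likelihood(Y, obs_steps, paths, omega2):
    """Estimate sum_i log( (1/Ns) sum_s N(y_i | x_i^(s), Omega) ) of eq. (15).
    `paths` has shape (Ns, NT+1, D), `obs_steps` maps each observation y_i to
    its Euler-Maruyama time index, and Omega = diag(omega2)."""
    ll = 0.0
    for y_i, t_i in zip(Y, obs_steps):
        # log-densities of the diagonal Gaussians N(y_i | x, Omega), per sample
        resid2 = (paths[:, t_i, :] - y_i)**2 / omega2
        logdens = -0.5 * (resid2.sum(axis=1)
                          + np.sum(np.log(2 * np.pi * omega2)))
        m = logdens.max()                       # log-mean-exp over samples
        ll += m + np.log(np.mean(np.exp(logdens - m)))
    return ll
```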

We draw the sample paths using the Euler-Maruyama (EM) method for approximating the solution of the SDE (2) [2]:

$$x_{i+1}^{(s)} = x_i^{(s)} + f(x_i^{(s)})\,\Delta t + \sigma(x_i^{(s)})\,\Delta W_i^{(s)}, \qquad (16)$$

where we discretise time into N_T subintervals t_0, t_1, ..., t_{N_T} of width Δt = t_{N_T}/N_T, and sample the Wiener coefficients as ΔW_i^{(s)} ∼ N(0, Δt · I) with standard deviation √Δt. We set x_0^{(s)} to the initial observation and use (16) to compute the state path x^{(s)} ≡ (x_0^{(s)}, x_1^{(s)}, ..., x_{N_T}^{(s)}). The number of time steps N_T > N is typically higher than the number of observed timepoints to achieve sufficient path resolution.
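A minimal sketch of the Euler-Maruyama sampler (16) follows; `drift_fn` and `diff_fn` would be the interpolants (10) and (13), and the function and argument names are illustrative rather than the authors' code.

```python
import numpy as np

def euler_maruyama(x0, drift_fn, diff_fn, dt, n_steps, n_samples, rng):
    """Sample Ns state paths of eq. (16):
    x_{i+1} = x_i + f(x_i) dt + sigma(x_i) dW_i,  with dW_i ~ N(0, dt I)."""
    D = x0.shape[-1]
    paths = np.zeros((n_samples, n_steps + 1, D))
    paths[:, 0] = x0                                   # shared initial state
    for i in range(n_steps):
        x = paths[:, i]                                # (Ns, D)
        dW = rng.normal(0.0, np.sqrt(dt), size=(n_samples, D))
        f = np.stack([drift_fn(xs) for xs in x])       # drift, (Ns, D)
        s = np.array([diff_fn(xs) for xs in x]).reshape(-1, 1)  # scalar sigma
        paths[:, i + 1] = x + f * dt + s * dW
    return paths
```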

We find maximum a posteriori (MAP) estimates of u_f, u_σ, Ω by gradient ascent, while choosing the lengthscales ℓ_f, ℓ_σ from a grid, keeping the inducing locations Z fixed on a dense grid (see Figure 1(d)) and setting σ_f = σ_σ = 1. In practice, placing the inducing locations on a grid is a robust choice in 2D or 3D systems, noting that they can also be optimised at increased computational complexity [19].

2.4. Computing stochastic gradients

The gradient of the expectation of the log-likelihood (15) is

$$\frac{d}{du} \sum_{i=1}^{N} \log \frac{1}{N_s} \sum_{s=1}^{N_s} \mathcal{N}(y_i \,|\, x_i^{(s)}, \Omega) \qquad (17)$$
$$= \sum_{i=1}^{N} \frac{\sum_{s=1}^{N_s} \frac{\partial \mathcal{N}(y_i | x_i^{(s)}, \Omega)}{\partial x} \frac{dx_i^{(s)}}{du}}{\sum_{s=1}^{N_s} \mathcal{N}(y_i \,|\, x_i^{(s)}, \Omega)}, \qquad (18)$$


where the sample paths x_t^{(s)} are from equation (16). The last term dx_i^{(s)}/du is the cumulative derivative of the state x_i^{(s)} of sample s at time t_i with respect to the parameters u ≜ (u_f, u_σ). The gradients of the piecewise Euler-Maruyama paths x_i^{(s)} are:

$$\frac{dx_{i+1}^{(s)}}{du} = \frac{dx_i^{(s)}}{du} + \frac{df(x_i^{(s)})}{du}\,\Delta t + \frac{d\sigma(x_i^{(s)})}{du}\,\Delta W_i^{(s)}$$
$$= \frac{dx_i^{(s)}}{du} + \left(\frac{\partial f(x_i^{(s)})}{\partial x}\frac{dx_i^{(s)}}{du_f} + \frac{\partial f(x_i^{(s)})}{\partial u_f}\right)\Delta t + \left(\frac{\partial \sigma(x_i^{(s)})}{\partial x}\frac{dx_i^{(s)}}{du_\sigma} + \frac{\partial \sigma(x_i^{(s)})}{\partial u_\sigma}\right)\Delta W_i^{(s)}. \qquad (19)$$

The derivatives dx_i^{(s)}/du are constructed iteratively over time starting from dx_0^{(s)}/du = 0 with a fixed initial state x_0^{(s)}, and the four partial derivatives are gradients of the kernel functions (10) and (13) with respect to u_f and u_σ, respectively:

$$\frac{\partial f(x)}{\partial x} = \frac{\partial K_f(x, Z)}{\partial x} K_f(Z, Z)^{-1} \mathbf{u}_f \qquad (20)$$
$$\frac{\partial f(x)}{\partial u_f} = K_f(x, Z)\, K_f(Z, Z)^{-1} \qquad (21)$$
$$\frac{\partial \sigma(x)}{\partial x} = \frac{\partial K_\sigma(x, Z)}{\partial x} K_\sigma(Z, Z)^{-1} \mathbf{u}_\sigma \qquad (22)$$
$$\frac{\partial \sigma(x)}{\partial u_\sigma} = K_\sigma(x, Z)\, K_\sigma(Z, Z)^{-1}. \qquad (23)$$

These gradients are related to the sensitivity equations [20, 21] derived previously for non-parametric ODEs [12]. The gradients (19) can be computed in practice together with the sample paths (16) during the Euler-Maruyama iteration, so the computation of the gradients has the same computational complexity as the numerical simulator. The iterative gradients are superior to finite difference approximations since we have an exact formulation of the gradients, albeit of the approximate Euler-Maruyama paths, which can be solved to arbitrary numerical accuracy by tuning the Δt discretisation.
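The following 1D sketch illustrates how eq. (19) can be propagated jointly with eq. (16). For brevity it assumes a constant, non-learned diffusion so that only the drift sensitivity dx/du_f is tracked (the ΔW term of (19) drops out); all names are illustrative.

```python
import numpy as np

def em_with_sensitivity(x0, Z, u_f, sigma, var, ell, dt, n_steps, rng):
    """Propagate a 1D state x and its sensitivity dx/du_f through the
    Euler-Maruyama iteration, using eqs. (16), (19), (20) and (21)."""
    M = len(Z)
    Kzz = var * np.exp(-0.5 * (Z[:, None] - Z[None, :])**2 / ell**2)
    Kzz_inv = np.linalg.inv(Kzz + 1e-6 * np.eye(M))
    x, dx_du = x0, np.zeros(M)        # dx/du_f starts at zero (fixed x0)
    for _ in range(n_steps):
        k = var * np.exp(-0.5 * (x - Z)**2 / ell**2)  # K(x, Z), shape (M,)
        dk = -(x - Z) / ell**2 * k                    # dK(x, Z)/dx
        df_du = k @ Kzz_inv                           # eq. (21)
        f = df_du @ u_f                               # drift of eq. (10)
        df_dx = (dk @ Kzz_inv) @ u_f                  # eq. (20)
        dx_du = dx_du + (df_dx * dx_du + df_du) * dt  # drift part of eq. (19)
        x = x + f * dt + sigma * rng.normal(0.0, np.sqrt(dt))  # eq. (16)
    return x, dx_du
```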

3. EXPERIMENTS

In order to illustrate the performance of our model, we conduct several experiments on synthetic data as well as real-world data sets. In all experiments the inducing vectors are initialized by gradient matching and then fully optimised. We use the EM method to simulate the state distributions and compute stochastic gradients over N_s = 50 samples. We use the L-BFGS algorithm to compute the MAP estimates.

3.1. Double well

We first consider the double well system, where the drift is given by f(x) = 4(x − x^3) with constant diffusion σ(x) = 1.5. We generate 6 random noisy input trajectories, each with 250 observed data points. The true dynamics and the observed data points are illustrated in Figure 3(a). We fit our NPSDE model with M = 15 inducing points located uniformly within [−5, 5]. We accurately approximate the true drift (Figure 3(b)), and learn a diffusion estimate of 1.39.

Fig. 3. (a) Double well system, (b) estimated drift.
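Under illustrative assumptions (the step size, horizon and observation noise level are not specified in the text), the training data of this experiment could be generated with the `euler_maruyama` sketch from Section 2.3:

```python
import numpy as np

# Double-well data generation sketch for Sec. 3.1; dt, horizon and the
# 0.1 observation noise are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
f_dw = lambda x: 4 * (x - x**3)      # true drift
s_dw = lambda x: 1.5                 # true constant diffusion
paths = euler_maruyama(np.zeros(1), f_dw, s_dw,
                       dt=0.01, n_steps=250, n_samples=6, rng=rng)
Y = paths[:, 1:, 0] + rng.normal(0.0, 0.1, size=(6, 250))  # 6 noisy series
```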

3.2. Simple oscillating dynamics

Next, we investigate how the quality of the fit changes with the amount of data used for training. We consider the 2D synthetic system in [7], whose drift equations are given by

$$f(x)_1 = x_1(1 - x_1^2 - x_2^2) - x_2, \qquad f(x)_2 = x_2(1 - x_1^2 - x_2^2) + x_1,$$

and whose diffusion is σ(x) = 2 N(x | [−1, −1], 0.5 I) + 0.3. The state-dependent diffusion acts as a hotspot of increased path scatter, and provides interesting and challenging dynamics to infer. We generate six data batches from the true dynamics using the EM method with step size Δt = 0.005, and observe every 100th state corrupted by Gaussian noise with variance σ_n^2 = 0.1^2. The data batches contain 1, 5, 10, 25, 50 and 100 input trajectories, each having 25 data points. We repeat the experiments 50 times and report the average error.

Fig. 4. (a) Cumulative discrepancy in the state distributions over time and over the number of observed trajectories in the synthetic 2D model, (b) drift and diffusion estimation errors.
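For reference, the drift and diffusion of this system transcribe directly into code (a sketch; the function names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def f_2d(x):
    """Drift of the 2D oscillator of Sec. 3.2."""
    x1, x2 = x
    r = 1.0 - x1**2 - x2**2
    return np.array([x1 * r - x2, x2 * r + x1])

def sigma_2d(x):
    """State-dependent diffusion: a Gaussian hotspot at (-1,-1) on a 0.3 base."""
    return 2.0 * multivariate_normal.pdf(x, mean=[-1.0, -1.0],
                                         cov=0.5 * np.eye(2)) + 0.3
```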

The left plot in Figure 4 illustrates the cumulative discrepancy in the state distributions over time, and the right plot shows the error between the true and estimated drift/diffusion. Unsurprisingly, the discrepancy in both plots decreases when more training data is used. We also observe that the performance gain is insignificant beyond 50 input trajectories.

3.3. Ice core data

As another showcase of our model, we consider the NGRIP ice core dataset [22], which contains records of isotopic oxygen δ18O concentrations within glacial ice cores. The record is used to explore climatic changes that date back to the last glacial period. During that period, the North Atlantic region underwent abrupt climate changes known as Dansgaard-Oeschger (DO) events. The events are characterized by a sudden increase in temperature followed by a gradual cooling phase. Following [23], we consider N = 2000 timepoints from the time span of 60000 to 20000 years before present, in which 16 DO events have been identified.

Figure 5(a) illustrates the highly variable data. We observe a repeating pattern of DO events: a sudden increase followed by a slower settlement phase. Panel 5(b) shows the estimated drift, which pushes the oxygen signal down above state −41 and up at smaller states, matching the data. Interestingly, the diffusion in 5(c) is highly peaked between −42 and −43, which accurately identifies the regime of DO events. The model has learned to explain the DO events with high diffusion.

Fig. 5. (a) The ice core data, (b) estimated drift, (c) the highly state-dependent diffusion.

3.4. Human motion dataset

We finally demonstrate our approach on human motion capture data. Our goal is twofold: to estimate a single drift function that captures the dynamics of the walking sequences of several people, and to explain the discrepancies among sequences via diffusion. The input trajectories are the same as in [24]: four walking sequences, each from a different person. We also follow the preprocessing method described in [24], which results in a simplified skeleton consisting of 50-dimensional pose configurations. All records are mean-centered and downsampled by a factor of two.

Inference is performed in a three-dimensional space onto which the input sequences are projected using PCA. We place the inducing points on a 5 × 5 × 5 grid and set the lengthscale of both the drift and diffusion processes to 0.5. The 3D data set, the inferred drift, and the density of the sample paths are visualized in Figure 6. We conclude that our model is capable of inferring drift and diffusion functions that match arbitrary data.

Fig. 6. (a) Shared drift estimate learned from walking data of 4 subjects, (b) estimated sample paths, (c) density plots of the sample paths. The 4 observed trajectories are shown as black lines in (b-c), with red circles denoting the initial state.

4. DISCUSSION

We propose an approach for learning non-parametric drift and diffusion functions of stochastic differential equation (SDE) systems such that the resulting simulated state distributions match data. Our approach can learn arbitrary dynamics due to the flexible inducing Gaussian process formulation. We propose a stochastic estimate of the simulated state distributions and an efficient scheme for computing their gradients. Our approach does not place any restrictions on the sparsity or density of the observation data. We leave the learning of time-varying drifts and diffusions as interesting future work.

Acknowledgements. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, and the Academy of Finland grants no. 260403, 299915, 275537, 311584.

5. REFERENCES

[1] R. Friedrich, J. Peinke, M. Sahimi, and R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Physics Reports, vol. 506, pp. 87–162, 2011.

[2] B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, Springer, 6th edition, 2014.

[3] A. Ruttor, P. Batz, and M. Opper, “Approximate Gaussian process inference for the drift function in stochastic differential equations,” in Advances in Neural Information Processing Systems, 2013, pp. 2040–2048.

[4] C. García, A. Otero, P. Félix, J. Presedo, and D. Márquez, “Nonparametric estimation of stochastic differential equations with sparse Gaussian processes,” Physical Review E, vol. 96, no. 2, p. 022104, 2017.

[5] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor, “Gaussian process approximations of stochastic differential equations,” in Gaussian Processes in Practice, 2007.


[6] M. Vrettas, M. Opper, and D. Cornford, “Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations,” Physical Review E, vol. 91, p. 012148, 2015.

[7] P. Batz, A. Ruttor, and M. Opper, “Approximate Bayes learning of stochastic differential equations,” arXiv:1702.05390, 2017.

[8] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.

[9] J. M. Varah, “A spline least squares method for numerical parameter estimation in differential equations,” SIAM Journal on Scientific and Statistical Computing, vol. 3, pp. 28–46, 1982.

[10] S. Ellner, Y. Seifu, and R. Smith, “Fitting population dynamic models to time-series data by gradient matching,” Ecology, vol. 83, pp. 2256–2270, 2002.

[11] J. Ramsay, G. Hooker, D. Campbell, and J. Cao, “Parameter estimation for differential equations: a generalized smoothing approach,” Journal of the Royal Statistical Society B, vol. 69, pp. 741–796, 2007.

[12] M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki, “Learning unknown ODE models with Gaussian processes,” in Proceedings of the 35th International Conference on Machine Learning, 2018.

[13] M. Alvarez, L. Rosasco, and N. Lawrence, “Kernels for vector-valued functions: A review,” Foundations and Trends in Machine Learning, 2012.

[14] N. Wahlström, M. Kok, and T. Schön, “Modeling magnetic fields using Gaussian processes,” IEEE ICASSP, 2013.

[15] I. Macêdo and R. Castro, “Learning divergence-free and curl-free vector fields with matrix-valued kernels,” Instituto Nacional de Matemática Pura e Aplicada, 2008.

[16] C. Micchelli and M. Pontil, “On learning vector-valued functions,” Neural Computation, 2005.

[17] J. Quiñonero-Candela and C.E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.

[18] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

[19] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big data,” in Uncertainty in Artificial Intelligence. AUAI Press, 2013, pp. 282–290.

[20] P. Kokotovic and J. Heller, “Direct and adjoint sensitivity equations for parameter optimization,” IEEE Transactions on Automatic Control, vol. 12, pp. 609–610, 1967.

[21] F. Fröhlich, B. Kaltenbacher, F. Theis, and J. Hasenauer, “Scalable parameter estimation for genome-scale biochemical reaction networks,” PLOS Computational Biology, vol. 13, pp. 1–18, 2017.

[22] K. Andersen, N. Azuma, J.-M. Barnola, M. Bigler, P. Biscaye, N. Caillon, J. Chappellaz, H. Clausen, et al., “High-resolution record of northern hemisphere climate extending into the last interglacial period,” Nature, vol. 431, p. 147, 2004.

[23] F. Kwasniok, “Analysis and modelling of glacial climate transitions using simple dynamical systems,” Philosophical Transactions of the Royal Society A, vol. 371, p. 20110472, 2013.


[24] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 283–298, 2008.