
Page 1

Adaptive Kernels for Gaussian Process Regression

Charlotte L. Haley (CLH), C.J. Geoga (CJG), M. Anitescu (MA)

Scientific Machine Learning and Uncertainty Quantification
Los Angeles, California

June 2018

Page 2

Probabilistic machine learning (ML) & uncertainty quantification (UQ)

The goal of UQ is either to

• learn the response surface which propagates parameter uncertainty to model output (forward model), or

• calibrate or evaluate model error (inverse problem) given input data.

Probabilistic ML hopes to

• smooth, filter, or forecast

given a model learned from the data and the relevant posterior distribution.

GP models provide an analytically tractable Bayesian framework in which the prior information about the response surface is encoded in the covariance function, facilitating the quantification of the prediction uncertainty.

Page 3

Data-driven model using satellite data¹

A predictive model for solar insolation, trained on the CONUS scan of the GOES-13 satellite data, used

• clear-sky index data,

• a reduced representation via factor analysis, and

• a recursive Gaussian process model with a 2-step delay and exponential covariance

→ discrete forecast & uncertainty.

The spatial prediction was better than a model trained using ground observations alone.

[Figure: 400 × 400 km grid centered at Lamont, OK; 1 km spatial resolution, 30-minute intervals.]

¹ Bilionis et al., “Data-driven model [...],” Solar Energy 110 (2014): 22–38.

Page 4

Wind Speed Extremes: Motivation

Goals

• To characterize wind speed at fine spatial and temporal scales. (Relevance: airline safety standards.)

• Patterns of extremes of 2D wind speed fields: when does a 2D field cross arbitrary thresholds? (Relevance: climate downscaling.¹)

• Extreme phenomena can’t be everywhere simultaneously. (Local, anisotropic, nonstationary.)

¹ Wagenbrenner et al., “Downscaling surface wind ...,” Atmos. Chem. Phys. 16.8 (2016): 5229–5241.

Page 5

Doppler LIDAR (DL) Instruments
Atmospheric Radiation Measurement (ARM) Climate Facility

[Figure: vectors pointing upward flow southerly.]

• Vertical velocity: N = 329, dx = 30 m, T = 2 y, dt = 1.2 s.

• Mean horizontal winds:² dt = 15 min (VAD).

• Related: albedo, heat flux, CO2, etc., at SGP C1.

² R. K. Newsom, Doppler LIDAR Handbook, 2012.

Page 6

Background: Parametric Gaussian Process (GP) Modeling

Let Z ∼ GP{0, Σ(Θ)}, where Z ∈ Rⁿ is a random vector, Σ(Θ) is an n × n positive definite matrix, and Θ ∈ Rᵖ are the parameters.

Maximum likelihood model fitting: the marginal log-likelihood is

    log p(Z | Θ) = −(1/2) Zᵀ Σ(Θ)⁻¹ Z − (1/2) log |Σ(Θ)| − (n/2) log 2π.

Let Σ_j(Θ) = ∂Σ(Θ)/∂Θ_j. The score equations are

    0 = (1/2) Zᵀ Σ(Θ)⁻¹ Σ_j(Θ) Σ(Θ)⁻¹ Z − (1/2) tr{Σ(Θ)⁻¹ Σ_j(Θ)},   j = 1, …, p.

• The solution Θ̂ of the score equations is the ML estimate.

• Normally, a Cholesky factorization of the matrix Σ is required, which costs O(n³) operations.
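As a concrete sketch of the exact O(n³) route (a minimal NumPy illustration, not the authors' code), the negative log-likelihood and one score component can be evaluated through a single Cholesky factorization:

```python
import numpy as np

def gp_neg_log_likelihood(Z, Sigma):
    """Exact negative log-likelihood -log p(Z | Theta) for Z ~ N(0, Sigma).

    Uses Sigma = L L^T, so solves and log|Sigma| cost O(n^3) once.
    """
    n = Z.shape[0]
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Z))   # Sigma^{-1} Z
    logdet = 2.0 * np.sum(np.log(np.diag(L)))             # log|Sigma|
    return 0.5 * Z @ alpha + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)

def score_component(Z, Sigma, Sigma_j):
    """One score equation: (1/2) Z^T S^{-1} S_j S^{-1} Z - (1/2) tr(S^{-1} S_j)."""
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Z))
    trace_term = np.trace(np.linalg.solve(L.T, np.linalg.solve(L, Sigma_j)))
    return 0.5 * alpha @ Sigma_j @ alpha - 0.5 * trace_term
```

Setting each score component to zero and solving for Θ recovers the ML estimate; the score is exactly the derivative of the log-likelihood, which can be checked against a finite difference.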

Page 7

Fast Gaussian Process (GP) model fitting

Maximum likelihood fit for a GP: approximate the gradient of the negative log-likelihood by

    ∇l(Θ)_j = (1/(2N)) Σ_{k=1}^{N} u_kᵀ W⁻¹ Σ_j(Θ) W⁻ᵀ u_k  −  (1/2) Zᵀ Σ̃(Θ)⁻¹ Σ_j(Θ) Σ̃(Θ)⁻¹ Z,

where the first term is a stochastic trace approximation³ over random probe vectors u_k, the second term is exact,⁴ and Σ̃(Θ) is the HODLR approximation of Σ(Θ).

1. The factorization accelerates solves and log-determinants:

    Σ̃(Θ) = W Wᵀ,   W = L ∏_{j=1}^{l} (I + U_j V_jᵀ),

   where each factor I + U_j V_jᵀ is a low-rank update.

2. Approximate the off-diagonal blocks of Σ_j(Θ) with the Nyström approximation.

³ Stein, Chen, and Anitescu, “Score functions for GPs,” 2013.
⁴ Geoga, Anitescu, Stein, preprint, 2018.
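The stochastic trace term above is the Hutchinson estimator, which needs the matrix only through matrix-vector products. A minimal, library-free sketch (the Rademacher probes and probe count are illustrative choices, not the authors' settings):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_probes=100, rng=None):
    """Hutchinson estimate of tr(A) for a matrix seen only through
    matrix-vector products: E[u^T A u] = tr(A) whenever E[u u^T] = I.

    Rademacher probes (entries +/-1) give low estimator variance.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_probes):
        u = rng.choice([-1.0, 1.0], size=n)
        total += u @ matvec(u)
    return total / num_probes
```

In the gradient above, `matvec` would compose a HODLR solve with a product by Σ_j(Θ), so that each probe costs far less than a dense factorization.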

Page 8

Motivation

• The choice of kernel profoundly affects the performance of a Gaussian process, much as the choice of architecture, activation functions, and learning rate affects the performance of a neural network.

• Bochner’s theorem (the Wiener–Khinchin theorem) allows us to equivalently model the spectral matrix of a stationary process in lieu of its covariance kernel.

• Fast Fourier transform (FFT) based algorithms for matrix operations allow extremely fast convolution, superposition, and inversion of covariance matrices when regularly sampled data are available.

• We have employed GP models on space-time data with regular temporal monitoring using a spectral-in-time approach.

Page 9

Bochner’s Theorem & the equivalent spectral problem

Theorem. A complex-valued function k on Rᴾ is the covariance function of a weakly stationary, mean-square continuous, complex-valued random process on Rᴾ if and only if it can be represented as

    k(τ) = ∫_{Rᴾ} e^{2πi sᵀτ} φ(ds),

where φ is a positive finite measure. If φ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals.

The squared exponential (SE) kernel has the form

    k_SE(x − x′) = exp(−‖x − x′‖² / (2ℓ²)),

with spectral density

    S_SE(s) = (2πℓ²)^{P/2} exp(−2π²ℓ²s²).
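As a quick numerical sanity check on this duality (a standalone sketch; the grids and length scale are arbitrary), the 1-D SE kernel (P = 1) is recovered by quadrature of its spectral density against e^{2πisτ}:

```python
import numpy as np

ell = 0.3                                    # SE length scale
tau = np.linspace(-1.0, 1.0, 9)              # lags at which to compare
k_se = np.exp(-tau**2 / (2 * ell**2))        # SE kernel, 1-D

# Fourier dual from Bochner's theorem: S(s) = (2 pi ell^2)^{1/2} exp(-2 pi^2 ell^2 s^2)
s = np.linspace(-20.0, 20.0, 4001)           # frequency grid; S is negligible beyond
S = np.sqrt(2 * np.pi * ell**2) * np.exp(-2 * np.pi**2 * ell**2 * s**2)

# k(tau) = int S(s) e^{2 pi i s tau} ds; S is even, so only the cosine part survives
ds = s[1] - s[0]
k_back = (S[None, :] * np.cos(2 * np.pi * s[None, :] * tau[:, None])).sum(axis=1) * ds
```

The quadrature reproduces k_SE essentially to machine precision; note the 2π² in the exponent of S, which is what makes the pair self-consistent under the e^{2πisᵀτ} convention above.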

Page 10

Fourier Analysis - Univariate Case

For a stationary, zero-mean time series x(t) with unit sampling interval,

    x(t) = ∫_{−1/2}^{1/2} X(f) e^{i2πft} df,

pairing time t (s) with frequency f (Hz). Fourier dual pairs include the sequences and their transforms, v_n^(k) ↔ U^(k)(f), and the autocovariance and the spectrum, R(τ) ↔ S(f).

[Figure: a stationary, zero-mean time series x(t).]

Page 11

Why Spectrum?

The power spectrum decomposes the variance of the time series in terms of frequency. The spectrum reveals dynamic features of the process such as

• oscillatory components,⁵

• turbulent structures,

• linear transfer functions.

There is no unbiased estimator for the autocovariance.

Spectra are more easily interpretable than covariances.

⁵ FDR: Thomson & CLH (2014), “Spacing and shape [...],” Proc. R. Soc. A, 470(2167), 20140101.

Page 12

Adaptive Kernels of Wilson and Adams

Wilson and Adams modeled GP covariance functions via spectral densities that are scale-location mixtures of Gaussians.

[Figure: extrapolation on the Mauna Loa CO2 data.] Compared with the Matérn, squared exponential (dashed), rational quadratic, and periodic kernels, which all fail dramatically here, the adaptive spectral method (the adaptive SE-mixture kernel) gives a hundredfold reduction in MSE when trained and tested on the Mauna Loa CO2 data above.
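The Wilson–Adams construction has a closed form in 1-D: a symmetrized Gaussian-mixture spectral density with weights w_q, means μ_q, and variances v_q Fourier-transforms to a sum of Gaussian-windowed cosines. A small sketch (the parameter values in the usage below are arbitrary):

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """1-D spectral mixture kernel of Wilson & Adams.

    The spectral density is a symmetrized scale-location mixture of
    Gaussians; its Fourier dual is
        k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q).
    """
    tau = np.asarray(tau)[..., None]         # broadcast lags against Q components
    return np.sum(weights * np.exp(-2 * np.pi**2 * tau**2 * variances)
                  * np.cos(2 * np.pi * tau * means), axis=-1)
```

Because the spectral density is nonnegative by construction, any mixture of this form yields a valid (positive semidefinite) stationary kernel, which is what makes the family attractive for kernel learning.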

Page 13

Research Questions

Important research questions we posit:

• finding flexible and complete representations of kernel functions for use in kernel identification;

• a related issue: determining new bias and variance error estimates to guide the adjustment of kernel estimates toward the best bias/variance balance.

A possible way forward builds on the Wilson and Adams work and involves functional representations of the spectrum.

For economy of basis functions, the spectrum may be expanded in terms of either a wavelet analysis or a dynamic adjustment of bandwidth in regions where the spectrum is poorly resolved.

Page 14

Challenges

• The “global” nature of spectral methods: Fourier-transformed sequences smear local information, which may produce dense representations.

• When the spectrum is truly sparse, an expansion in spectral atoms is convenient; alternatively, one can use other basis functions to represent the density.

• The linchpin will be determining the distance between the estimated nonparametric spectrum and the truth.

• Alternatively, one might estimate location and scale parameters for a self-similar representation of the spectrum using, e.g., Haar wavelets, and complete a full multiresolution analysis.

• Process nonstationarity and the need to reproduce bifurcations may favor the wavelet approach.

• Selection of basis functions, independence and multiple-testing issues, and the curse of dimensionality.

Page 15

Bias-Variance Tradeoff in Spectrum Estimation

Spectrum estimation seeks to estimate a function on a regular frequency grid using a finite sample.

The Thomson multitaper method⁶ uses a set of concentrated, doubly orthogonal sequences on a prespecified narrow bandwidth. Bandwidth selection controls the bias-variance tradeoff. Multitaper estimates are the MLE for the spectrum when it is a linear combination of step functions with “wide” treads.

⁶ CLH & MA (2017), MT Optimal Bandwidth, IEEE Signal Proc. Lett.
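A minimal multitaper estimator along these lines, using SciPy's DPSS (Slepian) windows; the time-bandwidth product NW and taper count K below are illustrative defaults, not the slide's choices:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(x, NW=4.0, K=7):
    """Thomson multitaper estimate: average of K eigenspectra computed
    with discrete prolate spheroidal (Slepian) tapers of half-bandwidth
    W = NW/N.

    Averaging over orthogonal tapers trades bias (smoothing over 2W)
    against variance (roughly chi-squared with 2K degrees of freedom).
    """
    N = len(x)
    tapers = dpss(N, NW, Kmax=K)                      # shape (K, N), orthonormal
    eigenspectra = np.abs(np.fft.rfft(tapers * x[None, :], axis=1))**2
    freqs = np.fft.rfftfreq(N)                        # cycles per sample
    return freqs, eigenspectra.mean(axis=0)
```

Increasing NW widens the smoothing bandwidth (more bias, more usable tapers, less variance); this is exactly the knob the bandwidth-selection work above tunes.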

Page 16

Novelty

In the optimal selection problem described here, little has been done beyond using tensor-product kernels of simple form adjusted by cross-validation.

For complex data streams, such as wind or fluid flow, such kernels are unlikely to be close to optimal. In that sense, adaptive, nonparametric kernel selection is an area that has not been widely explored.

Given the high-resolution data streams in scientific use, correlation between samples and nonstationarity of the samples are likely to be major features of these applications, yet very little has been done on them at a foundational level.

Page 17

Exploratory Frequency-Domain Analysis: Notation, etc.

Coherence analysis for zero-mean, jointly stationary time series Z(x, t) from two heights x, x′:

    k(x, x′, τ) = E{Z(x, t) Z(x′, t + τ)}                      (covariance)

    S(x, x′, ω) = ∫ k(x, x′, τ) e^{−2πiωτ} dτ                  (cross-spectrum)

    C_ω(x, x′) = S(x, x′, ω) / √(S(x, x, ω) S(x′, x′, ω))      (coherency)

• The magnitude-squared coherence, |C_ω(x, x′)|², partitions the covariance between two time series in terms of frequency.

• We estimate the above using multitaper methods.⁷

⁷ Thomson, “Spectrum Estimation and Harmonic Analysis,” Proc. IEEE, 1982.
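For a rough numerical illustration of magnitude-squared coherence (using SciPy's Welch-averaged estimator as a simple stand-in for the multitaper version described above; the signal construction is an arbitrary example):

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(0)
n = 1 << 13                                       # 8192 samples
x = rng.standard_normal(n)

# y shares x's low-frequency content (moving-average filtered) plus independent noise
y = np.convolve(x, np.ones(8) / 8, mode="same") + 0.5 * rng.standard_normal(n)

f, Cxy = coherence(x, y, fs=1.0, nperseg=256)     # magnitude-squared coherence

# Coherence is high where the shared low-pass signal dominates the noise,
# and falls toward the estimator's bias floor at high frequency
low = Cxy[f < 0.05].mean()
high = Cxy[f > 0.4].mean()
```

This is the frequency-resolved partition of shared variance the slide describes: the two series are strongly related at low frequencies and essentially unrelated at high ones.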

Page 18

A quick look at the wind spectra

[Figure: wind spectra. Oklahoma time is UTC minus 5 hours.]

Page 19

Closer look at the Spectra: Diurnal Cycle & Height

[Figure: spectra (with 95% CIs, available because of the multitaper method) from approximately 90 m and 150 m in height, at 12 UTC and 9 UTC on June 2, 2016, at the SGP C1 ARM site. Time series are first-differenced. Nighttime spectra have more power in the 0.02–0.1 Hz band. Duration: 11 min; sampling: 1.2 s/sample.]

This motivates the use of a cyclostationary covariance.

Page 20

Cross-Spectra & Bandwidth

[Figure: log10 cross-spectrum hourly averages at 0:00 UTC (left) and 21:00 UTC (right) between range gates 0 and 5 (NW = 4, W = 0.0061), together with the gate-0 spectra at 12:00 UTC and 9:00 UTC. Data are first-differenced before use.]

The cross-spectral pairs would be identical if the data were spatially separable; this motivates a frequency-dependent spatial covariance.

Page 21

Spectral-in-time Covariance Function

Using Bochner’s theorem for a zero-mean, stationary, isotropic process Z(x, t), write the positive definite⁸ covariance function

    k(x − y, s − t) = ∫_R h(x − y, ω) e^{i(s−t)ω} dω,

where

    h(x − y, ω) = S(ω) C_ω(x − y) e^{iΘ(ω)uᵀ(x−y)}.

1. S(ω): a (marginal) temporal spectral density for any one location.

2. C_ω(x): a frequency-dependent spatial correlation function (introduces nonseparability), interpreted as the coherence at frequency ω between two series with spatial separation x.

3. Θ(ω)uᵀx: the phase of the coherence, with u a unit vector.

⁸ Stein, “[...] Regular Monitoring Data,” JRSS Series B, 2005.

Page 22

Spectral-in-time Covariance Function

Modifying the above for a spatially nonstationary, temporally stationary process gives

    k(x, y, s − t) = ∫_R √(S_{τ(x)}(ω) S_{τ(y)}(ω)) C_ω(x, y) e^{iω(s−t)} dω,

where

1. C_ω is a frequency-dependent nonstationary coherency, and

2. S_τ is a spectral density that introduces spatial dependence. For now, we use a modified Matérn,

       S_{τ(x)}(ω) = φ (α² τ₁(x) + ω²)^{−ν−1/2} + η τ₂(x),

   where η τ₂(x) is a noise term (the ordinary Matérn uses τ₁(x) := 1, τ₂(x) := 0).
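A toy sketch of this construction (not the fitted model: the Gaussian coherency form is borrowed from the covariance-model slide that follows, and the parameter values and the τ₁, τ₂ modulation functions below are hypothetical), building k(x, y, s − t) by direct quadrature:

```python
import numpy as np

def modified_matern_S(omega, tau1, tau2, phi=1.0, alpha=1.0, nu=0.5, eta=0.1):
    """Modified Matern marginal spectral density from the slide:
    S(omega) = phi * (alpha^2 * tau1 + omega^2)^(-nu - 1/2) + eta * tau2."""
    return phi * (alpha**2 * tau1 + omega**2) ** (-nu - 0.5) + eta * tau2

def coherency(omega, x, y, sigma_c=2.0, sigma_w=3.0):
    """Gaussian frequency-dependent coherency C_omega(x, y)."""
    phi_w = np.exp(-(omega / sigma_w) ** 2)
    return np.exp(-((y - x) / (sigma_c * np.sqrt(phi_w))) ** 2)

def k_cov(x, y, lag, tau1=lambda x: 1.0 + 0.5 * x**2, tau2=lambda x: 0.0):
    """Quadrature for k(x, y, s - t). The integrand is even in omega,
    so only the cosine part of e^{i omega lag} contributes."""
    omega = np.linspace(-8.0, 8.0, 4001)          # truncated frequency grid
    Sx = modified_matern_S(omega, tau1(x), tau2(x))
    Sy = modified_matern_S(omega, tau1(y), tau2(y))
    integrand = np.sqrt(Sx * Sy) * coherency(omega, x, y) * np.cos(omega * lag)
    return integrand.sum() * (omega[1] - omega[0])
```

The geometric mean √(S_x S_y) makes the marginal variance vary with location through τ₁ and τ₂, while C_ω damps spatially separated pairs more strongly at high frequency, which is exactly the nonseparability seen in the cross-spectra.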

Page 23

Example fits for the marginal S_τ at 15 UTC

[Figure: fitted marginal spectra.] This uses one set of modified Matérn choices, fit by nonlinear least squares (NLS).

Page 24

Covariance model (continued)

C_ω is a frequency-dependent nonstationary coherency:

    C_ω(x, y) := exp{−[(y − x) / (σ_c √φ(ω))]²},   φ(ω) := exp{−(ω/σ_ω)²}.

Remarks

• Evaluation of Σ is relatively expensive (∼ 10⁻⁵ seconds for each call).

• Computing this full matrix once for N = 1000 will take about 1.5 minutes.

• So even if we had a really nice likelihood surface and a fast gradient, this would hurt to compute.

• Making use of the Hutchinson (stochastic) trace approximation and HODLR matrices to approximate the likelihood will speed up the computation (CJG, MA, MLS).

Page 25

Summary

UQ and probabilistic ML have GPs in common.

Faster model fitting at large sample sizes via efficient computational means:

• HODLR,

• stochastic trace estimation,

• spectral methods for regularly sampled data.

Specific examples:

• A spectral-in-time model for space-time wind measurements.

• Adaptive kernel estimation for CO2 (R&W).

• Use of further-refined, interpretable basis functions for adaptive estimation of the spectrum.