Noname manuscript No.
(will be inserted by the editor)
Data-driven prediction strategies for low-frequency patterns of North
Pacific climate variability
Darin Comeau · Zhizhen Zhao · Dimitrios Giannakis · Andrew J. Majda
Received: date / Accepted: date

D. Comeau
Center for Atmosphere Ocean Science, Courant Institute of Mathematical Sciences, New York University, New York, NY.
E-mail: [email protected]
Abstract The North Pacific exhibits patterns of low-frequency variability on the intra-annual to decadal time scales, which manifest themselves in both model data and the observational record, and prediction of such low-frequency modes of variability is of great interest to the community. While parametric models, such as stationary and non-stationary autoregressive models, possibly including external factors, may perform well in a data-fitting setting, they may perform poorly in a prediction setting. Ensemble analog forecasting, which relies on the historical record to provide estimates of the future based on past trajectories of those states similar to the initial state of interest, provides a promising, nonparametric approach to forecasting that makes no assumptions on the underlying dynamics or its statistics. We apply such forecasting to low-frequency modes of variability for the North Pacific sea surface temperature and sea ice concentration fields extracted through Nonlinear Laplacian Spectral Analysis. We find such methods may outperform parametric methods and simple persistence with increased predictive skill.
1 Introduction
Predictability in general circulation models (GCMs) for the North Pacific, from seasonal to decadal time scales, has been the subject of many recent studies (e.g., Tietsche et al (2014), Blanchard-Wrigglesworth and Bitz (2014), Blanchard-Wrigglesworth et al (2011a)), several of which focus on the role of the initial state (e.g., Collins (2002), Blanchard-Wrigglesworth et al (2011b), Branstator et al (2012), Day et al (2014)). The North Pacific exhibits prominent examples of interannual to decadal variability, such as the Pacific Decadal Oscillation (PDO; Mantua et al, 1997), and the more rapidly decorrelating North Pacific Gyre Oscillation (NPGO; Di Lorenzo et al, 2008), both of which have been the subject of much interest. An important phenomenon of intra-annual variability in the North Pacific is the reemergence of anomalies in both the sea surface temperature (SST) field (Alexander et al, 1999) and the sea ice concentration (SIC) field (Blanchard-Wrigglesworth et al, 2011a), where regional anomalies in these state variables vanish over a season and reappear several months later, as made evident by high time-lagged correlations.
The North Pacific (along with the North Atlantic) is a region of relatively strong low-frequency variability in the global climate system (Branstator et al, 2012). Yet it has been shown that this region exhibits a relative lack of predictability in GCMs (less than a decade; Collins, 2002), with the Community Climate System Model (CCSM) having particularly weak persistence and low predictability in the North Pacific among similar GCMs (Branstator et al, 2012). The ocean and sea ice systems show stronger low-frequency variability than the atmosphere (Newman et al, 2003). The internal variability exhibited in the North Pacific has also been characterized as being in distinct climate regimes (e.g. Overland et al (2006, 2008)), where the dynamics exhibit regime transitions and metastability. As a result, cluster-based methods have been a popular approach to model regime behavior in climate settings, such as in Franzke et al (2008, 2009), where hidden Markov models were used to model atmospheric flows.
A traditional modeling approach for the PDO has been to fit an autoregressive process (Hasselmann, 1976; Frankignoul and Hasselmann, 1977), as well as models externally forced from the tropics through ENSO (Newman et al, 2003). In Giannakis and Majda (2012b), autoregressive models were successful in predicting temporal patterns corresponding to the PDO and NPGO when forced with suitably modulated intermittent modes. Additional flexibility can be built into regression models by allowing nonstationary, i.e. time-dependent, coefficients; a successful example of this is the finite element method (FEM) clustering procedure combined with the multivariate autoregressive factor (VARX) model framework (Horenko, 2010a,b). In this approach, the data is partitioned into a predetermined number of clusters, and model regression coefficients are estimated on each cluster, together with a cluster affiliation function that indicates which model is used at a particular time. This can further be adapted to account for external factors (Horenko, 2011a). While FEM-VARX methods can be effective at fitting the desired data, advancing the system state into the future for a prediction depends on being able to successfully predict the unknown cluster affiliation function. Methods for using this framework in a prediction setting have been used in Horenko (2011a,b). Another regression modeling approach was put forward in Majda and Harlim (2013), where physics constraints were imposed on multilevel nonlinear regression models, preventing ad hoc, finite-time blow-up, a pathological behavior previously shown to exist in such models without physics constraints in Majda and Yuan (2012). Appropriate nonlinear regression models using this strategy have been shown to have high skill for predicting the intermittent cloud patterns of tropical intra-seasonal variability (Chen et al, 2014; Chen and Majda, 2015b,a).
Parametric models may perform well for fitting, but often have poor performance in a prediction setting, particularly in systems that exhibit distinct dynamical regimes. Nonparametric models can be advantageous in systems where the underlying dynamical system is unknown or imperfect. An early example of this is an analog forecast, first introduced by Lorenz (1969), where one considers the historical record and makes predictions based on examining state variable trajectories in the past that are similar to the current state. Analog forecasting does not make any assumptions on the underlying dynamics, and cleverly avoids model error when the underlying model is observations from nature. This has since been applied to other climate prediction scenarios, such as the Southern Oscillation Index (Drosdowsky, 1994), the Indian summer monsoon (Xavier and Goswami, 2007), and wind forecasting (Alessandrini et al, 2015), where it was found to be particularly useful in forecasting rare events.
Key to the success of an analog forecasting method is the ability to identify a good historical analog to the current initial state. In climate applications, the choice of analog is usually determined by minimizing the Euclidean distance between snapshots of system states, with a single analog being selected (Lorenz, 1969; Branstator et al, 2012). In Zhao and Giannakis (2014), analog forecasting was extended in two key ways. First, the state vectors considered were in Takens lagged embedding space (Takens, 1981), which captures some of the dynamics of the system, rather than a snapshot in time. Second, instead of selecting a single analog determined by Euclidean distance, weighted sums of analogs were considered, where the weights are determined by a kernel function. In this context, a kernel is an exponentially decaying pairwise similarity measure, intuitively playing the role of a local covariance matrix. In Zhao and Giannakis (2014), kernels were introduced in the context of Nonlinear Laplacian Spectral Analysis (NLSA; Giannakis and Majda, 2012c, 2013, 2014) for decomposing high-dimensional spatiotemporal data, leading naturally to a class of low-frequency observables for prediction through kernel eigenfunctions. In Bushuk et al (2014), the kernel used in the NLSA algorithm was adapted to be multivariate, allowing for multiple variables, possibly of different physical units, to be considered jointly in the analysis.
The aim of this study is to develop low-dimensional, data-driven models for prediction of dominant low-frequency climate variability patterns in the North Pacific, combining the approaches laid out in Zhao and Giannakis (2014) and Bushuk et al (2014). We invoke a prediction approach that is purely statistical, making use only of the available historical record, possibly corrupted by model error, and the initial state itself. The approach utilizes out-of-sample extension methods (Coifman and Lafon, 2006b; Rabin and Coifman, 2012) to define our target observable beyond a designated training period. There are some key differences between this study and that of Zhao and Giannakis (2014). First, we consider multivariate data, including SIC, which is intrinsically noisier than SST, given that SIC is a thresholded variable and experiences high variability in the marginal ice zone. We also make use of observational data, which is a relatively short time series compared to GCM model data. In addition to considering kernel eigenfunctions as our target observable, which we will see are well-suited for prediction, we also consider integrated sea ice anomalies, a more challenging observable uninfluenced by the data analysis algorithm. We compare the performance of these kernel ensemble analog forecasting techniques to some parametric forecast models (specifically, autoregressive and FEM-VARX models), and find the former to have higher predictive skill than the latter, which often cannot even outperform the simple persistence forecast.
The rest of the paper is outlined as follows. In Section 2, we discuss the mathematical methods used to perform kernel ensemble analog forecasting predictions, and also discuss alternative parametric forecasting methods. In Section 3 we describe the data sets used for our experiments, and in Section 4 we present our results. Discussion and concluding remarks are in Section 5.
2 Methods
Our forecasting approach is motivated by using kernels, a pairwise measure of similarity for the state vectors of interest. A simple example of such an object is a Gaussian kernel:

$$w(x_i, x_j) = e^{-\|x_i - x_j\|^2/\sigma_0}, \qquad (1)$$
where $x$ is a state variable of interest, and $\sigma_0$ is a scale parameter that controls the locality of the kernel. The kernels are then used to generate a weighted ensemble of analog forecasts, making use of an available historical record, rather than relying on a single analog. To extend the observable we wish to predict beyond the historical period into the future, we make use of out-of-sample extension techniques. These techniques allow one to extend a function (observable) to a new point $y$ by looking at values of the function at known points $x_i$ close to $y$, where closeness is determined by the kernels. An important part of any prediction problem is to choose a target observable that is both physically meaningful and exhibits high predictability. As demonstrated in Zhao and Giannakis (2014), defining a kernel on the desired space naturally leads to a preferred class of observables that exhibit time-scale separation and good predictability.
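As a concrete illustration of Equation (1), the following minimal sketch computes a Gaussian kernel matrix; the data matrix `X` and the scale `sigma0` are hypothetical inputs chosen for illustration, not quantities from the paper.

```python
import numpy as np

def gaussian_kernel(X, sigma0):
    """Pairwise Gaussian kernel of Equation (1):
    w(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma0).

    X : (n, m) array whose rows are state vectors; sigma0 : scale parameter.
    """
    # Squared Euclidean distances between all pairs of rows of X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma0)
```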
2.1 Nonlinear Laplacian spectral analysis
Nonlinear dynamical systems generally give rise to datasets with low-dimensional nonlinear geometric structures (e.g., attractors). We therefore turn to a data analysis technique, NLSA, that allows us to extract spatio-temporal patterns from data from a high-dimensional non-linear dynamical system, such as a coupled global climate system (Giannakis and Majda, 2012a,c, 2013, 2014). The standard NLSA algorithm is a nonlinear manifold generalization of singular spectrum analysis (SSA) (Ghil et al, 2002), where the covariance operator is replaced by a discrete Laplace-Beltrami operator to account for the non-linear geometry of the underlying data manifold. The eigenfunctions of this operator then form a convenient orthonormal basis on the data manifold. A key advantage of NLSA is that no pre-processing of the data, such as band-pass filtering or seasonal partitioning, is needed.
2.1.1 Time-lagged embedding
An important first step in the NLSA algorithm is to perform a time-lagged embedding of the spatio-temporal data as a method of inducing time-scale separation in the extracted modes. An analog forecast method is driven by the initial data, and rather than using a snapshot in time of the state as our initial condition, time-lagged embedding incorporates some of the system's dynamics into the selection of an analog and helps make the data more Markovian.
Let $z(t_i) \in \mathbb{R}^d$ be a time series sampled uniformly with time step $\delta t$, on a grid of size $d$, with $i = 1, \ldots, N$ samples. We construct a lag-embedding of the data set with a window of length $q$, and consider the lag-embedded time series

$$x(t_i) = \left(z(t_i), z(t_{i-1}), \ldots, z(t_{i-(q-1)})\right) \in \mathbb{R}^{dq}.$$
The data now lies in $\mathbb{R}^m$, with $m = dq$ the dimension of the lagged embedded space, and $n = N - q + 1$ the number of samples in lagged embedded space (also called Takens embedding space, or delay coordinate space). It has been shown that time-lagged embedding recovers the topology of the attractor of the underlying dynamical system that has been lost through partial observations (Takens, 1981; Sauer et al, 1991). In particular, the embedding affects the non-linear geometry of the underlying data manifold $\mathcal{M}$ in such a way as to allow for dynamically stable patterns with time scale separation (Berry et al, 2013), a desirable property that will lead to observables with high predictability.
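A minimal sketch of this lag embedding, assuming the snapshots $z(t_i)$ are stored as rows of an $(N, d)$ array; the function name and memory layout are illustrative choices, not from the paper.

```python
import numpy as np

def lag_embed(z, q):
    """Map z(t_i) in R^d to x(t_i) = (z(t_i), ..., z(t_{i-(q-1)})) in R^{dq}.

    z : (N, d) array of snapshots; returns an (N - q + 1, d*q) array.
    """
    N, d = z.shape
    n = N - q + 1
    x = np.empty((n, d * q))
    for i in range(n):
        # Most recent snapshot first, matching the convention in the text.
        x[i] = z[i:i + q][::-1].reshape(-1)
    return x
```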
2.1.2 Discrete Laplacian
The next step is to define a kernel on the data manifold $\mathcal{M}$. Rather than using a simple Gaussian as in Equation (1), the NLSA kernel makes use of phase velocities $\xi_i = \|x(t_i) - x(t_{i-1})\|$, which form a vector field on the data manifold and provide additional important dynamic information. The NLSA kernel we use is

$$\mathcal{K}(x(t_i), x(t_j)) = \exp\left(-\frac{\|x(t_i) - x(t_j)\|^2}{\epsilon\, \xi_i\, \xi_j}\right). \qquad (2)$$

With this kernel $\mathcal{K}$ and associated matrix $K_{ij} = \mathcal{K}(x(t_i), x(t_j))$, we solve the Laplacian eigenvalue problem to acquire an eigenfunction basis $\{\phi_i\}$ on the data manifold $\mathcal{M}$. To do this, we construct the discrete graph Laplacian by following the diffusion maps approach of Coifman and Lafon (2006a), forming the matrices

$$Q_i = \sum_{j=1}^n K_{ij}, \quad \tilde{K}_{ij} = \frac{K_{ij}}{Q_i^{\alpha} Q_j^{\alpha}}, \quad D_i = \sum_{j=1}^n \tilde{K}_{ij}, \quad P_{ij} = \frac{\tilde{K}_{ij}}{D_i}, \quad L_{ij} = I - P_{ij}.$$

Here $\alpha$ is a real parameter, typically with value $0$, $\tfrac{1}{2}$, or $1$. We note that in the large data limit, as $n \to \infty$ and $\epsilon \to 0$, this discrete Laplacian converges to the Laplace-Beltrami operator on $\mathcal{M}$ for a Riemannian metric that depends on the kernel (Coifman and Lafon, 2006a). We can therefore think of the kernel as biasing the geometry of the data to reveal a class of features, and the NLSA kernel does this in such a way as to extract dynamically stable modes with time scale separation. We then solve the eigenvalue problem

$$L\phi_i = \lambda_i \phi_i. \qquad (3)$$
The resulting Laplacian eigenfunctions $\phi_i = (\phi_{1i}, \ldots, \phi_{ni})^T$ form an orthonormal basis on the data manifold $\mathcal{M}$ with respect to the weighted inner product

$$\langle \phi_i, \phi_j \rangle = \sum_{k=1}^n D_k\, \phi_{ki}\, \phi_{kj} = \delta_{ij}.$$
As well as forming a convenient orthonormal basis, the eigenfunctions $\phi_i$ give us a natural class of observables with good time-scale separation and high predictability. These eigenfunctions $\phi_i$ are time series, nonlinear analogs to principal components, and can be used to recreate spatio-temporal modes, similar to extended empirical orthogonal functions (EOFs) (Giannakis and Majda, 2012b,c, 2013). However, unlike EOFs, the eigenfunctions $\phi_i$ do not measure variance, but rather measure oscillations or roughness in the abstract space $\mathcal{M}$, the underlying data manifold. The eigenvalues $\lambda_i$ measure the Dirichlet energy of the corresponding eigenfunctions, which has the interpretation of being squared wavenumbers on this manifold (Giannakis and Majda, 2014). We now use the leading low-frequency kernel eigenfunctions as our target observables for prediction.
2.1.3 Multiple components
As described in Bushuk et al (2014), the above NLSA algorithm can be modified to incorporate more than one time series. Let $z^{(1)}, z^{(2)}$ be two signals, sampled uniformly with time step $\delta t$, on (possibly different) $d_1, d_2$ grid points. After lag-embedding each variable with embedding windows $q_1, q_2$ into its appropriate embedding space, so $z^{(1)}(t_i) \in \mathbb{R}^{d_1} \mapsto x^{(1)}(t_i) \in \mathbb{R}^{d_1 q_1}$ and $z^{(2)}(t_i) \in \mathbb{R}^{d_2} \mapsto x^{(2)}(t_i) \in \mathbb{R}^{d_2 q_2}$, we construct the kernel function $\mathcal{K}$ by scaling the physical variables $x^{(1)}, x^{(2)}$ to be dimensionless:

$$K_{ij} = \exp\left(-\frac{\|x^{(1)}(t_i) - x^{(1)}(t_j)\|^2}{\epsilon\, \xi_i^{(1)} \xi_j^{(1)}} - \frac{\|x^{(2)}(t_i) - x^{(2)}(t_j)\|^2}{\epsilon\, \xi_i^{(2)} \xi_j^{(2)}}\right). \qquad (4)$$
This can then be extended to any number of variables, regardless of physical units, and allows for analysis of coupled systems, such as the ocean and sea ice components of a climate model. An alternative approach to Equation (4) for extending the NLSA kernel to multiple variables with different physical units is to first normalize each variable to unit variance. However, a drawback of this approach is that information about relative variability is lost as the ratios of the variances are set equal, rather than letting the dynamics of the system control the variance ratios of the different variables (Bushuk et al, 2014) as in Equation (4).
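A sketch of the two-component kernel in Equation (4); the array names are illustrative. The division by the phase velocities makes each term dimensionless, which is what allows variables with different physical units to be combined.

```python
import numpy as np

def multivariate_nlsa_kernel(x1, x2, eps):
    """Two-component NLSA kernel of Equation (4).

    x1, x2 : (n, m1) and (n, m2) lag-embedded time series; eps : scale epsilon.
    """
    def scaled_sq_dists(x):
        # Phase velocities xi_i = ||x(t_i) - x(t_{i-1})||; the first sample
        # has no predecessor and is dropped.
        xi = np.linalg.norm(np.diff(x, axis=0), axis=1)
        x = x[1:]
        sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        return sq / (eps * np.outer(xi, xi))

    return np.exp(-scaled_sq_dists(x1) - scaled_sq_dists(x2))
```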
2.2 Out-of-sample extension
Now that we have established our class of target observables, namely the eigenfunctions $\phi_i$ in Equation (3), we need a method for extending these observables into the future to form our predictions, for which we draw upon out-of-sample extension techniques. To be precise, let $f$ be a function defined on a set $M = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^m$; $f$ may be vector-valued, but in our case is scalar. We wish to make a prediction of $f$ by extending the function to be defined on a point outside of the training set $M$, by performing an out-of-sample extension, which we call $\bar{f}$. There are some desirable qualities we wish to have in such an extension, namely that $\bar{f}$ is in some way well-behaved and smooth on our space, and is consistent as the number of in-samples increases. Below we discuss two such methods of out-of-sample extension: geometric harmonics, based on the Nyström method, and Laplacian pyramids.
2.2.1 Geometric harmonics
The first approach for out-of-sample extension is based on the Nyström method (Nyström, 1930), recently adapted to machine learning applications (Coifman and Lafon, 2006b), and represents a function $f$ on $M$ in terms of an eigenfunction basis obtained from a spectral decomposition of a kernel. While we use the NLSA kernel (Equation 2), other kernels could be used, another natural choice being a Gaussian kernel. For a general kernel $w : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$, consider its row-sum normalized counterpart:

$$W(y_i, x_j) = \frac{w(y_i, x_j)}{\sum_j w(y_i, x_j)}.$$

These kernels have the convenient interpretation of forming discrete probability distributions in the second argument, dependent on the first argument, so $W(y, x) = p_y(x)$, which will be a useful perspective later in our analog forecasting in Section 2.3. We then solve the eigenvalue problem

$$\lambda_l\, \varphi_l(x_i) = \sum_{j=1}^n W(x_i, x_j)\, \varphi_l(x_j).$$

We note the spectral decomposition of $W$ yields a set of real eigenvalues $\lambda_l$ and an orthonormal set of eigenfunctions $\varphi_l$ that form a basis for $L^2(M)$ (Coifman and Lafon, 2006b), and we can thus represent our function $f$ in terms of this basis:

$$f(x_i) = \sum_{j=1}^n \langle \varphi_j, f \rangle\, \varphi_j(x_i). \qquad (5)$$

Let $y \in \mathbb{R}^m$, $y \notin M$ be an out-of-sample, or test data, point to which we wish to extend the function $f$. If $\lambda_l \neq 0$, the eigenfunction $\varphi_l$ can be extended to any $y \in \mathbb{R}^m$ by

$$\bar{\varphi}_l(y) = \frac{1}{\lambda_l} \sum_{j=1}^n W(y, x_j)\, \varphi_l(x_j). \qquad (6)$$

This definition ensures that the out-of-sample extension is consistent when restricted to $M$, meaning $\bar{\varphi}_l(y) = \varphi_l(y)$ for $y \in M$. Combining Equations (5) and (6) allows $\bar{f}$ to be assigned for any $y \in \mathbb{R}^m$ by evaluating the eigenfunctions $\bar{\varphi}_l$ at $y$ and using the projection of $f$ onto these eigenfunctions as weights:

$$\bar{f}(y) = \sum_{j=1}^n \langle \varphi_j, f \rangle\, \bar{\varphi}_j(y). \qquad (7)$$
Equation (7) is called the Nyström extension, and in Coifman and Lafon (2006b) the extended eigenfunctions in Equation (6) are called geometric harmonics. We note this scheme becomes ill-conditioned since $\lambda_l \to 0$ as $l \to \infty$ (Coifman and Lafon, 2006b), so in practice the sum is truncated at some level $l$, usually determined by the decay of the eigenvalues. With the interpretation of the eigenvalues $\lambda$ as wavenumbers on the underlying data manifold, this truncation represents removing features from the data that are highly oscillatory in this space.
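The Nyström extension of Equations (6) and (7) reduces to a few matrix products. The sketch below makes the simplifying assumption that the training eigenfunctions `phi` are orthonormal columns under the standard inner product, with eigenvalues `lam`, and that `W_out` holds the row-normalized kernel weights $W(y, x_j)$ between test and training points; all names are illustrative.

```python
import numpy as np

def nystrom_extend(f, phi, lam, W_out):
    """Geometric harmonics extension of an observable f to test points.

    f : (n,) observable on the training set; phi : (n, l) leading
    eigenfunctions; lam : (l,) nonzero eigenvalues; W_out : (n_test, n)
    kernel weights W(y, x_j).
    """
    phi_out = (W_out @ phi) / lam   # Equation (6): extended eigenfunctions
    coeffs = phi.T @ f              # projections <phi_j, f>
    return phi_out @ coeffs         # Equation (7): Nystrom extension
```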
2.2.2 Laplacian pyramid
The geometric harmonics method is well suited for observables that have a tight bandwidth in the eigenfunction basis (particularly the eigenfunctions themselves), but for observables that require high levels of eigenfunctions in their representation, the above-mentioned ill-conditioning may hamper this method. An alternative to geometric harmonics is the Laplacian pyramid (Rabin and Coifman, 2012), which invokes a multiscale decomposition of the original function $f$ in its out-of-sample extension approach.
A family of kernels defined at different scales is needed, and for clarity of exposition we will use a family of Gaussian kernels $w_l$ (and their row-normalized counterparts $W_l$) at scales $l$:

$$w_l(x_i, x_j) = e^{-\|x_i - x_j\|^2/(\sigma_0/2^l)}. \qquad (8)$$
That is, $l = 0$ represents the widest kernel width, and increasing $l$ gives finer kernels resolving more localized structures.
For a function $f : M \to \mathbb{R}$, the Laplacian pyramid representation approximates $f$ in a multiscale manner by $f \approx s_0 + s_1 + s_2 + \cdots$, where the first level $s_0$ is defined by

$$s_0(x_k) = \sum_{i=1}^n W_0(x_i, x_k)\, f(x_i),$$

and we then evaluate the difference

$$d_1 = f - s_0.$$

We then iteratively define the $l$th level decomposition $s_l$:

$$s_l(x_k) = \sum_{i=1}^n W_l(x_i, x_k)\, d_l(x_i), \qquad d_l = f - \sum_{i=0}^{l-1} s_i.$$
Iteration is continued until some prescribed error tolerance $\|f - \sum_k s_k\| < \varepsilon$ is met.
Next we extend $f$ to a new point $y \in \mathbb{R}^m$, $y \notin M$ by

$$s_0(y) = \sum_{i=1}^n W_0(x_i, y)\, f(x_i), \qquad s_l(y) = \sum_{i=1}^n W_l(x_i, y)\, d_l(x_i)$$

for $l \geq 1$, and assign $\bar{f}$ the value

$$\bar{f}(y) = \sum_k s_k(y). \qquad (9)$$
That is, we have formed a multiscale representation of $f$ using weighted averages of $f$ for nearby inputs, where the weights are given by the scale of the kernel function. Since the kernel function can accept any inputs from $\mathbb{R}^m$, we can define these weights for points outside $M$, and thus define $\bar{f}$ by using weighted values of $f$ on $M$ (known), where now the weights are given by the proximity of the out-of-sample point $y \notin M$ to the input points $x_i \in M$. The parameter choices of the initial scale $\sigma_0$ and error tolerance $\varepsilon$ set the scale and cut-off of the dyadic decomposition of $f$ in the Laplacian pyramid scheme. We choose $\sigma_0$ to be the median of the pairwise distances of our training data, and the error tolerance $\varepsilon$ to be scaled by the norm of the observable over the training data, for example $10^{-6}\|f\|$. In our applications below, we use a multiscale family of NLSA kernels based on Equation (4) rather than the above family of Gaussian kernels.
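A sketch of the Laplacian pyramid extension (Equations (8) and (9)), using the Gaussian kernel family for simplicity and a fixed number of levels in place of the error-tolerance stopping rule; all names and the weighted-average interpretation of the sums are illustrative assumptions.

```python
import numpy as np

def laplacian_pyramid_extend(X_train, f, Y_test, sigma0, levels=5):
    """Multiscale out-of-sample extension of f from X_train to Y_test."""
    def W(A, B, scale):
        # Row-normalized Gaussian kernel at the given scale (Equation 8).
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        w = np.exp(-sq / scale)
        return w / w.sum(axis=1, keepdims=True)

    f_bar = np.zeros(len(Y_test))
    d = f.copy()                                # d_0 = f
    for l in range(levels):
        scale = sigma0 / 2 ** l
        f_bar += W(Y_test, X_train, scale) @ d  # s_l extended to the test points
        d = d - W(X_train, X_train, scale) @ d  # d_{l+1} = f - (s_0 + ... + s_l)
    return f_bar                                # Equation (9)
```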
2.3 Kernel ensemble analog forecasting
The core idea of traditional analog forecasting is to identify a suitable analog to one's current initial state from a historical record, and then make a prediction based on the trajectory of that analog in the historical record. The analog forecasting approach laid out in Zhao and Giannakis (2014), which we use here, varies in a few important regards. First, the initial system state, as well as the historical record (training data), is in Takens embedding space, so that an analog is not determined by a snapshot in time alone, but by the current state with some history (a 'video'). Second, rather than using Euclidean distance, as in traditional analog forecasting, the distances we use are based on a defined kernel function, which reflects a non-Euclidean geometry on the underlying data manifold. The choice of geometry is influenced by the lagged embedding and choice of kernel, which we have done in a way that gives us time scale separation in the resulting eigenfunctions $\phi$, yielding high predictability. Third, rather than identify and use a single analog in the historical record, weighted sums of analogs are used, with weights determined by the kernel function.
For this last point, it is useful to view analog forecasting in the context of an expectation over an empirical probability distribution. For example, a traditional analog forecast of a new initial state $y$ at some time $\tau$ in the future, based on a selected analog $x_j = x(t_j)$, can be written as

$$\bar{f}(y, \tau) = \mathbb{E}_{p_y} S_\tau f = \sum_{i=1}^n p_y(x_i)\, f(x(t_i + \tau)) = f(x(t_j + \tau)),$$

where $p_y(x_i) = \delta_{ij}$ is the Dirac delta function, and $S_\tau f(x_i) = f(x(t_i + \tau))$ is the operator that shifts the timestamp of $x_i$ by $\tau$. To make use of more than one analog and move to an ensemble, we let $p_y$ be a more general discrete empirical distribution, dependent on the initial condition $y$, with probabilities (weights) determined by our kernel function. Written this way, we simply need to define the empirical probability distribution $p_y$ to form our ensemble analog forecast.
For the geometric harmonics method, we projected $f$ onto an eigenfunction basis $\phi_i$ (truncated at some level $l$) and performed an out-of-sample extension on each eigenfunction basis function, i.e.

$$\bar{f}(y) = \sum_{i=1}^l \langle \phi_i, f \rangle\, \bar{\phi}_i(y),$$

or, written in terms of an expectation over an empirical probability measure,

$$\bar{f}(y) = \mathbb{E}_{p_y} f = \sum_{i=1}^l \langle \phi_i, f \rangle\, \mathbb{E}_{p_y} \phi_i, \qquad \text{where} \quad \mathbb{E}_{p_y} \phi_i = \frac{1}{\lambda_i} \sum_{j=1}^n W(y, x_j)\, \phi_i(x(t_j)).$$

We then define our prediction for lead time $\tau$ via geometric harmonics by

$$\bar{f}(y, \tau) = \mathbb{E}_{p_y} S_\tau f = \sum_{i=1}^l \langle \phi_i, f \rangle\, \mathbb{E}_{p_y} S_\tau \phi_i, \qquad \text{where} \quad \mathbb{E}_{p_y} S_\tau \phi_i = \frac{1}{\lambda_i} \sum_{j=1}^n W(y, x_j)\, \phi_i(x(t_j + \tau)).$$
Similarly, the $\tau$-shifted ensemble analog forecast via Laplacian pyramids is

$$\bar{f}(y, \tau) = \mathbb{E}_{p_{y,0}} S_\tau f + \sum_{i=1}^l \mathbb{E}_{p_{y,i}} S_\tau d_i,$$

where $p_{y,i}(x) = W_i(y, x)$ corresponds to the probability distribution from the kernel at scale $i$.
Thus we have a method for forming a weighted ensemble of predictions that is non-parametric and data-driven through the use of a historical record (training data), which itself has been subject to analysis that reflects the dynamics of the high-dimensional system in the non-linear geometry of the underlying abstract data manifold $\mathcal{M}$, and which produces a natural preferred class of observables to target for prediction through the kernel eigenfunctions.
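Putting the pieces together, a kernel ensemble analog forecast at lead time $\tau$ is a weighted average of time-shifted training values of the observable. The sketch below assumes the weights $p_y(x_i)$ have already been computed from the kernel at the test point, and truncates the record so that shifted indices stay in sample; both choices are illustrative.

```python
import numpy as np

def ensemble_analog_forecast(f_train, p_y, tau):
    """E_{p_y} S_tau f = sum_i p_y(x_i) f(x(t_i + tau)).

    f_train : (n,) observable on the training set; p_y : (n,) kernel
    weights for the test point y; tau : lead time in samples.
    """
    shifted = f_train[tau:]                # S_tau f(x_i) = f(x(t_i + tau))
    w = p_y[:len(shifted)]                 # drop analogs whose future leaves the record
    return np.sum(w * shifted) / np.sum(w) # renormalize after truncation
```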
2.4 Autoregressive modeling
We wish to compare our ensemble analog prediction methods to more traditional parametric methods, namely autoregressive models of North Pacific variability (Frankignoul and Hasselmann, 1977), of the form

$$x(t+1) = \mu(t) + A(t)\, x(t) + \sigma(t)\, \varepsilon(t),$$
where $x(t)$ is our signal, $\mu(t)$ is the external forcing (possibly 0), $A(t)$ is the autoregressive term, and $\sigma(t)$ is the noise term, each to be estimated from the training data, and $\varepsilon(t)$ is a Gaussian process. In the stationary case, the model coefficients $\mu(t) = \mu$, $A(t) = A$, $\sigma(t) = \sigma$ are constant in time, and can be evaluated in an optimal way through ordinary least squares. In the non-stationary case, we invoke the FEM-VARX framework of Horenko (2010b) by clustering the training data into $K$ clusters, and evaluating coefficients $A_k$, $\sigma_k$, $k = 1, \ldots, K$ for each cluster (see Horenko (2010a,b) for more details on this algorithm). From the algorithm we obtain a cluster affiliation function $\Gamma(t)$, such that $\Gamma(t_i) = k$ indicates that at time $t_i$ the model coefficients are $A(t_i) = A_k$, $\sigma(t_i) = \sigma_k$. In addition to choosing the number of clusters $K$, the method also has a persistence parameter $C$ that governs the number of allowable switches between clusters, which must be chosen prior to calculating the model coefficients. These parameters are usually chosen to be optimal in the sense of the Akaike Information Criterion (AIC) (Horenko, 2010b; Metzner et al, 2012), an information-theoretic measure for model selection which penalizes overfitting by a large number of parameters.
As mentioned earlier, while non-stationary autoregressive models may perform better than stationary models in fitting, an inherent difficulty in a prediction setting is the advancement of the model coefficients $A(t), \sigma(t)$ beyond the training period, which in the above framework amounts to solely advancing the cluster affiliation function $\Gamma(t)$. If we call $p_k(t)$ the probability of the model being at cluster $k$ at time $t$, we can view $\Gamma(t)$ as determining a Markov switching process on the cluster member probabilities

$$p(t) = (p_1(t), \ldots, p_K(t)),$$
which over the training period will be 1 in one entry and 0 elsewhere at any given time. We can estimate the transition probability matrix $T$ of that Markov process by using the optimal cluster affiliation sequence $\Gamma(t)$ from the FEM-VARX framework (here optimal is for the training period). Assuming the Markov hypothesis, we can estimate the stationary probability transition matrix directly from $\Gamma(t)$ by

$$T_{ij} = \frac{N_{ij}}{\sum_{k=1}^K N_{ik}},$$

where $N_{ij}$ is the number of direct transitions from state $i$ to state $j$ (Horenko, 2011a; Franzke et al, 2009). This estimated transition probability matrix $T$ can be used to model the Markov switching process in the following ways.
2.4.1 Generating predictions of cluster affiliation
The first method we employ is to advance the cluster member probabilities $p(t)$ using the estimated probability transition matrix $T$ by the deterministic equation (Franzke et al, 2009)

$$p(t_0 + \tau) = p(t_0)\, T^\tau,$$

where $p(t_0) = (p_1(t_0), p_2(t_0))$ is the initial cluster affiliation, which is determined by which cluster center the initial point $x(t_0)$ is closest to, and $p_i(t_0)$ is either 0 or 1.
The second method we employ is to use the estimated transition matrix $T$ to generate a realization of the Markov switching process $\Gamma_R$, and use this to determine the model cluster at any given time, maintaining strict model affiliation. Thus $p_k(t) = 1$ if $\Gamma_R(t) = k$, and 0 otherwise.
2.5 Error metrics
To gauge the fidelity of our predictions, we will evaluate the average root-mean-square error (RMSE) and pattern correlation (PC) of our predictions $y$ against the ground truth $x$, where points in our test data set (of length $n'$) are used as initial conditions for our predictions. As a benchmark, we will compare each prediction approach to a simple persistence forecast $y(t) = y(0)$. The error metrics are calculated as
$$\mathrm{rms}^2(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} \left(y(t_j + \tau) - x(t_j + \tau)\right)^2,$$

$$\mathrm{pc}(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} \frac{\left(y(t_j + \tau) - \bar{y}(\tau)\right)\left(x(t_j + \tau) - \bar{x}(\tau)\right)}{\sigma_y(\tau)\, \sigma_x(\tau)},$$

where

$$\bar{y}(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} y(t_j + \tau), \qquad \bar{x}(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} x(t_j + \tau),$$

$$\sigma_y^2(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} \left(y(t_j + \tau) - \bar{y}(\tau)\right)^2, \qquad \sigma_x^2(\tau) = \frac{1}{n'} \sum_{j=1}^{n'} \left(x(t_j + \tau) - \bar{x}(\tau)\right)^2.$$
An important note is that for data-driven observables, such as NLSA eigenfunctions or EOF principal components, there is no underlying ground truth when predicting into the future. As such, a ground truth for comparison needs to be defined when evaluating the error metrics, for which one can use the out-of-sample extended function $\bar{f}(y(t_j + \tau))$ as defined in Equations (7) and (9).
3 Datasets
3.1 CCSM model output
We use model data from the Community Climate System Model (CCSM), versions 3 and 4, for monthly SIC and SST data, restricted to the North Pacific, which we define as 20°–65°N, 120°E–110°W. CCSM3 model data is used from a 900 year control run (experiment b30.004) (Collins et al, 2006). The sea ice component is the Community Sea Ice Model (CSIM; Holland et al, 2006) and the ocean component is the Parallel Ocean Program (POP), both of which are sampled on the same nominal 1° grid. CCSM4 model data is used from a 900 year control run (experiment b40.1850), which uses the Community Ice CodE 4 model for sea ice (CICE4; Hunke and Lipscomb, 2008) and the POP2 model for the ocean component (Smith et al, 2010), also on a common nominal 1° grid. Specific differences and improvements between the two model versions can be found in Gent et al (2011).
NLSA was performed on these data sets, both in single and multiple component settings, with the same embedding window of $q_1 = q_2 = q = 24$ months for each variable, kernel scale $\epsilon = 2$, and kernel normalization $\alpha = 0$. A 24 month embedding window was chosen to allow for dynamical memory beyond the seasonal cycle, as is used in other NLSA studies (e.g., Giannakis and Majda (2012b, 2013); Bushuk et al (2014)). There are 6648 grid points for SST and 3743 for SIC in this region for these models, and with an embedding window of $q = 24$, our lagged embedded data lies in $\mathbb{R}^m$ with $m = 159{,}552$ for SST and $m = 89{,}832$ for SIC. For the purposes of a perfect model experiment, where the same model run is used for both the training and test data, we split the CCSM4 control run into two 400 year sets: years 100–499 for the training data set, and years 500–899 for out-of-sample test points. After embedding, this leaves us with $n = n' = 4777$ samples in each data set. In our model error experiment, we train on 800 years of the CCSM3 control run and use 800 years of CCSM4 for test data, giving us $n = n' = 9577$ data points.
In addition to low-frequency kernel eigenfunctions as observables, we also consider North Pacific integrated sea ice extent anomalies as a target for prediction in Section 4.2. This observable exhibits faster variability, on an intra-annual time-scale, and as such we reduce the embedding window from 24 months to 6 months for our kernel evaluation. Using the CCSM4 model training data, we define monthly anomalies by calculating a climatology $f_c$ of monthly mean sea ice extent. Let $v_j$ be the gridpoints in our domain of interest (North Pacific), $c$ the mean SIC, and $a$ the grid cell area; the sea ice extent anomaly that will be our target observable is then defined as

$$f(t_i) = \sum_j c(v_j, t_i)\, a(v_j) - f_c(t_i). \qquad (10)$$
3.2 Observational data
For observational data, we turn to the Met Office Hadley Centre's HadISST data set (Rayner et al, 2003), and use monthly data for SIC and SST, from years 1979–2012, sampled on a 1° latitude-longitude grid. We assign ice-covered grid points an SST value of −1.8°C, and have removed a trend from the data by calculating a linear trend for each month. There are 4161 spatial grid points for SST, for a lagged embedded dimension of $m = 99{,}864$, and 3919 grid points for SIC, yielding $m = 94{,}056$. For direct comparison with this observational data set, the above CCSM model data sets have been interpolated from the native POP 1° grid to a common 1° latitude-longitude grid. After embedding, we are left with $n' = 381$ observational test data points.
4 Results
4.1 Low-frequency NLSA modes
The eigenfunctions that arise from NLSA typically fall into one of three categories: i) periodic modes, which capture the seasonal cycle and its higher harmonics; ii) low-frequency modes, characterized by a red power spectrum and a slowly decaying autocorrelation function; and iii) intermittent modes, which have the structure of periodic modes modulated with a low-frequency envelope. The intermittent modes are dynamically important (Giannakis and Majda, 2012a), shifting between periods of high activity and quiescence, but carry little variance, and are thus typically missed or mixed between modes in classical SSA. Examples of each of these eigenfunctions arising from a CCSM4 data set with SIC and SST variables are shown in Figure 1, for years 100–499 of the pre-industrial control run. The corresponding out-of-sample extension eigenfunctions, defined through the Nyström method, are shown in Figure 2, and are computed using years 500–899 of the same CCSM4 pre-industrial control run as test (out-of-sample) data. We use the notation $\phi^S_{L1}$, $\phi^I_{L1}$, or $\phi^{SI}_{L1}$ to indicate whether the NLSA mode is from SST, SIC, or joint SST and SIC variables, respectively.
[Figure 1: panels showing a snapshot of the time series, the power spectral density, and the autocorrelation function for eigenfunctions $\phi_1$, $\phi_3$, $\phi_9$, $\phi_{10}$, $\phi_{14}$, $\phi_{15}$.]
Fig. 1 Select NLSA eigenfunctions for CCSM4 North Pacific SST and SIC data. $\phi_1, \phi_3$ are periodic (annual and semi-annual) modes, characterized by a peak in the power spectrum and an oscillatory autocorrelation function; $\phi_9, \phi_{14}$ are the two leading low-frequency modes, characterized by a red power spectrum and a slowly decaying monotone autocorrelation function; $\phi_{10}, \phi_{15}$ are intermittent modes, characterized by a broad peak in the power spectrum and a decaying oscillatory autocorrelation function.
We perform our prediction schemes for five year time leads by applying the kernel ensemble analog forecast methods discussed in Section 2.3 to the leading two low-frequency modes $\phi^{SI}_{L1}$, $\phi^{SI}_{L2}$ from NLSA on the North Pacific, shown in Figure 1. The leading low-frequency modes extracted through NLSA can be thought of as analogs of the well known PDO and NPGO modes, even in the multivariate setting. We have high correlations between the leading multivariate and univariate low-frequency NLSA modes, with $\mathrm{corr}(\phi^{SI}_{L1}, \phi^I_{L1}) = -0.9907$ for our analog of the NPGO mode, and $\mathrm{corr}(\phi^{SI}_{L2}, \phi^S_{L1}) = 0.8415$ for our analog of the PDO mode.
As a benchmark, we compare against the simple constant persistence forecast $y(t) = y(0)$, which can perform reasonably well given the long decorrelation time of these low-frequency modes, and in fact beats parametric autoregressive models, as we will see below. We define the ground truth itself to be the out-of-sample eigenfunction calculated by Equation (7), so by construction, our predictions by the geometric harmonics method are exact at time lag $\tau = 0$,
[Figure 2: panels showing a snapshot of the time series, the power spectral density, and the autocorrelation function for the out-of-sample extensions of the eigenfunctions in Figure 1.]
Fig. 2 Out-of-sample extension eigenfunctions, computed via Equation (7), for CCSM North Pacific SST and SIC data, associated with the same select (in-sample) eigenfunctions shown in Figure 1.
whereas predictions using Laplacian pyramids will have reconstruction errors at time lag $\tau = 0$.
4.1.1 Perfect model
We first consider the perfect model setting, where the same dynamics generate the training data and the test (forecast) data. This should give us a measure of the potential predictability of the methods. Snapshots of sample prediction trajectories along with the associated ground truth out-of-sample eigenfunction are shown in Figure 3. In Figure 4, we see that for the leading low-frequency mode $\phi^{SI}_{L1}$ (our NPGO analog), the ensemble based predictions perform only marginally better than persistence in the PC metric, but have improved RMSE scores over longer timescales. However, with the second mode $\phi^{SI}_{L2}$ (our PDO analog), we see a more noticeable gain in predictive skill with the ensemble analog methods over persistence. If we take 0.6 as a PC threshold (Collins, 2002), below which we no longer consider the model to have predictive skill, we see an increase of about 8 months in predictive skill with the ensemble analog methods over persistence, with skillful forecasts up to 20 months lead time. We note these low-frequency modes extracted from multivariate data exhibit similar predictability (as measured by when the PC falls below the 0.6 threshold) to their univariate counterparts ($\phi^S_{L1}$ or $\phi^I_{L1}$, results not shown).
[Figure 3: NPGO and PDO sample trajectories over 60 months for the truth (T), geometric harmonics (GH), and Laplacian pyramid (LP) forecasts.]
Fig. 3 Sample snapshot trajectories of ensemble analog prediction results via geometric harmonics (GH, blue, solid) and Laplacian pyramid (LP, blue, dashed), for the leading two low-frequency modes, in the perfect model setting using CCSM4 data. The ground truth (T, black) is itself an out-of-sample extension eigenfunction, as shown in Figure 2.
[Figure 4: RMSE and PC over 60 months for NPGO and PDO predictions via geometric harmonics (GH), Laplacian pyramid (LP), and persistence (P).]
Fig. 4 Kernel ensemble analog prediction results via geometric harmonics and Laplacian pyramid for the leading two low-frequency modes, in the perfect model setting using CCSM4 data. Both out-of-sample extension methods outperform the persistence forecast (P) in both error metrics, particularly for the $\phi^{SI}_{L2}$ (PDO) mode.
4.1.2 Model error
To incorporate model error into our prediction experiment, we train on the CCSM3 model data, serving as our 'model', and then use CCSM4 model data as our test data, serving the role of our 'nature'; the difference in the fidelity of the two model versions represents general model error. In this experiment, our ground truth is an out-of-sample eigenfunction trained on CCSM3 data and extended using CCSM4 test data. For the leading $\phi^{SI}_{L1}$ (NPGO) mode, we again see marginally increased predictive performance of the ensemble analog predictions over persistence at short time scales in PC in Figure 5, but at medium to long time scales this improvement is lost (though after the score has fallen below the 0.6 threshold). This could be due to the increased fidelity of the CCSM4 sea ice model component over its CCSM3 counterpart, where using the less sophisticated model data for training leaves us a bit handicapped in trying to predict the more sophisticated model data (Gent et al, 2011; Holland et al, 2012). The improvement in the predictive skill of the $\phi^{SI}_{L2}$ (PDO) mode over persistence is less pronounced in the presence of model error than it was in the perfect model case shown in Figure 4. Nevertheless, the kernel ensemble analog forecasts still provide a substantial improvement in skill compared to the persistence forecast, extending the PC = 0.6 threshold to 20 months.
[Figure 5: RMSE and PC over 60 months for NPGO and PDO predictions via geometric harmonics (GH), Laplacian pyramid (LP), and persistence (P), in the model error setting.]
Fig. 5 Kernel ensemble analog forecasting prediction results for the leading two low-frequency modes, with model error. CCSM3 model data is used for the in-sample data, and CCSM4 model data is used as the out-of-sample data. While the predictions still outperform persistence in the error metrics, there is less gain in predictive skill as compared to the perfect model case.
We can further examine the model error scenario by using the actual observational data set as our nature, and CCSM4 as our model (Figure 6). Given the short observational record used, far fewer prediction realizations are generated, adding to the noisiness of the error metrics. The loss of predictability is apparent, especially in the $\phi^{SI}_{L2}$ mode, where the kernel ensemble analog forecasts fail to beat persistence, and drop below 0.6 in PC by 10 months, half as long as the perfect model case.
4.1.3 Comparison with Autoregressive Models
We compare the ensemble analog predictions to standard stationary autoregressive models, as well as non-stationary models using the FEM-VARX framework of Horenko (2010b) discussed in Section 2.4, for the low-frequency modes $\phi^{SI}_{L1}$, $\phi^{SI}_{L2}$ generated from the CCSM4 model (Figure 7) and the HadISST observational (Figure 8) training data. For both sets of low-frequency modes, $K = 2$ clusters was judged optimal by the AIC, as mentioned in Section 2.4; see Horenko (2010b); Metzner
[Figure 6: RMSE and PC over 60 months for NPGO and PDO predictions via geometric harmonics (GH), Laplacian pyramid (LP), and persistence (P), with HadISST as out-of-sample data.]
Fig. 6 Ensemble analog prediction results for the leading two low-frequency modes, with model error. CCSM4 model data is used for the in-sample data, and HadISST observational data is used as the out-of-sample data. Here the method produces little advantage over persistence, given the model error between model and nature, and actually fails to beat persistence for the $\phi^{SI}_{L2}$ (PDO) mode.
et al (2012) for more details. The coefficients for each cluster are nearly the same (autoregressive coefficient close to 1, similar noise coefficients), apart from the constant forcing coefficient $\mu_i$, which is of almost equal magnitude and opposite sign across the two clusters, suggesting two distinct regime behaviors. In the stationary case, the external forcing coefficient $\mu$ is very close to 0.
In the top left panel of Figures 7 and 8, we display snapshots of the leading low-frequency mode $\phi^{SI}_{L1}$ (NPGO) trajectory reconstruction during the training period, for the stationary (blue, solid) and non-stationary (red, dashed) models, along with the cluster switching function associated with the non-stationary model. In both the CCSM4 model and HadISST data sets, the non-stationary model snapshot is a better representation of the truth (black, solid), and the benefit over the stationary model is more clearly seen in the CCSM4 model data, Figure 7, which has the benefit of a 400 year training period, as opposed to the shorter 16 year training period with the observational data set.
In the prediction setting, however, the non-stationary models, which are reliant on an advancement of the unknown cluster affiliation function $\Gamma(t)$ beyond the training period, as discussed in Section 2.4, fail to outperform their stationary counterparts in the RMSE and PC metrics (bottom panels of Figures 7 and 8). In fact, none of the proposed regression prediction models are able to outperform the simple persistence forecast in these experiments. As a measure of the potential prediction skill of the non-stationary models, by which we mean the skill attainable if perfect knowledge of the underlying optimal cluster switching function $\Gamma(t)$ were available over the test period, we have run the experiment of replacing the test data period with the training data set, and find exceedingly strong predictive performance, with PC between 0.7 and 0.8 for all time lags tested, up to 60 months. Similar qualitative results were found for the second leading low-frequency mode $\phi^{SI}_{L2}$ (PDO) for each data set (not shown). This suggests that the Markov hypothesis, the basis for the predictions of $\Gamma(t)$, is not accurate, and other methods incorporating more memory are needed.
[Figure 7: top left, FEM-VARX cluster affiliation and reconstructed mode over the training period; top right, sample trajectories; bottom, RMSE and PC over 60 months for the prediction methods P1, P2, K1, M1, M2, and PT.]
Fig. 7 Top left: Snapshot of the true CCSM4 $\phi^{SI}_{L1}$ (NPGO) trajectory (black) with reconstructed stationary (blue) and non-stationary $K = 2$ (red) FEM-VARX model trajectories, along with the corresponding model affiliation function $\Gamma(t)$ for the non-stationary case. Top right: Sample trajectories for various prediction methods: P2 = stationary, using FEM-VARX model coefficients from the initial cluster; K1 = stationary autoregressive; M1, M2 = FEM-VARX with predictions as described in Section 2.4.1, where M1 is deterministic evolution of the cluster affiliation $p(t)$, and M2 uses realizations of $p$ generated from the estimated probability transition matrix $T$. Bottom panels: RMSE and PC as a function of lead time for the various prediction methods, including P1 = persistence as a benchmark. The dashed black line is the potential predictive skill of non-stationary FEM-VARX, where predictions were run over the training period using the known optimal model affiliation function $\Gamma(t)$.
4.2 Sea ice anomalies
The target observables considered thus far for prediction have been data driven, and as such influenced by the data analysis algorithm. Hence there is no objective ground truth available when predicting these modes beyond the training period on which the data analysis was performed; while in this case the NLSA algorithm was used, other data analysis methods such as EOFs would suffer the same drawback. We wish to test our prediction method on an observable that is objective, in the sense that it can be computed independently of the data analysis algorithm, for which we turn
[Figure 8: top left, FEM-VARX cluster affiliation and reconstructed mode over the training period; top right, sample trajectories; bottom, RMSE and PC over 60 months for the prediction methods P1, P2, K1, M1, M2, and PT.]
Fig. 8 Top left: Snapshot of the true HadISST $\phi^{SI}_{L1}$ (NPGO) trajectory (black) with stationary (blue) and non-stationary $K = 2$ (red) FEM-VARX model trajectories, along with the corresponding model affiliation function $\Gamma(t)$ for the non-stationary case. Top right: Sample trajectories for various prediction methods; see Figure 7 for details of the methods. Bottom panels: RMSE and PC as a function of lead time for the various prediction methods. The dashed black line is the potential predictive skill of non-stationary FEM-VARX, where predictions were run over the training period using the known optimal model affiliation function $\Gamma(t)$.
to integrated sea ice extent anomalies, as defined in Equation (10). We can compute the time series of sea ice anomalies from the out-of-sample set directly (relative to the training set climatology), which will be our ground truth, and use the Laplacian pyramid approach to generate our out-of-sample extension predictions. This observable does not have a tight expansion in the eigenfunction basis, so the geometric harmonics method of extension would be ill-conditioned, and is thus not considered. We note that in this approach, there are reconstruction errors at time lag $\tau = 0$, so at very short time scales we cannot outperform persistence. We consider a range of truncation levels for the number of ensemble members used, which are nearest neighbors to the out-of-sample data point, as determined by the kernel function. Using all available neighbors will likely overly smooth and average out features in forward trajectories, while using too few neighbors will place too much weight on particular trajectories. Indeed we find good performance using 100 neighbors (out of a total possible 4791). The top left panel of Figure 9 shows a snapshot of the true sea ice extent anomalies, together with a reconstructed out-of-sample extension using the Laplacian pyramid. To be clear, this is not a prediction trajectory; rather, each point in the out-of-sample extension is calculated using Equation (9), that is, each point is a time-lead $\tau = 0$ reconstruction.
[Figure 9: top left, true and out-of-sample extension North Pacific sea ice anomaly (km²) over 120 months; top right, sample anomaly prediction; bottom, RMSE and PC over 15 months for ensembles using all, 100, and 10 nearest neighbors (nN) and persistence (P).]
Fig. 9 Top left: True sea ice cover anomalies plotted with the corresponding Laplacian pyramid out-of-sample extension function. The other panels show prediction results using Laplacian pyramids for total North Pacific sea ice cover anomalies in CCSM4 data. The number of nearest neighbors (nN) used to form the ensemble was varied, and we find the best performance when the ensemble is restricted to the nearest 100 neighbors, corresponding to about 2% of the total sample size.
The top right panel shows sample snapshots of prediction trajectories, restricting the ensemble size to the nearest 10, 100, and then all nearest neighbors. Notice in particular that the predictions match the truth when the anomalies are close to 0, but may subsequently progress with the opposite sign to the truth. As our prediction metrics are averaged over initial conditions spanning all months, the difficulty the predictions have in projecting from a state of near 0 anomaly significantly hampers the ability for long-range predictability of this observable.
In the bottom two panels of Figure 9 we have the averaged error metrics, and see year-to-year correlations manifesting as a dip/bump in RMSE and PC in the persistence forecast that occurs after 12 months. After the first month lag time, the kernel ensemble analog forecasts overcome the reconstruction error and beat persistence in both RMSE and PC, and give about a 2 month increase in prediction skill (as measured by when the PC drops below 0.6) over persistence. We see the best performance restricting the ensemble size to 100 nearest neighbors (about 2% of the total sample size) in both the RMSE and PC metrics, though the advantage is marginal before the error metrics drop below the 0.6 threshold.
Pushing the prediction strategy to an even more difficult problem, in Figure 10 we try to predict observational sea ice extent anomalies using CCSM4 model data as training data. In this scenario, without knowledge of the test data climatology, the observational sea ice extent anomalies are defined using the CCSM4 climatology. In the top panel of Figure 10 we see the strong bias that results, as the observational record has less sea ice than the CCSM model climatology, which was taken from a pre-industrial control run. This strongly hampers the ability to accurately predict observational sea ice extent anomalies using CCSM4 model ensemble analogs, and as a result the only predictive skill we see is from the annual cycle.
[Figure 10: top left, true and out-of-sample extension North Pacific sea ice anomaly (km²) over 120 months; top right, sample anomaly prediction; bottom, RMSE and PC over 15 months for the Laplacian pyramid forecast (LP) and persistence (P).]
Fig. 10 Prediction results for total observed (HadISST data) North Pacific sea ice extent anomalies, using CCSM4 as training data. The ice cover function is out-of-sample extended via Laplacian pyramid, using 100 nearest neighbors.
5 Discussion
We have examined a recently proposed prediction strategy employing a kernel ensemble analog forecasting scheme that makes use of out-of-sample extension techniques. These nonparametric, data-driven methods make no assumptions about the underlying governing dynamics or statistics. We have used these methods in conjunction with NLSA to extract low-frequency modes of variability from North Pacific SST and SIC data sets, both from models and observations. We find that for these low-frequency modes, the analog forecasting performs at least as well as, and in many cases better than, the simple constant persistence forecast. Predictive skill, as measured by the PC exceeding 0.6, can be increased by up to 3 to 6 months for low-frequency modes of variability in the North Pacific. This is a strong advantage over traditional parametric regression models, which were shown to fail to beat persistence.
The kernel ensemble analog forecasting methods outlined included two variations on the underlying out-of-sample extension scheme, each with its strengths and weaknesses. The geometric harmonics method, based on the Nyström method, worked well for observables that are band-limited in the eigenfunction basis, in particular the eigenfunctions themselves. For observables not easily expressed in such a basis, however, the Laplacian pyramid provides an alternative method based on a multiscale decomposition of the original observable.
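To make the band-limited restriction concrete, the sketch below performs a generic Nyström-style extension of an observable expanded in the leading eigenvectors of a kernel matrix; truncating to n_eig modes is precisely why such schemes suit band-limited observables. This illustrates the structure of the approach only, not the specific geometric harmonics construction, and the Gaussian kernel and all names are our own choices.

import numpy as np

def nystrom_extend(x_new, X, f, epsilon, n_eig=20):
    """Nystrom-style out-of-sample extension of an observable f (sketch).

    X : (n, d) training points; f : (n,) observable values on X.
    """
    # Gaussian kernel matrix on the training set and its leading eigenpairs.
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    lam, phi = np.linalg.eigh(np.exp(-D2 / epsilon))
    lam, phi = lam[::-1][:n_eig], phi[:, ::-1][:, :n_eig]
    # Coefficients of f in the truncated (band-limited) eigenvector basis.
    c = phi.T @ f
    # The kernel row from the new point extends each eigenvector (Nystrom formula).
    k_new = np.exp(-np.sum((X - x_new) ** 2, axis=1) / epsilon)
    phi_new = (k_new @ phi) / lam
    return phi_new @ c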
While the low-frequency eigenfunctions from NLSA were a natural preferred class of observables to target for prediction, we also studied the case of objective observables uninfluenced by the data analysis algorithm. Motivated by the strong reemergence phenomena, we considered sea ice extent anomalies as our target for prediction in the North Pacific. Using a shorter embedding window, owing to the faster (seasonal) time scale dynamics, we obtain approximately a two month increase in predictive skill over the persistence forecast. It is also evident that when considering regional sea ice extent anomalies, winds play a large role in moving ice into and out of the domain of interest, and as such the atmospheric component of the system could additionally be included in the multivariate kernel function, despite its weaker low-frequency variability.
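The embedding step referred to here is a standard Takens-style delay embedding, sketched below with a purely illustrative window length; the window lengths actually used for each observable are those stated earlier in the paper.

import numpy as np

def lag_embed(x, q):
    """Takens-style delay embedding: each row concatenates q consecutive samples.

    x : (n,) or (n, d) time series; returns an (n - q + 1, q * d) trajectory.
    """
    x = np.atleast_2d(np.asarray(x).T).T  # promote (n,) to (n, 1)
    n = len(x)
    return np.hstack([x[i:n - q + 1 + i] for i in range(q)])

# A shorter window suits faster (seasonal) dynamics such as sea ice extent
# better than the longer windows used for decadal SST modes; q = 6 is illustrative.
series = np.random.default_rng(1).standard_normal(240)
X_embedded = lag_embed(series, q=6)  # shape (235, 6)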
An important consideration is that our prediction metrics are averaged over initial conditions ranging over all possible initial states of the system. As we saw clearly in the case of North Pacific sea ice extent anomalies, these prediction strategies can have difficulty projecting from an initial state of quiescence, and can easily predict the wrong sign of an active state, greatly hampering predictive skill. On the other hand, we would expect predictive skill to be stronger for those initial states that begin in a strongly active state, or, said differently, clearly in one climate regime as opposed to in transition between the two. Future work will further explore conditional forecasting, where we condition forecasts on either the initial month or the target month; a sketch of such stratification follows below. Extending this analysis to the North Atlantic, another region of strong low-frequency variability, is also a natural progression of this work.
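A minimal version of that conditioning simply stratifies the verification by initial calendar month before averaging, as in the sketch below; it assumes forecasts are stored per initial condition with known initial months, and the name pc_by_init_month is ours.

import numpy as np

def pc_by_init_month(forecasts, truths, init_months):
    """Pattern correlation vs lead, conditioned on initial calendar month.

    forecasts, truths : (n_init, n_lead); init_months : (n_init,) in 0..11.
    Returns a (12, n_lead) array with one PC curve per initial month.
    """
    out = np.full((12, forecasts.shape[1]), np.nan)
    for m in range(12):
        sel = init_months == m
        if sel.sum() < 2:
            continue  # too few cases to form anomalies for this month
        fa = forecasts[sel] - forecasts[sel].mean(axis=0)
        ta = truths[sel] - truths[sel].mean(axis=0)
        out[m] = (fa * ta).sum(axis=0) / np.sqrt(
            (fa ** 2).sum(axis=0) * (ta ** 2).sum(axis=0))
    return out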
Acknowledgements The research of Andrew Majda and Dimitrios Giannakis is partially supported by ONR MURI grant 25-74200-F7112. Darin Comeau is supported as a postdoctoral fellow through this grant. The research of Dimitrios Giannakis is also partially supported by ONR DRI grant N00014-14-0150.
References

Alessandrini S, Delle Monache L, Sperati S, Nissen J (2015) A novel application of an analog ensemble for short-term wind power forecasting. Renewable Energy 76:768–781

Alexander MA, Deser C, Timlin MS (1999) The reemergence of SST anomalies in the North Pacific Ocean. Journal of Climate 12(8):2419–2433

Berry T, Cressman JR, Gregurić-Ferenček Z, Sauer T (2013) Time-scale separation from diffusion-mapped delay coordinates. SIAM Journal on Applied Dynamical Systems 12(2):618–649

Blanchard-Wrigglesworth E, Bitz CM (2014) Characteristics of Arctic sea-ice thickness variability in GCMs. Journal of Climate 27(21):8244–8258

Blanchard-Wrigglesworth E, Armour KC, Bitz CM, DeWeaver E (2011a) Persistence and inherent predictability of Arctic sea ice in a GCM ensemble and observations. Journal of Climate 24(1):231–250

Blanchard-Wrigglesworth E, Bitz C, Holland M (2011b) Influence of initial conditions and climate forcing on predicting Arctic sea ice. Geophysical Research Letters 38(18)

Branstator G, Teng H, Meehl GA, Kimoto M, Knight JR, Latif M, Rosati A (2012) Systematic estimates of initial-value decadal predictability for six AOGCMs. Journal of Climate 25(6):1827–1846

Bushuk M, Giannakis D, Majda AJ (2014) Reemergence mechanisms for North Pacific sea ice revealed through nonlinear Laplacian spectral analysis. Journal of Climate 27(16):6265–6287

Chen N, Majda AJ (2015a) Predicting the cloud patterns for the boreal summer intraseasonal oscillation through a low-order stochastic model. Mathematics of Climate and Weather Forecasting

Chen N, Majda AJ (2015b) Predicting the real-time multivariate Madden-Julian oscillation index through a low-order nonlinear stochastic model. Monthly Weather Review

Chen N, Majda A, Giannakis D (2014) Predicting the cloud patterns of the Madden-Julian oscillation through a low-order nonlinear stochastic model. Geophysical Research Letters 41(15):5612–5619

Coifman RR, Lafon S (2006a) Diffusion maps. Applied and Computational Harmonic Analysis 21(1):5–30

Coifman RR, Lafon S (2006b) Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis 21(1):31–52

Collins M (2002) Climate predictability on interannual to decadal time scales: the initial value problem. Climate Dynamics 19(8):671–692

Collins WD, Bitz CM, Blackmon ML, Bonan GB, Bretherton CS, Carton JA, Chang P, Doney SC, Hack JJ, Henderson TB, et al (2006) The Community Climate System Model version 3 (CCSM3). Journal of Climate 19(11):2122–2143

Day J, Tietsche S, Hawkins E (2014) Pan-Arctic and regional sea ice predictability: Initialization month dependence. Journal of Climate 27(12):4371–4390

Di Lorenzo E, Schneider N, Cobb K, Franks P, Chhak K, Miller A, McWilliams J, Bograd S, Arango H, Curchitser E, et al (2008) North Pacific Gyre Oscillation links ocean climate and ecosystem change. Geophysical Research Letters 35(8)

Drosdowsky W (1994) Analog (nonlinear) forecasts of the Southern Oscillation index time series. Weather and Forecasting 9(1):78–84

Frankignoul C, Hasselmann K (1977) Stochastic climate models, Part II: Application to sea-surface temperature anomalies and thermocline variability. Tellus 29(4):289–305

Franzke C, Crommelin D, Fischer A, Majda AJ (2008) A hidden Markov model perspective on regimes and metastability in atmospheric flows. Journal of Climate 21(8):1740–1757

Franzke C, Horenko I, Majda AJ, Klein R (2009) Systematic metastable atmospheric regime identification in an AGCM. Journal of the Atmospheric Sciences 66(7):1997–2012

Gent PR, Danabasoglu G, Donner LJ, Holland MM, Hunke EC, Jayne SR, Lawrence DM, Neale RB, Rasch PJ, Vertenstein M, et al (2011) The Community Climate System Model version 4. Journal of Climate 24(19):4973–4991

Ghil M, Allen M, Dettinger M, Ide K, Kondrashov D, Mann M, Robertson AW, Saunders A, Tian Y, Varadi F, et al (2002) Advanced spectral methods for climatic time series. Reviews of Geophysics 40(1):3-1

Giannakis D, Majda AJ (2012a) Comparing low-frequency and intermittent variability in comprehensive climate models through nonlinear Laplacian spectral analysis. Geophysical Research Letters 39(10)

Giannakis D, Majda AJ (2012b) Limits of predictability in the North Pacific sector of a comprehensive climate model. Geophysical Research Letters 39(24)

Giannakis D, Majda AJ (2012c) Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability. Proceedings of the National Academy of Sciences 109(7):2222–2227

Giannakis D, Majda AJ (2013) Nonlinear Laplacian spectral analysis: capturing intermittent and low-frequency spatiotemporal patterns in high-dimensional data. Statistical Analysis and Data Mining 6(3):180–194

Giannakis D, Majda AJ (2014) Data-driven methods for dynamical systems: Quantifying predictability and extracting spatiotemporal patterns. Mathematical and Computational Modeling: With Applications in Engineering and the Natural and Social Sciences, p 288

Hasselmann K (1976) Stochastic climate models, Part I: Theory. Tellus 28(6):473–485

Holland MM, Bitz CM, Hunke EC, Lipscomb WH, Schramm JL (2006) Influence of the sea ice thickness distribution on polar climate in CCSM3. Journal of Climate 19(11):2398–2414

Holland MM, Bailey DA, Briegleb BP, Light B, Hunke E (2012) Improved sea ice shortwave radiation physics in CCSM4: the impact of melt ponds and aerosols on Arctic sea ice. Journal of Climate 25(5):1413–1430

Horenko I (2010a) On clustering of non-stationary meteorological time series. Dynamics of Atmospheres and Oceans 49(2):164–187

Horenko I (2010b) On the identification of nonstationary factor models and their application to atmospheric data analysis. Journal of the Atmospheric Sciences 67(5):1559–1574

Horenko I (2011a) Nonstationarity in multifactor models of discrete jump processes, memory, and application to cloud modeling. Journal of the Atmospheric Sciences 68(7):1493–1506

Horenko I (2011b) On analysis of nonstationary categorical data time series: Dynamical dimension reduction, model selection, and applications to computational sociology. Multiscale Modeling & Simulation 9(4):1700–1726

Hunke E, Lipscomb W (2008) CICE: The Los Alamos sea ice model, documentation and software, version 4.0. Tech. Rep. LA-CC-06-012, Los Alamos National Laboratory

Lorenz EN (1969) Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences 26(4):636–646

Majda AJ, Harlim J (2013) Physics constrained nonlinear regression models for time series. Nonlinearity 26(1):201

Majda AJ, Yuan Y (2012) Fundamental limitations of ad hoc linear and quadratic multi-level regression models for physical systems. Discrete and Continuous Dynamical Systems B 17(4):1333–1363

Mantua NJ, Hare SR, Zhang Y, Wallace JM, Francis RC (1997) A Pacific interdecadal climate oscillation with impacts on salmon production. Bulletin of the American Meteorological Society 78(6):1069–1079

Metzner P, Putzig L, Horenko I (2012) Analysis of persistent nonstationary time series and applications. Communications in Applied Mathematics and Computational Science 7(2):175–229

Newman M, Compo GP, Alexander MA (2003) ENSO-forced variability of the Pacific decadal oscillation. Journal of Climate 16(23):3853–3857

Nyström EJ (1930) Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica 54(1):185–204

Overland J, Rodionov S, Minobe S, Bond N (2008) North Pacific regime shifts: Definitions, issues and recent transitions. Progress in Oceanography 77(2):92–102

Overland JE, Percival DB, Mofjeld HO (2006) Regime shifts and red noise in the North Pacific. Deep Sea Research Part I: Oceanographic Research Papers 53(4):582–588

Rabin N, Coifman RR (2012) Heterogeneous datasets representation and learning using diffusion maps and Laplacian pyramids. In: SDM, SIAM, pp 189–199

Rayner N, Parker DE, Horton E, Folland C, Alexander L, Rowell D, Kent E, Kaplan A (2003) Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. Journal of Geophysical Research: Atmospheres 108(D14)

Sauer T, Yorke JA, Casdagli M (1991) Embedology. Journal of Statistical Physics 65(3-4):579–616

Smith R, Jones P, Briegleb B, Bryan F, Danabasoglu G, Dennis J, Dukowicz J, Eden C, Fox-Kemper B, Gent P, et al (2010) The Parallel Ocean Program (POP) reference manual: Ocean component of the Community Climate System Model (CCSM). Los Alamos National Laboratory, LAUR-10-01853

Takens F (1981) Detecting strange attractors in turbulence. Springer

Tietsche S, Day J, Guemas V, Hurlin W, Keeley S, Matei D, Msadek R, Collins M, Hawkins E (2014) Seasonal to interannual Arctic sea ice predictability in current global climate models. Geophysical Research Letters 41(3):1035–1043

Xavier PK, Goswami BN (2007) An analog method for real-time forecasting of summer monsoon subseasonal variability. Monthly Weather Review 135(12):4149–4160

Zhao Z, Giannakis D (2014) Analog forecasting with dynamics-adapted kernels. arXiv preprint arXiv:1412.3831