Nonlinear Time Series
Recall that a linear time series {X_t} is one that follows the relation
    X_t = μ + Σ_{i=0}^∞ ψ_i A_{t−i},
where {At} is iid with mean 0 and finite variance.
A linear time series is stationary if Σ_{i=0}^∞ ψ_i² < ∞.
A time series that cannot be put in this form is nonlinear.
1
Tests for Nonlinearity
What kinds of statistics would be useful in testing for nonlinearity?
Null hypothesis: the data follow a linear time series model.
One approach would be to fit some kind of general linear model
to the data, and then use some statistic computed from the
residuals.
Another approach would be based on comparing transforms of the data with the known properties of those transforms under the hypothesized model.
For tests regarding time series specifically (and maybe a few
other types of data) the transform could be into the frequency
domain.
A different approach would be to specify an alternative hypothesis and test against it specifically.
2
Tests for Autocorrelations of Squared Residuals
Residuals of what?
An ARMA(p, q) model is a pretty good approximation for linear
time series.
Before attempting to fit an ARMA model, we should do some
exploratory analyses to make sure we’re even in the right ballpark.
Are there any obvious departures from stationarity?
Trends? Would differencing help?
When it appears that we may have a stationary process, we go
through the usual motions to fit an ARMA(p, q) model.
Is the model a good fit?
What could go wrong?
There may be an ARCH effect.
3
Tests for Autocorrelations of Squared Residuals
The “ARCH effect” arises from autocorrelations of squared residuals from an ARMA model.
The simplest test for autocorrelations is based on the asymptotic normality of ρ̂(h) under the null hypothesis of 0 autocorrelation at lag h.
(Recall that the test would be a t test, where the denominator is
    √( (1 + 2 Σ_{i=1}^{h−1} ρ̂²(i)) / n ).
The denominator is not obvious.)
4
Tests for Autocorrelations of Squared Residuals
Of course, if ρ̂(h) is asymptotically normal, then ρ̂²(h) properly normalized is asymptotically chi-squared, and if ρ̂²(i) for i = 1, . . . , m are independent then the sum of them, each properly normalized, is asymptotically chi-squared with m degrees of freedom.
These facts led to the Q*(m) portmanteau test of Box and Pierce, and then to the modified portmanteau test of Ljung and Box, using the statistic
    Q(m) = n(n+2) Σ_{i=1}^m ρ̂²(i)/(n−i).
This is asymptotically chi-squared with m degrees of freedom.
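The Q(m) statistic is easy to compute directly from the sample autocorrelations. Here is a minimal Python sketch (the function names are mine, not from any package):

```python
def acf(x, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n   # gamma_hat(0), the sample variance
    return [
        sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / (n * c0)
        for h in range(1, max_lag + 1)
    ]

def ljung_box(x, m):
    """Ljung-Box statistic Q(m) = n(n+2) sum_{i=1}^m rho_hat(i)^2 / (n-i)."""
    n = len(x)
    return n * (n + 2) * sum(
        r ** 2 / (n - i) for i, r in enumerate(acf(x, m), start=1)
    )
```

Q(m) is then referred to a chi-squared distribution with m degrees of freedom; applied to squared residuals it becomes the McLeod-Li check discussed below.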
5
Tests for Autocorrelations of Squared Residuals
As we have seen, the Q test applied to squared residuals can be
used to detect an ARCH effect, as suggested by McLeod and Li.
We choose a value of m.
So this is one test for nonlinearity.
A related test is the F test suggested by Engle. This is the usual
F test of
    H0 : β1 = · · · = βm = 0
in the linear regression model
    a²_t = β0 + β1 a²_{t−1} + · · · + βm a²_{t−m} + e_t,
where the a_t are the residuals from the fitted ARMA model.
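For m = 1 this is just a simple regression of a²_t on a²_{t−1}, with F computed from the regression R². A self-contained Python sketch (the function name is mine; in practice the input would be residuals from a fitted ARMA model):

```python
def engle_arch_f(resid):
    """F statistic for H0: beta_1 = 0 in a_t^2 = b0 + b1*a_{t-1}^2 + e_t (m = 1 case).

    `resid` plays the role of residuals from a fitted ARMA model."""
    a2 = [r * r for r in resid]
    y, x = a2[1:], a2[:-1]                 # regress a_t^2 on its first lag
    T = len(y)
    mx, my = sum(x) / T, sum(y) / T
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((v - b0 - b1 * u) ** 2 for u, v in zip(x, y))
    sst = sum((v - my) ** 2 for v in y)
    r2 = 1.0 - sse / sst
    return r2 / ((1.0 - r2) / (T - 2))     # F = (R^2/1) / ((1-R^2)/(T-2))
```

A series with strong volatility clustering gives a large F; residuals with no ARCH effect give a small one.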
6
Stationary Processes in the Frequency Domain
Time series models of the form
    X_t = f(X_{t−1}, X_{t−2}, . . . , A_t, A_{t−1}, . . . ; β)
are said to be represented in the “time domain”.
Representations of a time series as a composition of periodic
behaviors are said to be in the “frequency domain”.
Processes with strong periodic behavior and periodic processes
with a small number of periodicities (audio signals, for example)
are usually modeled better in the frequency domain than in the
time domain.
Financial time series are best analyzed in the time domain.
Stationary processes (and of course, there aren’t many of those
in financial time series!) have an important relationship between
a time-domain measure and a frequency-domain function.
7
Spectral Representation of the ACVF
If we have a stationary process with autocovariance γ(h), then there exists a unique monotonically increasing function F(ω) on the closed interval [−1/2, 1/2], such that F(−1/2) = 0, F(1/2) = γ(0), and
    γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω).
The function F (ω) is called the spectral distribution function.
The proof of this theorem, “the spectral representation theorem”, is available in many books, but we will not prove it in this class.
Note that my notation for Fourier transforms may differ slightly from that in Tsay; the difference is whether frequencies are measured in cycles, as here, or in radians.
8
Spectral Density
The derivative of the spectral distribution function F(ω), which we write as f(ω), is a measure of the intensity of any periodic component at the frequency ω.
We call f(ω) the spectral density.
The ACVF is essentially the Fourier transform of the spectral
density.
By the Inversion Theorem for the Fourier transform, we have, for −1/2 ≤ ω ≤ 1/2,
    f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},
or, for a zero-mean process,
    f(ω) = Σ_{h=−∞}^∞ E(X_t X_{t+h}) e^{−2πiωh}.
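As a quick check of the inversion formula, take an MA(1) process X_t = A_t + θ A_{t−1} with Var(A_t) = σ²: only γ(0) = σ²(1+θ²) and γ(±1) = σ²θ are nonzero, and the sum collapses to σ²|1 + θe^{−2πiω}|². A small Python verification (illustrative code, names mine):

```python
import cmath
import math

def spectral_density_ma1(omega, theta, sigma2=1.0):
    """f(omega) = sum_h gamma(h) e^{-2 pi i omega h} for MA(1): X_t = A_t + theta*A_{t-1}."""
    gamma = {0: sigma2 * (1 + theta ** 2), 1: sigma2 * theta, -1: sigma2 * theta}
    total = sum(g * cmath.exp(-2j * math.pi * omega * h) for h, g in gamma.items())
    return total.real

def closed_form(omega, theta, sigma2=1.0):
    """The same density via the MA polynomial Psi(z) = 1 + theta*z."""
    return sigma2 * abs(1 + theta * cmath.exp(-2j * math.pi * omega)) ** 2
```

The two agree at every frequency, which is the finite analogue of the f(ω) = E(A_t²) Ψ(e^{−2πiω}) Ψ(e^{2πiω}) form used below.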
9
Bispectral Density
Now, if the third moment E(X_t X_{t+u} X_{t+v}) exists and is finite, in the linear time series (taking μ = 0)
    X_t = Σ_{i=0}^∞ ψ_i A_{t−i},
we have
    E(X_t X_{t+u} X_{t+v}) = E(A_t³) Σ_{i=0}^∞ ψ_i ψ_{i+u} ψ_{i+v}.
Now by analogy, we call the double Fourier transform
    b(ω_1, ω_2) = Σ_{u=−∞}^∞ Σ_{v=−∞}^∞ E(X_t X_{t+u} X_{t+v}) e^{−2πi(ω_1 u + ω_2 v)}
the bispectral density.
10
Spectral Densities
Letting Ψ represent the polynomial formed in the usual way from the ψ_i in the linear time series model, we have for the spectral density
    f(ω) = E(A_t²) Ψ(e^{−2πiω}) Ψ(e^{2πiω});
and for the bispectral density,
    b(ω_1, ω_2) = E(A_t³) Ψ(e^{−2πiω_1}) Ψ(e^{−2πiω_2}) Ψ(e^{2πi(ω_1+ω_2)}).
Now, note in this case that
    |b(ω_1, ω_2)|² / ( f(ω_1) f(ω_2) f(ω_1+ω_2) ) = (E(A_t³))² / (E(A_t²))³,
which is constant.
11
Bispectral Test
The constancy of the ratio on the previous slide provides the
basis for a test of nonlinearity.
How would you do that?
Compute it for several subsequences.
There are various nonparametric tests for constancy, and consequently there are various bispectral tests.
Notice also that the numerator in the test statistic is 0, if the
time series is linear and the errors have a normal distribution.
12
BDS Test
The BDS test is named after Brock, Dechert, and Scheinkman, who proposed it.
The test is for strict stationarity of the error process.
For the data x_1, . . . , x_n, it is based on normalized counts of closeness of subsequences X^m_i and X^m_j, where X^m_i = (x_i, . . . , x_{i+m−1}).
For fixed δ > 0, closeness is measured by how many subsequences are within δ of each other in the sup norm.
We define
    I_δ(X^m_i, X^m_j) = 1 if ‖X^m_i − X^m_j‖_∞ ≤ δ, and 0 otherwise.
13
BDS Test
We compare the counts for subsequences of length 1 and k:
    C_1(δ, n) = (2 / (n(n−1))) Σ_{i<j} I_δ(X^1_i, X^1_j)
and
    C_k(δ, n) = (2 / ((n−k+1)(n−k))) Σ_{i<j} I_δ(X^k_i, X^k_j).
In the iid case,
    C_k(δ, n) → (C_1(δ, n))^k,
and asymptotically √n (C_k(δ, n) − (C_1(δ, n))^k) is normal with mean 0 and known variance (see Tsay, page 208).
The null hypothesis that the errors are iid, which is one of the properties of a linear time series, is tested using extreme quantiles of the normal distribution.
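The counts C_m(δ, n) themselves can be computed by brute force in O(n²); a minimal Python sketch (function name mine; the normalization and variance are handled for you by packaged implementations):

```python
def correlation_integral(x, m, delta):
    """C_m(delta, n): normalized count of length-m subsequence pairs
    within delta of each other in the sup norm."""
    subs = [x[i:i + m] for i in range(len(x) - m + 1)]
    n_sub = len(subs)
    close = sum(
        1
        for i in range(n_sub)
        for j in range(i + 1, n_sub)
        if max(abs(a - b) for a, b in zip(subs[i], subs[j])) <= delta
    )
    return 2.0 * close / (n_sub * (n_sub - 1))
```

With m = 1 there are n subsequences and the denominator is n(n−1); with m = k there are n−k+1 and it is (n−k+1)(n−k), matching the definitions above.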
14
BDS Test
Notice that the BDS test depends on two quantities, δ and k.
Obviously, k should be small relative to n.
There is an R function, bds.test, in the tseries package that
performs the computations for the BDS test.
15
RESET Test
The Regression Equation Specification Error Test (RESET) of Ramsey is a general test for misspecification of a linear model (not just a linear time series).
It may detect omitted variables, incorrect functional form, and
heteroscedasticity.
For applications of the RESET test to linear time series, we
assume that the linear model is an AR model.
The test statistic is an F statistic computed from the residuals
of a fitted AR model (see equation (4.44) in Tsay).
Because of the omnibus nature of the alternative hypothesis, the performance of the test is highly variable, and it often has very low power.
There is an R function, resettest, in the lmtest package that
performs the computations for the RESET test.
16
F Tests
There are several variations on the F statistic used in the RESET
test.
Tsay mentions some of these, and you can probably think of
other modifications of the basic ideas.
17
Threshold Test
There are various types of tests that could be constructed based
on dividing the time series into different regimes.
Simple approaches would be based on regimes that are separated
by fixed time points.
Other approaches could be based on regimes in which either
observed values or fitted residuals appear to be different.
Obviously, if the data are used to identify possible thresholds, the significance level of a test must take that fact into consideration.
In general, however they are constructed, the test statistics are F statistics.
18
Time Series Models
I don’t know of any other area of statistics that has so many
different models as in the time domain of time series analysis.
Each model has its own name, and sometimes more than one name.
The common linear models are of the AR and MA types.
We combine AR and MA to get ARMA.
Then we difference a time series to get an ARMA, and call the
complete model ARIMA.
Next, we form AR and MA relations at multiples of longer lags.
This yields seasonal ARMA and ARIMA models.
Most of the linear models in the time domain are of these types.
Then we have the nonlinear time series models.
19
Nonlinear Time Series
Four general types that are useful:
• models of squared quantities such as variances; these are often coupled with other models to allow stochastic volatility; ARIMA+GARCH, for example
• bilinear models
    X_t = c + Σ_{i=1}^p φ_i X_{t−i} − Σ_{j=1}^q θ_j A_{t−j} + Σ_{i=1}^m Σ_{j=1}^s β_ij X_{t−i} A_{t−j} + A_t
• random coefficient models
    X_t = Σ_{i=1}^p (φ_i + U^{(i)}_t) X_{t−i} + A_t
• threshold models – Tsay describes a number of these in Chapters 3 and 4.
Another general source of nonlinearity is local fitting of a general
model.
20
Nonlinear Time Series
The area of nonlinear time series models is where the small modifications with their own specific names really proliferate.
First, we have the basic ones that account for conditional heteroscedasticity: ARCH and GARCH.
Then the modifications (Chapter 3): IGARCH, GARCH-M, EGARCH, TGARCH (also GJR), CHARMA (or RCA), LMSV.
Then further modifications (Chapter 4): TAR (similar to TGARCH, but for linear terms), SETAR (“self-exciting” TAR; the regime depends on a lagged value), STAR, MSA (or MSAR), NAAR.
Other models are local regression models.
Finally, we have algorithmic models, such as neural nets.
I am not going to consider all of these little variations.
The most common method of fitting these models is by maximum likelihood.
There are R functions for many of them.
21
Time Series Models
The names of the wide variety of time series models that evolved from the basic ARCH and GARCH models can be rather confusing.
Some models go by different names. Tsay sometimes refers to the TGARCH(m, s) model as the GJR model (see p. 149). (“GJR” is not in the index for his book.)
Most of the models that Tsay uses are special cases of the APARCH model of Ding, Granger, and Engle (1993).
This model is
    A_t = σ_t ε_t,   (1)
as in the basic ARCH model, and
    σ_t^δ = α_0 + Σ_{i=1}^m α_i (|A_{t−i}| − γ_i A_{t−i})^δ + Σ_{j=1}^s β_j σ_{t−j}^δ.   (2)
This model includes several of the other variations on the GARCH model.
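A simulation sketch of the m = s = 1 case of equation (2) in Python (function and parameter names are mine; |γ₁| ≤ 1 keeps the driving term nonnegative):

```python
import random

def simulate_aparch(n, alpha0, alpha1, gamma1, beta1, delta, seed=0):
    """Simulate A_t = sigma_t * eps_t with the APARCH(1,1) recursion
    sigma_t^delta = alpha0 + alpha1*(|A_{t-1}| - gamma1*A_{t-1})^delta
                    + beta1*sigma_{t-1}^delta.
    Requires |gamma1| <= 1 so the driving term is nonnegative."""
    rng = random.Random(seed)
    sigma_d = alpha0 / (1.0 - alpha1 - beta1)   # rough starting value for sigma^delta
    a_prev = 0.0
    path = []
    for _ in range(n):
        sigma_d = (alpha0
                   + alpha1 * (abs(a_prev) - gamma1 * a_prev) ** delta
                   + beta1 * sigma_d)
        a_prev = sigma_d ** (1.0 / delta) * rng.gauss(0.0, 1.0)
        path.append(a_prev)
    return path
```

Setting δ = 2 and γ₁ = 0 recovers an ordinary GARCH(1,1); nonzero γ₁ gives the leverage effect of the TGARCH/GJR variants.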
22
Transition or Threshold Models
A two-regime transition model is of the general form
    X_t = g_1(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_1) + A_t   if condition 1, and
    X_t = g_2(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_2) + A_t   otherwise.
A threshold model usually depends on the past state and so is of the general form
    X_t = g_1(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_1) + A_t   if (x_{t−1}, . . . , x_{t−p}) ∈ R_1, and
    X_t = g_2(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_2) + A_t   otherwise.
For the specific case of an AR model, the g_i functions above are linear functions of x_{t−1}, . . . , x_{t−p}.
Also, the condition (x_{t−1}, . . . , x_{t−p}) ∈ R_1 is usually simplified to a simple form x_{t−d} ∈ R_1.
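With that simplification, a two-regime SETAR(1) is easy to simulate (an illustrative Python sketch with names of my choosing, not Tsay's code):

```python
import random

def simulate_setar(n, phi1, phi2, threshold=0.0, d=1, seed=0):
    """Two-regime SETAR(1): AR coefficient phi1 when x_{t-d} <= threshold, else phi2."""
    rng = random.Random(seed)
    x = [0.0] * d                      # starting values
    for _ in range(n):
        phi = phi1 if x[-d] <= threshold else phi2
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x[d:]
```

The regime is chosen from the lagged value x_{t−d}, which is what makes the model “self-exciting”.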
23
Smooth Transition Autoregressive (STAR)
Model
An obvious modification is to make the transition smooth by
using a smooth weighting function.
If the linear functions are AR relationships, we have a simple instance, namely, the STAR(p) model:
    X_t = φ_{1,0} + Σ_{i=1}^p φ_{1,i} x_{t−i} + F((x_{t−d} − Δ)/s) (φ_{2,0} + Σ_{i=1}^p φ_{2,i} x_{t−i}) + A_t,
where F(·) is a smooth function going from 0 to 1.
Tsay gives an R function to fit a STAR model on page 186.
I could not find this code anywhere, but if someone will key it in
and send it to me, I’ll post it.
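In the meantime, the transition mechanism itself is a few lines of Python (names are mine; a logistic F is a common choice, though other smooth 0-to-1 functions work as well):

```python
import math

def logistic_transition(x_lag, delta, s):
    """Smooth weight F((x_{t-d} - Delta)/s), increasing from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-(x_lag - delta) / s))

def star1_mean(x_prev, x_lag, p1, p2, delta=0.0, s=1.0):
    """Conditional mean of a STAR(1); p1 = (phi_{1,0}, phi_{1,1}), p2 likewise."""
    w = logistic_transition(x_lag, delta, s)
    return p1[0] + p1[1] * x_prev + w * (p2[0] + p2[1] * x_prev)
```

Small s makes the transition sharp (approaching a SETAR model); large s makes it gradual.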
24
Markov Switching Model
Another simple transition model in which the underlying components are AR is the Markov switching model (MSA).
Here, the regime is chosen as a Markov process.
The model for a two-state MSA, as before, is
    X_t = φ_{1,0} + Σ_{i=1}^p φ_{1,i} x_{t−i} + A_{1t}   if in state 1, and
    X_t = φ_{2,0} + Σ_{i=1}^p φ_{2,i} x_{t−i} + A_{2t}   otherwise.
All we need to specify are the transition probabilities
Pr(St|st−1).
Fitting this model is a little harder, but again it can be done by maximum likelihood. The transition probabilities can be estimated by MCMC or by the EM algorithm.
25
The following slides are preliminary versions of the material we
will discuss on April 10.
26
Kernel Regression
Local regression is another type of nonlinear modeling.
A simple form of local regression is to use a filter or kernel
function to provide local weighting of the observed data.
This approach ensures that at a given point the observations
close to that point influence the estimate at the point more
strongly than more distant observations.
A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from a central point.
This function is the filter or the kernel.
A kernel function has two arguments representing the two points
in the convolution, but we typically use a single argument that
represents the distance between the two points.
27
Choice of Kernels
The standard normal density has the properties described above, so the kernel is often chosen to be the standard normal density.
As it turns out, the kernel density estimator is not very sensitive
to the form of the kernel.
Although the kernel may be from a parametric family of distributions, in kernel density estimation we do not estimate those parameters; hence, the kernel method is a nonparametric method.
28
Choice of Kernels
Sometimes, a kernel with finite support is easier to work with.
In the univariate case, a useful general form of a compact kernel is
    K(t) = κ_rs (1 − |t|^r)^s I_[−1,1](t),
where
    κ_rs = r / (2 B(1/r, s+1)),  for r > 0, s ≥ 0,
and B(a, b) is the complete beta function.
29
Choice of Kernels
This general form leads to several simple specific cases:
• for r = 1 and s = 0, it is the rectangular or uniform kernel;
• for r = 1 and s = 1, it is the triangular kernel;
• for r = 2 and s = 1 (κ_rs = 3/4), it is the “Epanechnikov” kernel, which yields the optimal rate of convergence of the MISE;
• for r = 2 and s = 2 (κ_rs = 15/16), it is the “biweight” kernel.
If r = 2 and s → ∞, we have the Gaussian kernel (with some
rescaling).
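The constants κ_rs and the kernels themselves are easy to compute; a Python sketch using the gamma-function form of the beta function (names mine):

```python
import math

def beta_fn(a, b):
    """Complete beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def kappa(r, s):
    """Normalizing constant kappa_rs = r / (2 B(1/r, s+1))."""
    return r / (2.0 * beta_fn(1.0 / r, s + 1))

def kernel(t, r, s):
    """K(t) = kappa_rs (1 - |t|^r)^s on [-1, 1], zero outside."""
    if abs(t) > 1:
        return 0.0
    return kappa(r, s) * (1.0 - abs(t) ** r) ** s
```

This reproduces the special cases above: κ_10 = 1/2 (uniform), κ_21 = 3/4 (Epanechnikov), and κ_22 = 15/16 (biweight).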
30
Kernel Methods
In kernel methods, the locality of influence is controlled by a
window around the point of interest.
The choice of the size of the window, or the “bandwidth”, is the
most important issue in the use of kernel methods.
In univariate applications, the window size is just a length, usually
denoted by “h” (except maybe in time series applications).
In practice, for a given choice of the size of the window, the
argument of the kernel function is transformed to reflect the
size.
The transformation is accomplished using a positive definite matrix, V, whose determinant measures the volume (size) of the window.
31
Local Linear Regression
Use of the kernel function is simple.
When least squares is the basic criterion, the kernel just becomes
the weight.
32
Choice of Bandwidth
There are two ways to choose a bandwidth.
One is based on the mean-integrated squared error (MISE).
In this method, the MISE under an assumed model is derived, and then the bandwidth that minimizes it is chosen.
The other method is a data-based method.
We use cross-validation to determine the optimal bandwidth.
In cross-validation, for a given bandwidth, we fit a model using all of the data except for a few points (“leave-out-d”), and then determine the SSE using all of the data.
We do this over a grid of bandwidths.
Then we do this multiple times (“k-fold cross-validation”).
The best bandwidth is the one that minimizes the SSE (from all of the data).
33
Nonparametric Smoothing
Kernel methods may be parametric or nonparametric.
In nonparametric methods, the kernels are generally simple.
There are various methods, such as running medians or running
(weighted) means.
Running means are moving averages.
The R function lowess does locally weighted smoothing using
weighted running means.
These methods are widely used for smoothing time series.
The emphasis is on prediction, rather than model building.
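A plain running mean is only a few lines of Python (a sketch with names of my choosing; lowess adds local weighting and robustness on top of this idea):

```python
def running_mean(x, window):
    """Centered moving average; the window shrinks near the ends of the series."""
    n = len(x)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out
```

Replacing the plain average with a kernel-weighted average of the window turns this into the locally weighted smoothers discussed above.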
34
General Additive Time Series Models
A model of the form
    y_i = β_0 + β_1 x_{1i} + · · · + β_m x_{mi} + ε_i
can be generalized by replacing the linear terms by unknown smooth functions (with specified forms):
    y_i = f_1(x_{1i}) + · · · + f_m(x_{mi}) + ε_i.
Hastie and Tibshirani have written extensively on such models.
35
Neural Networks
When the emphasis is on prediction, we can form a “black box”
algorithm that accepts a set of input values x, combines them
into intermediate values (in a “hidden layer”) and then combines
the values of the hidden layer into a single output y.
In a time series application, we have data r_1, . . . , r_n, and for i = k+1, . . . , n, we choose a subsequence x_i = (r_{i−1}, . . . , r_{i−k}) as an input to produce an output o_i as a predictor of r_i.
We train the neural net so as to minimize
    Σ (o_i − r_i)².
The R function nnet in the nnet package can be used to do this.
See Appendix B in Chapter 4 of Tsay.
Watch out for the assignment statements!
Never write R or S-Plus code like that!
36
Monte Carlo Forecasting
Monte Carlo can be used for forecasting in any time series model
(“parametric bootstrap”).
At forecast origin t we forecast at the horizon t + h by use of the fitted (or assumed) model and simulated errors (or “innovations”).
Doing this many times, we get a sample of values r̂^{(j)}_{t+h}.
The mean of this sample is the estimator r̂_{t+h}, and the sample quantiles provide confidence limits.
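For a fitted AR(1) the whole procedure is a few lines of Python (a sketch with illustrative names; real use would substitute the fitted model and residual-based innovations):

```python
import random

def mc_forecast(x_t, phi, sigma, h, n_rep=2000, seed=0):
    """Parametric-bootstrap h-step-ahead forecast for a fitted AR(1),
    x_{t+1} = phi * x_t + a_{t+1}, with simulated N(0, sigma^2) innovations.
    Returns (sample mean, 2.5% quantile, 97.5% quantile)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        x = x_t
        for _ in range(h):
            x = phi * x + rng.gauss(0.0, sigma)
        draws.append(x)
    draws.sort()
    mean = sum(draws) / n_rep
    return mean, draws[int(0.025 * n_rep)], draws[int(0.975 * n_rep)]
```

The sample mean approaches the analytic forecast φ^h x_t, and the quantiles give the prediction interval directly, with no normality argument needed.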
37
Fitting Time Series Models in R
There are a number of R functions that perform the computations to fit various time series models.
ARMA / ARIMA: arima (stats)
ARMA order determination: autofit (itsmr)
ARMA + GARCH: garchFit (fGarch)
APARCH: garchFit (fGarch)
The APARCH model includes the TGARCH and GJR models,
among others; see equation (2).
Also see the help page for fGarch-package in fGarch.
38
Other R Functions for Time Series Models
There are R functions for forecasting using different time series
models that have been fitted.
ARMA/ARIMA: predict.Arima (stats)
APARCH (including ARMA + GARCH): predict (fGarch)
There are also R functions for simulating data from different
time series models.
ARMA/ARIMA: arima.sim (stats)
APARCH (including ARMA + GARCH): garchSim (fGarch)
39