Nonlinear Time Series
Recall that a linear time series {X_t} is one that follows the relation
    X_t = μ + Σ_{i=0}^∞ ψ_i A_{t−i},
where {At} is iid with mean 0 and finite variance.
A linear time series is stationary if Σ_{i=0}^∞ ψ_i² < ∞.
A time series that cannot be put in this form is nonlinear.
1
Tests for Nonlinearity
What kinds of statistics would be useful in testing for nonlinearity?
Null hypothesis: the data follow a linear time series model.
One approach would be to fit some kind of general linear model
to the data, and then use some statistic computed from the
residuals.
Another approach would be based on comparing transforms of the data with the known properties of those transforms under the hypothesized model.
For tests regarding time series specifically (and maybe a few
other types of data) the transform could be into the frequency
domain.
A different approach would be to specify an alternative hypothesis and test against it specifically.
2
Tests for Autocorrelations of Squared Residuals
Residuals of what?
An ARMA(p, q) model is a pretty good approximation for linear
time series.
Before attempting to fit an ARMA model, we should do some
exploratory analyses to make sure we’re even in the right ballpark.
Are there any obvious departures from stationarity?
Trends? Would differencing help?
When it appears that we may have a stationary process, we go
through the usual motions to fit an ARMA(p, q) model.
Is the model a good fit?
What could go wrong?
There may be an ARCH effect.
3
Tests for Autocorrelations of Squared Residuals
The “ARCH effect” arises from autocorrelations of squared residuals from an ARMA model.
The simplest test for autocorrelations is based on the asymptotic normality of ρ̂(h) under the null hypothesis of 0 autocorrelation at lag h.
(Recall that the test would be a t test, where the denominator is
    √( (1 + 2 Σ_{i=1}^{h−1} ρ̂²(i)) / n ).
The denominator is not obvious.)
4
Tests for Autocorrelations of Squared Residuals
Of course, if ρ̂(h) is asymptotically normal, then ρ̂²(h) properly normalized is asymptotically chi-squared, and if ρ̂²(i) for i = 1, . . . , m are independent then the sum of them, each properly normalized, is asymptotically chi-squared with m degrees of freedom.
These facts led to the Q*(m) portmanteau test of Box and Pierce, and then to the modified portmanteau test of Ljung and Box, using the statistic
    Q(m) = n(n+2) Σ_{i=1}^m ρ̂²(i)/(n−i).
This is asymptotically chi-squared with m degrees of freedom.
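The Q(m) statistic is easy to compute directly from the sample autocorrelations. Here is a minimal Python sketch (the function names are mine, not from any package):

```python
def acf(x, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n   # gamma_hat(0), the sample variance
    return [
        sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / (n * c0)
        for h in range(1, max_lag + 1)
    ]

def ljung_box(x, m):
    """Ljung-Box statistic Q(m) = n(n+2) sum_{i=1}^m rho_hat(i)^2 / (n-i)."""
    n = len(x)
    return n * (n + 2) * sum(
        r ** 2 / (n - i) for i, r in enumerate(acf(x, m), start=1)
    )
```

Q(m) is then referred to a chi-squared distribution with m degrees of freedom; applied to squared residuals it becomes the McLeod-Li check discussed below.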
5
Tests for Autocorrelations of Squared Residuals
As we have seen, the Q test applied to squared residuals can be
used to detect an ARCH effect, as suggested by McLeod and Li.
We choose a value of m.
So this is one test for nonlinearity.
A related test is the F test suggested by Engle. This is the usual
F test of
    H0 : β1 = · · · = βm = 0
in the linear regression model
    a²_t = β0 + β1 a²_{t−1} + · · · + βm a²_{t−m} + e_t,
where the a_t are the residuals from the fitted ARMA model.
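For m = 1 this is just a simple regression of a²_t on a²_{t−1}, with F computed from the regression R². A self-contained Python sketch (the function name is mine; in practice the input would be residuals from a fitted ARMA model):

```python
def engle_arch_f(resid):
    """F statistic for H0: beta_1 = 0 in a_t^2 = b0 + b1*a_{t-1}^2 + e_t (m = 1 case).

    `resid` plays the role of residuals from a fitted ARMA model."""
    a2 = [r * r for r in resid]
    y, x = a2[1:], a2[:-1]                 # regress a_t^2 on its first lag
    T = len(y)
    mx, my = sum(x) / T, sum(y) / T
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((v - b0 - b1 * u) ** 2 for u, v in zip(x, y))
    sst = sum((v - my) ** 2 for v in y)
    r2 = 1.0 - sse / sst
    return r2 / ((1.0 - r2) / (T - 2))     # F = (R^2/1) / ((1-R^2)/(T-2))
```

A series with strong volatility clustering gives a large F; residuals with no ARCH effect give a small one.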
6
Stationary Processes in the Frequency Domain
Time series models of the form
    X_t = f(X_{t−1}, X_{t−2}, . . . , A_t, A_{t−1}, . . . ; β)
are said to be represented in the “time domain”.
Representations of a time series as a composition of periodic
behaviors are said to be in the “frequency domain”.
Processes with strong periodic behavior and periodic processes
with a small number of periodicities (audio signals, for example)
are usually modeled better in the frequency domain than in the
time domain.
Financial time series are best analyzed in the time domain.
Stationary processes (and of course, there aren’t many of those
in financial time series!) have an important relationship between
a time-domain measure and a frequency-domain function.
7
Spectral Representation of the ACVF
If we have a stationary process with autocovariance γ(h), then there exists a unique monotonically increasing function F(ω) on the closed interval [−1/2, 1/2], such that F(−1/2) = 0, F(1/2) = γ(0), and
    γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω).
The function F (ω) is called the spectral distribution function.
The proof of this theorem, “the spectral representation theorem”, is available in many books, but we will not prove it in this class.
Note that my notation for Fourier transforms may differ slightly from that in Tsay; the difference is whether frequencies are measured in cycles, as here, or in radians.
8
Spectral Density
The derivative of the spectral distribution function F(ω), which we write as f(ω), is a measure of the intensity of any periodic component at the frequency ω.
We call f(ω) the spectral density.
The ACVF is essentially the Fourier transform of the spectral
density.
By the Inversion Theorem for the Fourier transform, we have, for −1/2 ≤ ω ≤ 1/2,
    f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},
or, for a zero-mean process,
    f(ω) = Σ_{h=−∞}^∞ E(X_t X_{t+h}) e^{−2πiωh}.
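As a quick check of the inversion formula, take an MA(1) process X_t = A_t + θ A_{t−1} with Var(A_t) = σ²: only γ(0) = σ²(1+θ²) and γ(±1) = σ²θ are nonzero, and the sum collapses to σ²|1 + θe^{−2πiω}|². A small Python verification (illustrative code, names mine):

```python
import cmath
import math

def spectral_density_ma1(omega, theta, sigma2=1.0):
    """f(omega) = sum_h gamma(h) e^{-2 pi i omega h} for MA(1): X_t = A_t + theta*A_{t-1}."""
    gamma = {0: sigma2 * (1 + theta ** 2), 1: sigma2 * theta, -1: sigma2 * theta}
    total = sum(g * cmath.exp(-2j * math.pi * omega * h) for h, g in gamma.items())
    return total.real

def closed_form(omega, theta, sigma2=1.0):
    """The same density via the MA polynomial Psi(z) = 1 + theta*z."""
    return sigma2 * abs(1 + theta * cmath.exp(-2j * math.pi * omega)) ** 2
```

The two agree at every frequency, which is the finite analogue of the f(ω) = E(A_t²) Ψ(e^{−2πiω}) Ψ(e^{2πiω}) form used below.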
9
Bispectral Density
Now, if the third moment E(X_t X_{t+u} X_{t+v}) exists and is finite, in the linear time series (taking μ = 0)
    X_t = Σ_{i=0}^∞ ψ_i A_{t−i},
we have
    E(X_t X_{t+u} X_{t+v}) = E(A_t³) Σ_{i=0}^∞ ψ_i ψ_{i+u} ψ_{i+v}.
Now by analogy, we call the double Fourier transform
    b(ω_1, ω_2) = Σ_{u=−∞}^∞ Σ_{v=−∞}^∞ E(X_t X_{t+u} X_{t+v}) e^{−2πi(ω_1 u + ω_2 v)}
the bispectral density.
10
Spectral Densities
Letting Ψ represent the polynomial formed in the usual way from the ψ_i in the linear time series model, we have for the spectral density
    f(ω) = E(A_t²) Ψ(e^{−2πiω}) Ψ(e^{2πiω});
and for the bispectral density,
    b(ω_1, ω_2) = E(A_t³) Ψ(e^{−2πiω_1}) Ψ(e^{−2πiω_2}) Ψ(e^{2πi(ω_1+ω_2)}).
Now, note in this case that
    |b(ω_1, ω_2)|² / ( f(ω_1) f(ω_2) f(ω_1+ω_2) ) = (E(A_t³))² / (E(A_t²))³,
which is constant.
11
Bispectral Test
The constancy of the ratio on the previous slide provides the
basis for a test of nonlinearity.
How would you do that?
Compute it for several subsequences.
There are various nonparametric tests for constancy, and consequently there are various bispectral tests.
Notice also that the numerator in the test statistic is 0, if the
time series is linear and the errors have a normal distribution.
12
BDS Test
The BDS test is named after Brock, Dechert, and Scheinkman, who proposed it.
The test is for strict stationarity of the error process.
For the data x_1, . . . , x_n, it is based on normalized counts of closeness of subsequences X^m_i and X^m_j, where X^m_i = (x_i, . . . , x_{i+m−1}).
For fixed δ > 0, closeness is measured by how many subsequences are within δ of each other in the sup norm.
We define
    I_δ(X^m_i, X^m_j) = 1 if ‖X^m_i − X^m_j‖_∞ ≤ δ, and 0 otherwise.
13
BDS Test
We compare the counts for subsequences of length 1 and k:
    C_1(δ, n) = (2 / (n(n−1))) Σ_{i<j} I_δ(X^1_i, X^1_j)
and
    C_k(δ, n) = (2 / ((n−k+1)(n−k))) Σ_{i<j} I_δ(X^k_i, X^k_j).
In the iid case,
    C_k(δ, n) → (C_1(δ, n))^k,
and asymptotically √n (C_k(δ, n) − (C_1(δ, n))^k) is normal with mean 0 and known variance (see Tsay, page 208).
The null hypothesis that the errors are iid, which is one of the properties of a linear time series, is tested using extreme quantiles of the normal distribution.
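The counts C_m(δ, n) themselves can be computed by brute force in O(n²); a minimal Python sketch (function name mine; the normalization and variance are handled for you by packaged implementations):

```python
def correlation_integral(x, m, delta):
    """C_m(delta, n): normalized count of length-m subsequence pairs
    within delta of each other in the sup norm."""
    subs = [x[i:i + m] for i in range(len(x) - m + 1)]
    n_sub = len(subs)
    close = sum(
        1
        for i in range(n_sub)
        for j in range(i + 1, n_sub)
        if max(abs(a - b) for a, b in zip(subs[i], subs[j])) <= delta
    )
    return 2.0 * close / (n_sub * (n_sub - 1))
```

With m = 1 there are n subsequences and the denominator is n(n−1); with m = k there are n−k+1 and it is (n−k+1)(n−k), matching the definitions above.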
14
BDS Test
Notice that the BDS test depends on two quantities, δ and k.
Obviously, k should be small relative to n.
There is an R function, bds.test, in the tseries package that
performs the computations for the BDS test.
15
RESET Test
The Regression Equation Specification Error Test (RESET) of Ramsey is a general test for misspecification of a linear model (not just a linear time series).
It may detect omitted variables, incorrect functional form, and
heteroscedasticity.
For applications of the RESET test to linear time series, we
assume that the linear model is an AR model.
The test statistic is an F statistic computed from the residuals
of a fitted AR model (see equation (4.44) in Tsay).
Because of the omnibus nature of the alternative hypothesis, the performance of the test is highly variable, and it often has very low power.
There is an R function, resettest, in the lmtest package that
performs the computations for the RESET test.
16
F Tests
There are several variations on the F statistic used in the RESET
test.
Tsay mentions some of these, and you can probably think of
other modifications of the basic ideas.
17
Threshold Test
There are various types of tests that could be constructed based
on dividing the time series into different regimes.
Simple approaches would be based on regimes that are separated
by fixed time points.
Other approaches could be based on regimes in which either
observed values or fitted residuals appear to be different.
Obviously, if the data are used to identify possible thresholds, the significance level of a test must take that fact into consideration.
In general, however they are constructed, the test statistics are F statistics.
18
Time Series Models
I don’t know of any other area of statistics that has so many
different models as in the time domain of time series analysis.
Each model has its own name, and sometimes more than one name.
The common linear models are of the AR and MA types.
We combine AR and MA to get ARMA.
Then we difference a time series to get an ARMA, and call the
complete model ARIMA.
Next, we form AR and MA relations at multiples of longer lags.
This yields seasonal ARMA and ARIMA models.
Most of the linear models in the time domain are of these types.
Then we have the nonlinear time series models.
19
Nonlinear Time Series
Four general types that are useful:
• models of squared quantities such as variances; these are often coupled with other models to allow stochastic volatility; ARIMA+GARCH, for example
• bilinear models
    X_t = c + Σ_{i=1}^p φ_i X_{t−i} − Σ_{j=1}^q θ_j A_{t−j} + Σ_{i=1}^m Σ_{j=1}^s β_ij X_{t−i} A_{t−j} + A_t
• random coefficient models
    X_t = Σ_{i=1}^p (φ_i + U^{(i)}_t) X_{t−i} + A_t
• threshold models – Tsay describes a number of these in Chapters 3 and 4.
Another general source of nonlinearity is local fitting of a general
model.
20
Nonlinear Time Series
The area of nonlinear time series models is where the small modifications with their own specific names really proliferate.
First, we have the basic ones that account for conditional heteroscedasticity: ARCH and GARCH.
Then the modifications (Chapter 3): IGARCH, GARCH-M, EGARCH, TGARCH (also GJR), CHARMA (or RCA), LMSV.
Then further modifications (Chapter 4): TAR (similar to TGARCH, but for linear terms), SETAR (“self-exciting” TAR; the regime depends on a lagged value), STAR, MSA (or MSAR), NAAR.
Other models are local regression models.
Finally, we have algorithmic models, such as neural nets.
I am not going to consider all of these little variations.
The most common method of fitting these models is by maximum likelihood.
There are R functions for many of them.
21
Time Series Models
The names of the wide variety of time series models that evolved from the basic ARCH and GARCH models can be rather confusing.
Some models go by different names. Tsay sometimes refers to the TGARCH(m, s) model as the GJR model (see p. 149). (“GJR” is not in the index for his book.)
Most of the models that Tsay uses are special cases of the APARCH model of Ding, Granger, and Engle (1993).
This model is
    A_t = σ_t ε_t,   (1)
as in the basic ARCH model, and
    σ_t^δ = α_0 + Σ_{i=1}^m α_i (|A_{t−i}| − γ_i A_{t−i})^δ + Σ_{j=1}^s β_j σ_{t−j}^δ.   (2)
This model includes several of the other variations on the GARCH model.
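A simulation sketch of the m = s = 1 case of equation (2) in Python (function and parameter names are mine; |γ₁| ≤ 1 keeps the driving term nonnegative):

```python
import random

def simulate_aparch(n, alpha0, alpha1, gamma1, beta1, delta, seed=0):
    """Simulate A_t = sigma_t * eps_t with the APARCH(1,1) recursion
    sigma_t^delta = alpha0 + alpha1*(|A_{t-1}| - gamma1*A_{t-1})^delta
                    + beta1*sigma_{t-1}^delta.
    Requires |gamma1| <= 1 so the driving term is nonnegative."""
    rng = random.Random(seed)
    sigma_d = alpha0 / (1.0 - alpha1 - beta1)   # rough starting value for sigma^delta
    a_prev = 0.0
    path = []
    for _ in range(n):
        sigma_d = (alpha0
                   + alpha1 * (abs(a_prev) - gamma1 * a_prev) ** delta
                   + beta1 * sigma_d)
        a_prev = sigma_d ** (1.0 / delta) * rng.gauss(0.0, 1.0)
        path.append(a_prev)
    return path
```

Setting δ = 2 and γ₁ = 0 recovers an ordinary GARCH(1,1); nonzero γ₁ gives the leverage effect of the TGARCH/GJR variants.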
22
Transition or Threshold Models
A two-regime transition model is of the general form
    X_t = g_1(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_1) + A_t   if condition 1, and
    X_t = g_2(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_2) + A_t   otherwise.
A threshold model usually depends on the past state and so is of the general form
    X_t = g_1(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_1) + A_t   if (x_{t−1}, . . . , x_{t−p}) ∈ R_1, and
    X_t = g_2(x_{t−1}, . . . , x_{t−p}, a_{t−1}, . . . , a_{t−q}; β_2) + A_t   otherwise.
For the specific case of an AR model, the g_i functions above are linear functions of x_{t−1}, . . . , x_{t−p}.
Also, the condition (x_{t−1}, . . . , x_{t−p}) ∈ R_1 is usually simplified to a simple form x_{t−d} ∈ R_1.
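With that simplification, a two-regime SETAR(1) is easy to simulate (an illustrative Python sketch with names of my choosing, not Tsay's code):

```python
import random

def simulate_setar(n, phi1, phi2, threshold=0.0, d=1, seed=0):
    """Two-regime SETAR(1): AR coefficient phi1 when x_{t-d} <= threshold, else phi2."""
    rng = random.Random(seed)
    x = [0.0] * d                      # starting values
    for _ in range(n):
        phi = phi1 if x[-d] <= threshold else phi2
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x[d:]
```

The regime is chosen from the lagged value x_{t−d}, which is what makes the model “self-exciting”.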
23
Smooth Transition Autoregressive (STAR)
Model
An obvious modification is to make the transition smooth by
using a smooth weighting function.
If the linear functions are AR relationships, we have a simple instance, namely, the STAR(p) model:
    X_t = φ_{1,0} + Σ_{i=1}^p φ_{1,i} x_{t−i} + F((x_{t−d} − Δ)/s) (φ_{2,0} + Σ_{i=1}^p φ_{2,i} x_{t−i}) + A_t,
where F(·) is a smooth function going from 0 to 1.
Tsay gives an R function to fit a STAR model on page 186.
I could not find this code anywhere, but if someone will key it in
and send it to me, I’ll post it.
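In the meantime, the transition mechanism itself is a few lines of Python (names are mine; a logistic F is a common choice, though other smooth 0-to-1 functions work as well):

```python
import math

def logistic_transition(x_lag, delta, s):
    """Smooth weight F((x_{t-d} - Delta)/s), increasing from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-(x_lag - delta) / s))

def star1_mean(x_prev, x_lag, p1, p2, delta=0.0, s=1.0):
    """Conditional mean of a STAR(1); p1 = (phi_{1,0}, phi_{1,1}), p2 likewise."""
    w = logistic_transition(x_lag, delta, s)
    return p1[0] + p1[1] * x_prev + w * (p2[0] + p2[1] * x_prev)
```

Small s makes the transition sharp (approaching a SETAR model); large s makes it gradual.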
24
Markov Switching Model
Another simple transition model in which the underlying components are AR is the Markov switching model (MSA).
Here, the regime is chosen as a Markov process.
The model for a two-state MSA, as before, is
    X_t = φ_{1,0} + Σ_{i=1}^p φ_{1,i} x_{t−i} + A_{1t}   if in state 1, and
    X_t = φ_{2,0} + Σ_{i=1}^p φ_{2,i} x_{t−i} + A_{2t}   otherwise.
All we need to specify are the transition probabilities
Pr(St|st−1).
Fitting this model is a little harder, but again it can be done by maximum likelihood. The transition probabilities can be estimated by MCMC or by the EM algorithm.
25
The following slides are preliminary versions of the material we
will discuss on April 10.
26
Kernel Regression
Local regression is another type of nonlinear modeling.
A simple form of local regression is to use a filter or kernel
function to provide local weighting of the observed data.
This approach ensures that at a given point the observations
close to that point influence the estimate at the point more
strongly than more distant observations.
A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from a central point.
This function is the filter or the kernel.
A kernel function has two arguments representing the two points
in the convolution, but we typically use a single argument that
represents the distance between the two points.
27
Choice of Kernels
The standard normal density has the properties described above, so the kernel is often chosen to be the standard normal density.
As it turns out, the kernel density estimator is not very sensitive
to the form of the kernel.
Although the kernel may be from a parametric family of distributions, in kernel density estimation we do not estimate those parameters; hence, the kernel method is a nonparametric method.
28
Choice of Kernels
Sometimes, a kernel with finite support is easier to work with.
In the univariate case, a useful general form of a compact kernel is
    K(t) = κ_rs (1 − |t|^r)^s I_[−1,1](t),
where
    κ_rs = r / (2 B(1/r, s+1)),  for r > 0, s ≥ 0,
and B(a, b) is the complete beta function.
29
Choice of Kernels
This general form leads to several simple specific cases:
• for r = 1 and s = 0, it is the rectangular or uniform kernel;
• for r = 1 and s = 1, it is the triangular kernel;
• for r = 2 and s = 1 (κ_rs = 3/4), it is the “Epanechnikov” kernel, which yields the optimal rate of convergence of the MISE;
• for r = 2 and s = 2 (κ_rs = 15/16), it is the “biweight” kernel.
If r = 2 and s → ∞, we have the Gaussian kernel (with some
rescaling).
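The constants κ_rs and the kernels themselves are easy to compute; a Python sketch using the gamma-function form of the beta function (names mine):

```python
import math

def beta_fn(a, b):
    """Complete beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def kappa(r, s):
    """Normalizing constant kappa_rs = r / (2 B(1/r, s+1))."""
    return r / (2.0 * beta_fn(1.0 / r, s + 1))

def kernel(t, r, s):
    """K(t) = kappa_rs (1 - |t|^r)^s on [-1, 1], zero outside."""
    if abs(t) > 1:
        return 0.0
    return kappa(r, s) * (1.0 - abs(t) ** r) ** s
```

This reproduces the special cases above: κ_10 = 1/2 (uniform), κ_21 = 3/4 (Epanechnikov), and κ_22 = 15/16 (biweight).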
30
Kernel Methods
In kernel methods, the locality of influence is controlled by a
window around the point of interest.
The choice of the size of the window, or the “bandwidth”, is the
most important issue in the use of kernel methods.
In univariate applications, the window size is just a length, usually
denoted by “h” (except maybe in time series applications).
In practice, for a given choice of the size of the window, the
argument of the kernel function is transformed to reflect the
size.
The transformation is accomplished using a positive definite matrix, V, whose determinant measures the volume (size) of the window.
31
Local Linear Regression
Use of the kernel function is simple.
When least squares is the basic criterion, the kernel just becomes
the weight.
32
Choice of Bandwidth
There are two ways to choose a bandwidth.
One is based on the mean-integrated squared error (MISE).
In this method, the MISE under an assumed model is derived, and then the bandwidth that minimizes it is chosen.
The other method is a data-based method.
We use cross-validation to determine the optimal bandwidth.
In cross-validation, for a given bandwidth, we fit a model using all of the data except for a few points (“leave-out-d”), and then determine the SSE using all of the data.
We do this over a grid of bandwidths.
Then we do this multiple times (“k-fold cross-validation”).
The best bandwidth is the one that minimizes the SSE (from all of the data).
33
Nonparametric Smoothing
Kernel methods may be parametric or nonparametric.
In nonparametric methods, the kernels are generally simple.
There are various methods, such as running medians or running
(weighted) means.
Running means are moving averages.
The R function lowess does locally weighted smoothing using
weighted running means.
These methods are widely used for smoothing time series.
The emphasis is on prediction, rather than model building.
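A plain running mean is only a few lines of Python (a sketch with names of my choosing; lowess adds local weighting and robustness on top of this idea):

```python
def running_mean(x, window):
    """Centered moving average; the window shrinks near the ends of the series."""
    n = len(x)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out
```

Replacing the plain average with a kernel-weighted average of the window turns this into the locally weighted smoothers discussed above.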
34
General Additive Time Series Models
A model of the form
    y_i = β_0 + β_1 x_{1i} + · · · + β_m x_{mi} + ε_i
can be generalized by replacing the linear terms by unknown smooth functions (with specified forms):
    y_i = f_1(x_{1i}) + · · · + f_m(x_{mi}) + ε_i.
Hastie and Tibshirani have written extensively on such models.
35
Neural Networks
When the emphasis is on prediction, we can form a “black box”
algorithm that accepts a set of input values x, combines them
into intermediate values (in a “hidden layer”) and then combines
the values of the hidden layer into a single output y.
In a time series application, we have data r_1, . . . , r_n, and for i = k+1, . . . , n, we choose a subsequence x_i = (r_{i−1}, . . . , r_{i−k}) as an input to produce an output o_i as a predictor of r_i.
We train the neural net so as to minimize
    Σ (o_i − r_i)².
The R function nnet in the nnet package can be used to do this.
See Appendix B in Chapter 4 of Tsay.
Watch out for the assignment statements!
Never write R or S-Plus code like that!
36
Monte Carlo Forecasting
Monte Carlo can be used for forecasting in any time series model
(“parametric bootstrap”).
At forecast origin t we forecast at the horizon t + h by use of the fitted (or assumed) model and simulated errors (or “innovations”).
Doing this many times, we get a sample of values r̂^{(j)}_{t+h}.
The mean of this sample is the estimator r̂_{t+h}, and the sample quantiles provide confidence limits.
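For a fitted AR(1) the whole procedure is a few lines of Python (a sketch with illustrative names; real use would substitute the fitted model and residual-based innovations):

```python
import random

def mc_forecast(x_t, phi, sigma, h, n_rep=2000, seed=0):
    """Parametric-bootstrap h-step-ahead forecast for a fitted AR(1),
    x_{t+1} = phi * x_t + a_{t+1}, with simulated N(0, sigma^2) innovations.
    Returns (sample mean, 2.5% quantile, 97.5% quantile)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        x = x_t
        for _ in range(h):
            x = phi * x + rng.gauss(0.0, sigma)
        draws.append(x)
    draws.sort()
    mean = sum(draws) / n_rep
    return mean, draws[int(0.025 * n_rep)], draws[int(0.975 * n_rep)]
```

The sample mean approaches the analytic forecast φ^h x_t, and the quantiles give the prediction interval directly, with no normality argument needed.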
37
Fitting Time Series Models in R
There are a number of R functions that perform the computations to fit various time series models.
ARMA / ARIMA: arima (stats)
ARMA order determination: autofit (itsmr)
ARMA + GARCH: garchFit (fGarch)
APARCH: garchFit (fGarch)
The APARCH model includes the TGARCH and GJR models,
among others; see equation (2).
Also see the help page for fGarch-package in fGarch.
38
Other R Functions for Time Series Models
There are R functions for forecasting using different time series
models that have been fitted.
ARMA/ARIMA: predict.Arima (stats)
APARCH (including ARMA + GARCH): predict (fGarch)
There are also R functions for simulating data from different
time series models.
ARMA/ARIMA: arima.sim (stats)
APARCH (including ARMA + GARCH): garchSim (fGarch)
39