
Springer Texts in Statistics

Series Editors: G. Casella, S. Fienberg, I. Olkin

For other titles published in this series, go to
www.springer.com/series/417

Robert H. Shumway • David S. Stoffer

Time Series Analysis and Its Applications
With R Examples

Third edition

Prof. Robert H. Shumway
Department of Statistics
University of California
Davis, California
USA

Prof. David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, Pennsylvania
USA

ISSN 1431-875X
ISBN 978-1-4419-7864-6
e-ISBN 978-1-4419-7865-3
DOI 10.1007/978-1-4419-7865-3
Springer New York Dordrecht Heidelberg London

© Springer Science+Business Media, LLC 2011. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To my wife, Ruth, for her support and joie de vivre, and to the memory of my thesis adviser, Solomon Kullback.

R.H.S.

To my family and friends, who constantly remind me what is important.

D.S.S.


Preface to the Third Edition

The goals of this book are to develop an appreciation for the richness and versatility of modern time series analysis as a tool for analyzing data, and still maintain a commitment to theoretical integrity, as exemplified by the seminal works of Brillinger (1975) and Hannan (1970) and the texts by Brockwell and Davis (1991) and Fuller (1995). The advent of inexpensive powerful computing has provided both real data and new software that can take one considerably beyond the fitting of simple time domain models, such as have been elegantly described in the landmark work of Box and Jenkins (1970). This book is designed to be useful as a text for courses in time series on several different levels and as a reference work for practitioners facing the analysis of time-correlated data in the physical, biological, and social sciences.

We have used earlier versions of the text at both the undergraduate and graduate levels over the past decade. Our experience is that an undergraduate course can be accessible to students with a background in regression analysis and may include §1.1–§1.6, §2.1–§2.3, the results and numerical parts of §3.1–§3.9, and briefly the results and numerical parts of §4.1–§4.6. At the advanced undergraduate or master's level, where the students have some mathematical statistics background, more detailed coverage of the same sections, with the inclusion of §2.4 and extra topics from Chapter 5 or Chapter 6, can be used as a one-semester course. Often, the extra topics are chosen by the students according to their interests. Finally, a two-semester upper-level graduate course for mathematics, statistics, and engineering graduate students can be crafted by adding selected theoretical appendices. For the upper-level graduate course, we should mention that we are striving for a broader but less rigorous level of coverage than that which is attained by Brockwell and Davis (1991), the classic entry at this level.

The major difference between this third edition of the text and the second edition is that we provide R code for almost all of the numerical examples. In addition, we provide an R supplement for the text that contains the data and scripts in a compressed file called tsa3.rda; the supplement is available on the website for the third edition, http://www.stat.pitt.edu/stoffer/tsa3/, or one of its mirrors. On the website, we also provide the code used in each example so that the reader may simply copy-and-paste code directly into R. Specific details are given in Appendix R and on the website for the text. Appendix R is new to this edition, and it includes a small R tutorial as well as providing a reference for the data sets and scripts included in tsa3.rda. So there is no misunderstanding, we emphasize the fact that this text is about time series analysis, not about R. R code is provided simply to enhance the exposition by making the numerical examples reproducible.

We have tried, where possible, to keep the problem sets in order so that an instructor may have an easy time moving from the second edition to the third edition. However, some of the old problems have been revised and there are some new problems. Also, some of the data sets have been updated. We added one section in Chapter 5 on unit roots and enhanced some of the presentations throughout the text. The exposition on state-space modeling, ARMAX models, and (multivariate) regression with autocorrelated errors in Chapter 6 has been expanded. In this edition, we use standard R functions as much as possible, but we use our own scripts (included in tsa3.rda) when we feel it is necessary to avoid problems with a particular R function; these problems are discussed in detail on the website for the text under R Issues.

We thank John Kimmel, Executive Editor, Springer Statistics, for his guidance in the preparation and production of this edition of the text. We are grateful to Don Percival, University of Washington, for numerous suggestions that led to substantial improvement to the presentation in the second edition, and consequently in this edition. We thank Doug Wiens, University of Alberta, for help with some of the R code in Chapters 4 and 7, and for his many suggestions for improvement of the exposition. We are grateful for the continued help and advice of Pierre Duchesne, University of Montreal, and Alexander Aue, University of California, Davis. We also thank the many students and other readers who took the time to mention typographical errors and other corrections to the first and second editions. Finally, work on this edition was supported by the National Science Foundation while one of us (D.S.S.) was working at the Foundation under the Intergovernmental Personnel Act.

Davis, CA          Robert H. Shumway
Pittsburgh, PA     David S. Stoffer
September 2010


Contents

Preface to the Third Edition .......... vii

1 Characteristics of Time Series .......... 1
  1.1 Introduction .......... 1
  1.2 The Nature of Time Series Data .......... 3
  1.3 Time Series Statistical Models .......... 11
  1.4 Measures of Dependence: Autocorrelation and Cross-Correlation .......... 17
  1.5 Stationary Time Series .......... 22
  1.6 Estimation of Correlation .......... 28
  1.7 Vector-Valued and Multidimensional Series .......... 33
  Problems .......... 39

2 Time Series Regression and Exploratory Data Analysis .......... 47
  2.1 Introduction .......... 47
  2.2 Classical Regression in the Time Series Context .......... 48
  2.3 Exploratory Data Analysis .......... 57
  2.4 Smoothing in the Time Series Context .......... 70
  Problems .......... 78

3 ARIMA Models .......... 83
  3.1 Introduction .......... 83
  3.2 Autoregressive Moving Average Models .......... 84
  3.3 Difference Equations .......... 97
  3.4 Autocorrelation and Partial Autocorrelation .......... 102
  3.5 Forecasting .......... 108
  3.6 Estimation .......... 121
  3.7 Integrated Models for Nonstationary Data .......... 141
  3.8 Building ARIMA Models .......... 144
  3.9 Multiplicative Seasonal ARIMA Models .......... 154
  Problems .......... 162

4 Spectral Analysis and Filtering .......... 173
  4.1 Introduction .......... 173
  4.2 Cyclical Behavior and Periodicity .......... 175
  4.3 The Spectral Density .......... 180
  4.4 Periodogram and Discrete Fourier Transform .......... 187
  4.5 Nonparametric Spectral Estimation .......... 196
  4.6 Parametric Spectral Estimation .......... 212
  4.7 Multiple Series and Cross-Spectra .......... 216
  4.8 Linear Filters .......... 221
  4.9 Dynamic Fourier Analysis and Wavelets .......... 228
  4.10 Lagged Regression Models .......... 242
  4.11 Signal Extraction and Optimum Filtering .......... 247
  4.12 Spectral Analysis of Multidimensional Series .......... 252
  Problems .......... 255

5 Additional Time Domain Topics .......... 267
  5.1 Introduction .......... 267
  5.2 Long Memory ARMA and Fractional Differencing .......... 267
  5.3 Unit Root Testing .......... 277
  5.4 GARCH Models .......... 280
  5.5 Threshold Models .......... 289
  5.6 Regression with Autocorrelated Errors .......... 293
  5.7 Lagged Regression: Transfer Function Modeling .......... 296
  5.8 Multivariate ARMAX Models .......... 301
  Problems .......... 315

6 State-Space Models .......... 319
  6.1 Introduction .......... 319
  6.2 Filtering, Smoothing, and Forecasting .......... 325
  6.3 Maximum Likelihood Estimation .......... 335
  6.4 Missing Data Modifications .......... 344
  6.5 Structural Models: Signal Extraction and Forecasting .......... 350
  6.6 State-Space Models with Correlated Errors .......... 354
    6.6.1 ARMAX Models .......... 355
    6.6.2 Multivariate Regression with Autocorrelated Errors .......... 356
  6.7 Bootstrapping State-Space Models .......... 359
  6.8 Dynamic Linear Models with Switching .......... 365
  6.9 Stochastic Volatility .......... 378
  6.10 Nonlinear and Non-normal State-Space Models Using Monte Carlo Methods .......... 387
  Problems .......... 398

7 Statistical Methods in the Frequency Domain .......... 405
  7.1 Introduction .......... 405
  7.2 Spectral Matrices and Likelihood Functions .......... 409
  7.3 Regression for Jointly Stationary Series .......... 410
  7.4 Regression with Deterministic Inputs .......... 420
  7.5 Random Coefficient Regression .......... 429
  7.6 Analysis of Designed Experiments .......... 434
  7.7 Discrimination and Cluster Analysis .......... 450
  7.8 Principal Components and Factor Analysis .......... 468
  7.9 The Spectral Envelope .......... 485
  Problems .......... 501

Appendix A: Large Sample Theory .......... 507
  A.1 Convergence Modes .......... 507
  A.2 Central Limit Theorems .......... 515
  A.3 The Mean and Autocorrelation Functions .......... 518

Appendix B: Time Domain Theory .......... 527
  B.1 Hilbert Spaces and the Projection Theorem .......... 527
  B.2 Causal Conditions for ARMA Models .......... 531
  B.3 Large Sample Distribution of the AR(p) Conditional Least Squares Estimators .......... 533
  B.4 The Wold Decomposition .......... 537

Appendix C: Spectral Domain Theory .......... 539
  C.1 Spectral Representation Theorem .......... 539
  C.2 Large Sample Distribution of the DFT and Smoothed Periodogram .......... 543
  C.3 The Complex Multivariate Normal Distribution .......... 554

Appendix R: R Supplement .......... 559
  R.1 First Things First .......... 559
    R.1.1 Included Data Sets .......... 560
    R.1.2 Included Scripts .......... 562
  R.2 Getting Started .......... 567
  R.3 Time Series Primer .......... 571

References .......... 577

Index .......... 591

1 Characteristics of Time Series

1.1 Introduction

The analysis of experimental data that have been observed at different points in time leads to new and unique problems in statistical modeling and inference. The obvious correlation introduced by the sampling of adjacent points in time can severely restrict the applicability of the many conventional statistical methods traditionally dependent on the assumption that these adjacent observations are independent and identically distributed. The systematic approach by which one goes about answering the mathematical and statistical questions posed by these time correlations is commonly referred to as time series analysis.

The impact of time series analysis on scientific applications can be partially documented by producing an abbreviated listing of the diverse fields in which important time series problems may arise. For example, many familiar time series occur in the field of economics, where we are continually exposed to daily stock market quotations or monthly unemployment figures. Social scientists follow population series, such as birthrates or school enrollments. An epidemiologist might be interested in the number of influenza cases observed over some time period. In medicine, blood pressure measurements traced over time could be useful for evaluating drugs used in treating hypertension. Functional magnetic resonance imaging of brain-wave time series patterns might be used to study how the brain reacts to certain stimuli under various experimental conditions.

Many of the most intensive and sophisticated applications of time series methods have been to problems in the physical and environmental sciences. This fact accounts for the basic engineering flavor permeating the language of time series analysis. One of the earliest recorded series is the monthly sunspot numbers studied by Schuster (1906). More modern investigations may center on whether a warming is present in global temperature measurements


or whether levels of pollution may influence daily mortality in Los Angeles. The modeling of speech series is an important problem related to the efficient transmission of voice recordings. Common features in a time series characteristic known as the power spectrum are used to help computers recognize and translate speech. Geophysical time series such as those produced by yearly depositions of various kinds can provide long-range proxies for temperature and rainfall. Seismic recordings can aid in mapping fault lines or in distinguishing between earthquakes and nuclear explosions.

The above series are only examples of experimental databases that can be used to illustrate the process by which classical statistical methodology can be applied in the correlated time series framework. In our view, the first step in any time series investigation always involves careful scrutiny of the recorded data plotted over time. This scrutiny often suggests the method of analysis as well as statistics that will be of use in summarizing the information in the data. Before looking more closely at the particular statistical methods, it is appropriate to mention that two separate, but not necessarily mutually exclusive, approaches to time series analysis exist, commonly identified as the time domain approach and the frequency domain approach.

The time domain approach is generally motivated by the presumption that correlation between adjacent points in time is best explained in terms of a dependence of the current value on past values. The time domain approach focuses on modeling some future value of a time series as a parametric function of the current and past values. In this scenario, we begin with linear regressions of the present value of a time series on its own past values and on the past values of other series. This modeling leads one to use the results of the time domain approach as a forecasting tool and is particularly popular with economists for this reason.

One approach, advocated in the landmark work of Box and Jenkins (1970; see also Box et al., 1994), develops a systematic class of models called autoregressive integrated moving average (ARIMA) models to handle time-correlated modeling and forecasting. The approach includes a provision for treating more than one input series through multivariate ARIMA or through transfer function modeling. The defining feature of these models is that they are multiplicative models, meaning that the observed data are assumed to result from products of factors involving differential or difference equation operators responding to a white noise input.

A more recent approach to the same problem uses additive models more familiar to statisticians. In this approach, the observed data are assumed to result from sums of series, each with a specified time series structure; for example, in economics, assume a series is generated as the sum of trend, a seasonal effect, and error. The state-space model that results is then treated by making judicious use of the celebrated Kalman filters and smoothers, developed originally for estimation and control in space applications. Two relatively complete presentations from this point of view are in Harvey (1991) and Kitagawa and Gersch (1996). Time series regression is introduced in Chapter 2, and ARIMA


and related time domain models are studied in Chapter 3, with the emphasis on classical, statistical, univariate linear regression. Special topics on time domain analysis are covered in Chapter 5; these topics include modern treatments of, for example, time series with long memory and GARCH models for the analysis of volatility. The state-space model, Kalman filtering and smoothing, and related topics are developed in Chapter 6.

Conversely, the frequency domain approach assumes the primary characteristics of interest in time series analyses relate to periodic or systematic sinusoidal variations found naturally in most data. These periodic variations are often caused by biological, physical, or environmental phenomena of interest. A series of periodic shocks may influence certain areas of the brain; wind may affect vibrations on an airplane wing; sea surface temperatures caused by El Niño oscillations may affect the number of fish in the ocean. The study of periodicity extends to economics and social sciences, where one may be interested in yearly periodicities in such series as monthly unemployment or monthly birth rates.

In spectral analysis, the partition of the various kinds of periodic variation in a time series is accomplished by evaluating separately the variance associated with each periodicity of interest. This variance profile over frequency is called the power spectrum. In our view, no schism divides time domain and frequency domain methodology, although cliques are often formed that advocate primarily one or the other of the approaches to analyzing data. In many cases, the two approaches may produce similar answers for long series, but the comparative performance over short samples is better done in the time domain. In some cases, the frequency domain formulation simply provides a convenient means for carrying out what is conceptually a time domain calculation. Hopefully, this book will demonstrate that the best path to analyzing many data sets is to use the two approaches in a complementary fashion. Expositions emphasizing primarily the frequency domain approach can be found in Bloomfield (1976, 2000), Priestley (1981), or Jenkins and Watts (1968). On a more advanced level, Hannan (1970), Brillinger (1981, 2001), Brockwell and Davis (1991), and Fuller (1996) are available as theoretical sources. Our coverage of the frequency domain is given in Chapters 4 and 7.

The objective of this book is to provide a unified and reasonably complete exposition of statistical methods used in time series analysis, giving serious consideration to both the time and frequency domain approaches. Because a myriad of possible methods for analyzing any particular experimental series can exist, we have integrated real data from a number of subject fields into the exposition and have suggested methods for analyzing these data.

1.2 The Nature of Time Series Data

Some of the problems and questions of interest to the prospective time series analyst can best be exposed by considering real experimental data taken from different subject areas. The following cases illustrate some of the common kinds of experimental time series data as well as some of the statistical questions that might be asked about such data.

Fig. 1.1. Johnson & Johnson quarterly earnings per share, 84 quarters, 1960-I to 1980-IV.

Example 1.1 Johnson & Johnson Quarterly Earnings

Figure 1.1 shows quarterly earnings per share for the U.S. company Johnson & Johnson, furnished by Professor Paul Griffin (personal communication) of the Graduate School of Management, University of California, Davis. There are 84 quarters (21 years) measured from the first quarter of 1960 to the last quarter of 1980. Modeling such series begins by observing the primary patterns in the time history. In this case, note the gradually increasing underlying trend and the rather regular variation superimposed on the trend that seems to repeat over quarters. Methods for analyzing data such as these are explored in Chapter 2 (see Problem 2.1) using regression techniques and in Chapter 6, §6.5, using structural equation modeling.

To plot the data using the R statistical package, type the following:¹

load("tsa3.rda")    # SEE THE FOOTNOTE
plot(jj, type="o", ylab="Quarterly Earnings per Share")

¹ We assume that tsa3.rda has been downloaded to a convenient directory. See Appendix R for further details.

Example 1.2 Global Warming

Consider the global temperature series record shown in Figure 1.2. The data are the global mean land–ocean temperature index from 1880 to 2009, with the base period 1951-1980. In particular, the data are deviations, measured in degrees centigrade, from the 1951-1980 average, and are an update of Hansen et al. (2006). We note an apparent upward trend in the series during the latter part of the twentieth century that has been used as an argument for the global warming hypothesis. Note also the leveling off at about 1935 and then another rather sharp upward trend at about 1970. The question of interest for global warming proponents and opponents is whether the overall trend is natural or whether it is caused by some human-induced interference. Problem 2.8 examines 634 years of glacial sediment data that might be taken as a long-term temperature proxy. Such percentage changes in temperature do not seem to be unusual over a time period of 100 years. Again, the question of trend is of more interest than particular periodicities.

Fig. 1.2. Yearly average global temperature deviations (1880–2009) in degrees centigrade.

The R code for this example is similar to the code in Example 1.1:

plot(gtemp, type="o", ylab="Global Temperature Deviations")
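As an informal preview of the regression methods taken up in Chapter 2, one can also superimpose a fitted straight line to summarize the apparent upward trend. A minimal sketch, not from the text, assuming gtemp has been loaded from tsa3.rda:

fit = lm(gtemp ~ time(gtemp))   # least squares straight-line trend (degrees per year)
plot(gtemp, type="o", ylab="Global Temperature Deviations")
abline(fit)                     # overlay the fitted trend line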

Example 1.3 Speech Data

More involved questions develop in applications to the physical sciences. Figure 1.3 shows a small .1 second (1000 point) sample of recorded speech for the phrase aaa···hhh, and we note the repetitive nature of the signal and the rather regular periodicities. One current problem of great interest is computer recognition of speech, which would require converting this particular signal into the recorded phrase aaa···hhh. Spectral analysis can be used in this context to produce a signature of this phrase that can be compared with signatures of various library syllables to look for a match.

Fig. 1.3. Speech recording of the syllable aaa···hhh sampled at 10,000 points per second with n = 1020 points.

One can immediately notice the rather regular repetition of small wavelets. The separation between the packets is known as the pitch period and represents the response of the vocal tract filter to a periodic sequence of pulses stimulated by the opening and closing of the glottis.

In R, you can reproduce Figure 1.3 as follows:

plot(speech)

Example 1.4 New York Stock Exchange

As an example of financial time series data, Figure 1.4 shows the daily returns (or percent change) of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. It is easy to spot the crash of October 19, 1987 in the figure. The data shown in Figure 1.4 are typical of return data. The mean of the series appears to be stable with an average return of approximately zero; however, the volatility (or variability) of the data changes over time. In fact, the data show volatility clustering; that is, highly volatile periods tend to be clustered together. A problem in the analysis of these types of financial data is to forecast the volatility of future returns. Models such as ARCH and GARCH models (Engle, 1982; Bollerslev, 1986) and stochastic volatility models (Harvey, Ruiz and Shephard, 1994) have been developed to handle these problems. We will discuss these models and the analysis of financial data in Chapters 5 and 6. The R code for this example is similar to the previous examples:

plot(nyse, ylab="NYSE Returns")
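A quick way to make the volatility clustering more visible is to plot the squared returns, whose bursts of large values mark the volatile stretches. A minimal sketch, not from the text, assuming nyse is loaded:

plot(nyse^2, ylab="Squared NYSE Returns")   # large values cluster around volatile periods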

Fig. 1.4. Returns of the NYSE. The data are daily value weighted market returns from February 2, 1984 to December 31, 1991 (2000 trading days). The crash of October 19, 1987 occurs at t = 938.

Example 1.5 El Niño and Fish Population

We may also be interested in analyzing several time series at once. Figure 1.5 shows monthly values of an environmental series called the Southern Oscillation Index (SOI) and associated Recruitment (number of new fish) furnished by Dr. Roy Mendelssohn of the Pacific Environmental Fisheries Group (personal communication). Both series are for a period of 453 months ranging over the years 1950–1987. The SOI measures changes in air pressure, related to sea surface temperatures in the central Pacific Ocean. The central Pacific warms every three to seven years due to the El Niño effect, which has been blamed, in particular, for the 1997 floods in the midwestern portions of the United States. Both series in Figure 1.5 tend to exhibit repetitive behavior, with regularly repeating cycles that are easily visible. This periodic behavior is of interest because underlying processes of interest may be regular and the rate or frequency of oscillation characterizing the behavior of the underlying series would help to identify them. One can also remark that the cycles of the SOI are repeating at a faster rate than those of the Recruitment series. The Recruitment series also shows several kinds of oscillations, a faster frequency that seems to repeat about every 12 months and a slower frequency that seems to repeat about every 50 months. The study of the kinds of cycles and their strengths is the subject of Chapter 4. The two series also tend to be somewhat related; it is easy to imagine that somehow the fish population is dependent on the SOI. Perhaps even a lagged relation exists, with the SOI signaling changes in the fish population. This possibility

Fig. 1.5. Monthly SOI and Recruitment (estimated new fish), 1950-1987.

suggests trying some version of regression analysis as a procedure for relating the two series. Transfer function modeling, as considered in Chapter 5, can be applied in this case to obtain a model relating Recruitment to its own past and the past values of the SOI.

The following R code will reproduce Figure 1.5:

par(mfrow = c(2,1))  # set up the graphics
plot(soi, ylab="", xlab="", main="Southern Oscillation Index")
plot(rec, ylab="", xlab="", main="Recruitment")
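For an informal look at the possible lagged relation between the two series before the formal treatment in later chapters, one can examine the sample cross-correlation function. A minimal sketch, not from the text, assuming soi and rec are loaded:

ccf(soi, rec, 50)   # sample cross-correlations between SOI and Recruitment out to lag 50 months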

Example 1.6 fMRI Imaging

A fundamental problem in classical statistics occurs when we are given a collection of independent series or vectors of series, generated under varying experimental conditions or treatment configurations. Such a set of series is shown in Figure 1.6, where we observe data collected from various locations in the brain via functional magnetic resonance imaging (fMRI). In this example, five subjects were given periodic brushing on the hand. The stimulus was applied for 32 seconds and then stopped for 32 seconds; thus, the signal period is 64 seconds. The sampling rate was one observation every 2 seconds for 256 seconds (n = 128). For this example, we averaged the results over subjects (these were evoked responses, and all subjects were in phase). The

Fig. 1.6. fMRI data from various locations in the cortex, thalamus, and cerebellum; n = 128 points, one observation taken every 2 seconds.

series shown in Figure 1.6 are consecutive measures of blood oxygenation-level dependent (BOLD) signal intensity, which measures areas of activation in the brain. Notice that the periodicities appear strongly in the motor cortex series and less strongly in the thalamus and cerebellum. The fact that one has series from different areas of the brain suggests testing whether the areas are responding differently to the brush stimulus. Analysis of variance techniques accomplish this in classical statistics, and we show in Chapter 7 how these classical techniques extend to the time series case, leading to a spectral analysis of variance.

The following R commands were used to plot the data:

par(mfrow=c(2,1), mar=c(3,2,1,0)+.5, mgp=c(1.6,.6,0))
ts.plot(fmri1[,2:5], lty=c(1,2,4,5), ylab="BOLD", xlab="", main="Cortex")
ts.plot(fmri1[,6:9], lty=c(1,2,4,5), ylab="BOLD", xlab="", main="Thalamus & Cerebellum")
mtext("Time (1 pt = 2 sec)", side=1, line=2)

Example 1.7 Earthquakes and Explosions

As a final example, the series in Figure 1.7 represent two phases or arrivals along the surface, denoted by P (t = 1, . . . , 1024) and S (t = 1025, . . . , 2048),

Fig. 1.7. Arrival phases from an earthquake (top) and explosion (bottom) at 40 points per second.

at a seismic recording station. The recording instruments in Scandinavia are observing earthquakes and mining explosions with one of each shown in Figure 1.7. The general problem of interest is in distinguishing or discriminating between waveforms generated by earthquakes and those generated by explosions. Features that may be important are the rough amplitude ratios of the first phase P to the second phase S, which tend to be smaller for earthquakes than for explosions. In the case of the two events in Figure 1.7, the ratio of maximum amplitudes appears to be somewhat less than .5 for the earthquake and about 1 for the explosion. Otherwise, note a subtle difference exists in the periodic nature of the S phase for the earthquake. We can again think about spectral analysis of variance for testing the equality of the periodic components of earthquakes and explosions. We would also like to be able to classify future P and S components from events of unknown origin, leading to the time series discriminant analysis developed in Chapter 7.

To plot the data as in this example, use the following commands in R:

par(mfrow=c(2,1))
plot(EQ5, main="Earthquake")
plot(EXP6, main="Explosion")
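The rough amplitude ratios mentioned above can be checked numerically by comparing the maximum absolute amplitudes of the two phases. A minimal sketch, not from the text, assuming EQ5 and EXP6 are loaded and that the P and S phases occupy t = 1–1024 and t = 1025–2048, respectively:

max(abs(EQ5[1:1024])) / max(abs(EQ5[1025:2048]))    # P/S maximum amplitude ratio, earthquake
max(abs(EXP6[1:1024])) / max(abs(EXP6[1025:2048]))  # P/S maximum amplitude ratio, explosion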


1.3 Time Series Statistical Models

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions for sample data, like that encountered in the previous section. In order to provide a statistical setting for describing the character of data that seemingly fluctuate in a random fashion over time, we assume a time series can be defined as a collection of random variables indexed according to the order they are obtained in time. For example, we may consider a time series as a sequence of random variables, x_1, x_2, x_3, . . . , where the random variable x_1 denotes the value taken by the series at the first time point, the variable x_2 denotes the value for the second time period, x_3 denotes the value for the third time period, and so on. In general, a collection of random variables, {x_t}, indexed by t is referred to as a stochastic process. In this text, t will typically be discrete and vary over the integers t = 0, ±1, ±2, . . . , or some subset of the integers. The observed values of a stochastic process are referred to as a realization of the stochastic process. Because it will be clear from the context of our discussions, we use the term time series whether we are referring generically to the process or to a particular realization and make no notational distinction between the two concepts.

It is conventional to display a sample time series graphically by plotting the values of the random variables on the vertical axis, or ordinate, with the time scale as the abscissa. It is usually convenient to connect the values at adjacent time periods to reconstruct visually some original hypothetical continuous time series that might have produced these values as a discrete sample. Many of the series discussed in the previous section, for example, could have been observed at any continuous point in time and are conceptually more properly treated as continuous time series. The approximation of these series by discrete time parameter series sampled at equally spaced points in time is simply an acknowledgment that sampled data will, for the most part, be discrete because of restrictions inherent in the method of collection. Furthermore, the analysis techniques are then feasible using computers, which are limited to digital computations. Theoretical developments also rest on the idea that a continuous parameter time series should be specified in terms of finite-dimensional distribution functions defined over a finite number of points in time. This is not to say that the selection of the sampling interval or rate is not an extremely important consideration. The appearance of data can be changed completely by adopting an insufficient sampling rate. We have all seen wagon wheels in movies appear to be turning backwards because of the insufficient number of frames sampled by the camera. This phenomenon leads to a distortion called aliasing (see §4.2).
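As a small illustration of aliasing, not from the text, the following R sketch samples a cosine of frequency .8 cycles per unit time only once per unit time; the sampled points are exactly those of a cosine with the alias frequency .2:

t.fine = seq(0, 24, by=.01)                   # nearly continuous time
t.samp = 0:24                                 # one sample per unit time
plot(t.fine, cos(2*pi*.8*t.fine), type="l", xlab="time", ylab="")
points(t.samp, cos(2*pi*.8*t.samp), pch=19)   # the sampled values
lines(t.fine, cos(2*pi*.2*t.fine), lty=2)     # the alias passes through the same sampled points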

The fundamental visual characteristic distinguishing the different series shown in Examples 1.1–1.7 is their differing degrees of smoothness. One possible explanation for this smoothness is that it is being induced by the supposition that adjacent points in time are correlated, so the value of the series at time t, say, x_t, depends in some way on the past values x_{t−1}, x_{t−2}, . . .. This model expresses a fundamental way in which we might think about generating realistic-looking time series. To begin to develop an approach to using collections of random variables to model time series, consider Example 1.8.

Example 1.8 White Noise

A simple kind of generated series might be a collection of uncorrelated random variables, w_t, with mean 0 and finite variance σ_w^2. The time series generated from uncorrelated variables is used as a model for noise in engineering applications, where it is called white noise; we shall sometimes denote this process as w_t ∼ wn(0, σ_w^2). The designation white originates from the analogy with white light and indicates that all possible periodic oscillations are present with equal strength.

We will, at times, also require the noise to be independent and identically distributed (iid) random variables with mean 0 and variance σ_w^2. We shall distinguish this case by saying white independent noise, or by writing w_t ∼ iid(0, σ_w^2). A particularly useful white noise series is Gaussian white noise, wherein the w_t are independent normal random variables, with mean 0 and variance σ_w^2; or more succinctly, w_t ∼ iid N(0, σ_w^2). Figure 1.8 shows in the upper panel a collection of 500 such random variables, with σ_w^2 = 1, plotted in the order in which they were drawn. The resulting series bears a slight resemblance to the explosion in Figure 1.7 but is not smooth enough to serve as a plausible model for any of the other experimental series. The plot tends to show visually a mixture of many different kinds of oscillations in the white noise series.

If the stochastic behavior of all time series could be explained in terms of the white noise model, classical statistical methods would suffice. Two ways of introducing serial correlation and more smoothness into time series models are given in Examples 1.9 and 1.10.

Example 1.9 Moving Averages

We might replace the white noise series w_t by a moving average that smooths the series. For example, consider replacing w_t in Example 1.8 by an average of its current value and its immediate neighbors in the past and future. That is, let

v_t = (1/3)(w_{t−1} + w_t + w_{t+1}),    (1.1)

which leads to the series shown in the lower panel of Figure 1.8. Inspecting the series shows a smoother version of the first series, reflecting the fact that the slower oscillations are more apparent and some of the faster oscillations are taken out. We begin to notice a similarity to the SOI in Figure 1.5, or perhaps, to some of the fMRI series in Figure 1.6.

To reproduce Figure 1.8 in R use the following commands. A linear combination of values in a time series such as in (1.1) is referred to, generically, as a filtered series; hence the command filter.

Fig. 1.8. Gaussian white noise series (top) and three-point moving average of the Gaussian white noise series (bottom).

w = rnorm(500,0,1)                   # 500 N(0,1) variates
v = filter(w, sides=2, rep(1/3,3))   # moving average
par(mfrow=c(2,1))
plot.ts(w, main="white noise")
plot.ts(v, main="moving average")

The speech series in Figure 1.3 and the Recruitment series in Figure 1.5, as well as some of the fMRI series in Figure 1.6, differ from the moving average series because one particular kind of oscillatory behavior seems to predominate, producing a sinusoidal type of behavior. A number of methods exist for generating series with this quasi-periodic behavior; we illustrate a popular one based on the autoregressive model considered in Chapter 3.

Example 1.10 Autoregressions

Suppose we consider the white noise series w_t of Example 1.8 as input and calculate the output using the second-order equation

x_t = x_{t−1} − .9 x_{t−2} + w_t    (1.2)

successively for t = 1, 2, . . . , 500. Equation (1.2) represents a regression or prediction of the current value x_t of a time series as a function of the past two values of the series, and, hence, the term autoregression is suggested

Fig. 1.9. Autoregressive series generated from model (1.2).

for this model. A problem with startup values exists here because (1.2) also depends on the initial conditions x_0 and x_{−1}, but, for now, we assume that we are given these values and generate the succeeding values by substituting into (1.2). The resulting output series is shown in Figure 1.9, and we note the periodic behavior of the series, which is similar to that displayed by the speech series in Figure 1.3. The autoregressive model above and its generalizations can be used as an underlying model for many observed series and will be studied in detail in Chapter 3.

One way to simulate and plot data from the model (1.2) in R is to use the following commands (another way is to use arima.sim).

w = rnorm(550,0,1)   # 50 extra to avoid startup problems
x = filter(w, filter=c(1,-.9), method="recursive")[-(1:50)]
plot.ts(x, main="autoregression")
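The alternative mentioned above, arima.sim, simulates directly from the stationary AR(2) model in (1.2) and handles the startup transient internally; a minimal sketch, not from the text:

x2 = arima.sim(list(ar = c(1, -.9)), n = 500)    # x_t = x_{t-1} - .9 x_{t-2} + w_t
plot.ts(x2, main = "autoregression via arima.sim")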

Example 1.11 Random Walk with Drift

A model for analyzing trend such as seen in the global temperature data in Figure 1.2 is the random walk with drift model given by

x_t = δ + x_{t−1} + w_t    (1.3)

for t = 1, 2, . . ., with initial condition x_0 = 0, and where w_t is white noise. The constant δ is called the drift, and when δ = 0, (1.3) is called simply a random walk. The term random walk comes from the fact that, when δ = 0, the value of the time series at time t is the value of the series at time t − 1 plus a completely random movement determined by w_t. Note that we may rewrite (1.3) as a cumulative sum of white noise variates. That is,

x_t = δt + Σ_{j=1}^{t} w_j    (1.4)

Fig. 1.10. Random walk, σ_w = 1, with drift δ = .2 (upper jagged line), without drift, δ = 0 (lower jagged line), and a straight line with slope .2 (dashed line).

for t = 1, 2, . . .; either use induction, or plug (1.4) into (1.3) to verify this statement. Figure 1.10 shows 200 observations generated from the model with δ = 0 and .2, and with σ_w = 1. For comparison, we also superimposed the straight line .2t on the graph.
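For instance, the substitution check amounts to one line: if (1.4) holds at time t − 1, then (1.3) gives

x_t = δ + x_{t−1} + w_t = δ + [δ(t − 1) + Σ_{j=1}^{t−1} w_j] + w_t = δt + Σ_{j=1}^{t} w_j,

which is (1.4) at time t; with x_0 = 0 the case t = 1 is immediate.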

To reproduce Figure 1.10 in R use the following code (notice the use of multiple commands per line using a semicolon).

set.seed(154)                        # so you can reproduce the results
w = rnorm(200,0,1); x = cumsum(w)    # two commands in one line
wd = w +.2; xd = cumsum(wd)
plot.ts(xd, ylim=c(-5,55), main="random walk")
lines(x); lines(.2*(1:200), lty="dashed")

Example 1.12 Signal in Noise

Many realistic models for generating time series assume an underlying signal with some consistent periodic variation, contaminated by adding a random noise. For example, it is easy to detect the regular cycle in the fMRI series displayed on the top of Figure 1.6. Consider the model

x_t = 2 cos(2πt/50 + .6π) + w_t    (1.5)

for t = 1, 2, . . . , 500, where the first term is regarded as the signal, shown in the upper panel of Figure 1.11. We note that a sinusoidal waveform can be written as

A cos(2πωt + φ),    (1.6)

where A is the amplitude, ω is the frequency of oscillation, and φ is a phase shift. In (1.5), A = 2, ω = 1/50 (one cycle every 50 time points), and φ = .6π.

Fig. 1.11. Cosine wave with period 50 points (top panel) compared with the cosine wave contaminated with additive white Gaussian noise, σ_w = 1 (middle panel) and σ_w = 5 (bottom panel); see (1.5).

An additive noise term was taken to be white noise with σ_w = 1 (middle panel) and σ_w = 5 (bottom panel), drawn from a normal distribution. Adding the two together obscures the signal, as shown in the lower panels of Figure 1.11. Of course, the degree to which the signal is obscured depends on the amplitude of the signal and the size of σ_w. The ratio of the amplitude of the signal to σ_w (or some function of the ratio) is sometimes called the signal-to-noise ratio (SNR); the larger the SNR, the easier it is to detect the signal. Note that the signal is easily discernible in the middle panel of Figure 1.11, whereas the signal is obscured in the bottom panel. Typically, we will not observe the signal but the signal obscured by noise.

To reproduce Figure 1.11 in R, use the following commands:

cs = 2*cos(2*pi*1:500/50 + .6*pi)
w = rnorm(500,0,1)
par(mfrow=c(3,1), mar=c(3,2,2,1), cex.main=1.5)
plot.ts(cs, main=expression(2*cos(2*pi*t/50+.6*pi)))
plot.ts(cs+w, main=expression(2*cos(2*pi*t/50+.6*pi) + N(0,1)))
plot.ts(cs+5*w, main=expression(2*cos(2*pi*t/50+.6*pi) + N(0,25)))

In Chapter 4, we will study the use of spectral analysis as a possible technique for detecting regular or periodic signals, such as the one described in Example 1.12. In general, we would emphasize the importance of simple additive models such as given above in the form

x_t = s_t + v_t,    (1.7)

where s_t denotes some unknown signal and v_t denotes a time series that may be white or correlated over time. The problems of detecting a signal and then in estimating or extracting the waveform of s_t are of great interest in many areas of engineering and the physical and biological sciences. In economics, the underlying signal may be a trend or it may be a seasonal component of a series. Models such as (1.7), where the signal has an autoregressive structure, form the motivation for the state-space model of Chapter 6.

In the above examples, we have tried to motivate the use of various combinations of random variables emulating real time series data. Smoothness characteristics of observed time series were introduced by combining the random variables in various ways. Averaging independent random variables over adjacent time points, as in Example 1.9, or looking at the output of difference equations that respond to white noise inputs, as in Example 1.10, are common ways of generating correlated data. In the next section, we introduce various theoretical measures used for describing how time series behave. As is usual in statistics, the complete description involves the multivariate distribution function of the jointly sampled values x_1, x_2, . . . , x_n, whereas more economical descriptions can be had in terms of the mean and autocorrelation functions. Because correlation is an essential feature of time series analysis, the most useful descriptive measures are those expressed in terms of covariance and correlation functions.

1.4 Measures of Dependence: Autocorrelation and Cross-Correlation

A complete description of a time series, observed as a collection of n random variables at arbitrary integer time points t_1, t_2, . . . , t_n, for any positive integer n, is provided by the joint distribution function, evaluated as the probability that the values of the series are jointly less than the n constants, c_1, c_2, . . . , c_n; i.e.,

F(c_1, c_2, . . . , c_n) = P(x_{t_1} ≤ c_1, x_{t_2} ≤ c_2, . . . , x_{t_n} ≤ c_n).    (1.8)

Unfortunately, the multidimensional distribution function cannot usually be written easily unless the random variables are jointly normal, in which case the joint density has the well-known form displayed in (1.31).

Although the joint distribution function describes the data completely, itis an unwieldy tool for displaying and analyzing time series data. The dis-tribution function (1.8) must be evaluated as a function of n arguments, soany plotting of the corresponding multivariate density functions is virtuallyimpossible. The marginal distribution functions


Ft(x) = P{xt ≤ x}

or the corresponding marginal density functions

ft(x) = ∂Ft(x)/∂x,

when they exist, are often informative for examining the marginal behavior of a series.2 Another informative marginal descriptive measure is the mean function.

Definition 1.1 The mean function is defined as

µxt = E(xt) = ∫_{−∞}^{∞} x ft(x) dx, (1.9)

provided it exists, where E denotes the usual expected value operator. When no confusion exists about which time series we are referring to, we will drop a subscript and write µxt as µt.

Example 1.13 Mean Function of a Moving Average Series

If wt denotes a white noise series, then µwt = E(wt) = 0 for all t. The top series in Figure 1.8 reflects this, as the series clearly fluctuates around a mean value of zero. Smoothing the series as in Example 1.9 does not change the mean because we can write

µvt = E(vt) = (1/3)[E(wt−1) + E(wt) + E(wt+1)] = 0.

Example 1.14 Mean Function of a Random Walk with Drift

Consider the random walk with drift model given in (1.4),

xt = δt + ∑_{j=1}^{t} wj ,    t = 1, 2, . . . .

Because E(wt) = 0 for all t, and δ is a constant, we have

µxt = E(xt) = δt + ∑_{j=1}^{t} E(wj) = δt,

which is a straight line with slope δ. A realization of a random walk with drift can be compared to its mean function in Figure 1.10.

2 If xt is Gaussian with mean µt and variance σt², abbreviated as xt ∼ N(µt, σt²), the marginal density is given by ft(x) = (1/(σt√(2π))) exp{−(x − µt)²/(2σt²)}.


Example 1.15 Mean Function of Signal Plus Noise

A great many practical applications depend on assuming the observed data have been generated by a fixed signal waveform superimposed on a zero-mean noise process, leading to an additive signal model of the form (1.5). It is clear, because the signal in (1.5) is a fixed function of time, we will have

µxt = E(xt) = E[2 cos(2πt/50 + .6π) + wt] = 2 cos(2πt/50 + .6π) + E(wt) = 2 cos(2πt/50 + .6π),

and the mean function is just the cosine wave.

The lack of independence between two adjacent values xs and xt can be assessed numerically, as in classical statistics, using the notions of covariance and correlation. Assuming the variance of xt is finite, we have the following definition.

Definition 1.2 The autocovariance function is defined as the second moment product

γx(s, t) = cov(xs, xt) = E[(xs − µs)(xt − µt)], (1.10)

for all s and t. When no possible confusion exists about which time series we are referring to, we will drop the subscript and write γx(s, t) as γ(s, t).

Note that γx(s, t) = γx(t, s) for all time points s and t. The autocovariance measures the linear dependence between two points on the same series observed at different times. Very smooth series exhibit autocovariance functions that stay large even when the t and s are far apart, whereas choppy series tend to have autocovariance functions that are nearly zero for large separations. The autocovariance (1.10) is the average cross-product relative to the joint distribution F(xs, xt). Recall from classical statistics that if γx(s, t) = 0, xs and xt are not linearly related, but there still may be some dependence structure between them. If, however, xs and xt are bivariate normal, γx(s, t) = 0 ensures their independence. It is clear that, for s = t, the autocovariance reduces to the (assumed finite) variance, because

γx(t, t) = E[(xt − µt)2] = var(xt). (1.11)

Example 1.16 Autocovariance of White Noise

The white noise series wt has E(wt) = 0 and

γw(s, t) = cov(ws, wt) = { σw²   s = t,
                           0     s ≠ t.       (1.12)

A realization of white noise with σw² = 1 is shown in the top panel of Figure 1.8.


Example 1.17 Autocovariance of a Moving Average

Consider applying a three-point moving average to the white noise series wt of the previous example as in Example 1.9. In this case,

γv(s, t) = cov(vs, vt) = cov{ (1/3)(ws−1 + ws + ws+1), (1/3)(wt−1 + wt + wt+1) }.

When s = t we have3

γv(t, t) = (1/9) cov{(wt−1 + wt + wt+1), (wt−1 + wt + wt+1)}
         = (1/9)[cov(wt−1, wt−1) + cov(wt, wt) + cov(wt+1, wt+1)]
         = (3/9)σw².

When s = t+ 1,

γv(t + 1, t) = (1/9) cov{(wt + wt+1 + wt+2), (wt−1 + wt + wt+1)}
             = (1/9)[cov(wt, wt) + cov(wt+1, wt+1)]
             = (2/9)σw²,

using (1.12). Similar computations give γv(t − 1, t) = 2σw²/9, γv(t + 2, t) = γv(t − 2, t) = σw²/9, and 0 when |t − s| > 2. We summarize the values for all s and t as

γv(s, t) = { (3/9)σw²   s = t,
             (2/9)σw²   |s − t| = 1,
             (1/9)σw²   |s − t| = 2,
             0          |s − t| > 2.      (1.13)

Example 1.17 shows clearly that the smoothing operation introduces a covariance function that decreases as the separation between the two time points increases and disappears completely when the time points are separated by three or more time points. This particular autocovariance is interesting because it only depends on the time separation or lag and not on the absolute location of the points along the series. We shall see later that this dependence suggests a mathematical model for the concept of weak stationarity.
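As a quick numerical check of (1.13), one can simulate a long white noise series, smooth it as in Example 1.9, and compare the sample autocovariances at small lags with 3/9, 2/9, and 1/9 (taking σw² = 1). The following lines are a minimal sketch of ours (not part of the original example), using only base R:

set.seed(1)
w = rnorm(100000)                      # white noise with unit variance
v = filter(w, rep(1/3, 3), sides=2)    # three-point moving average
v = v[!is.na(v)]                       # drop the endpoints lost to the filter
acf(v, lag.max=4, type="covariance", plot=FALSE)  # values near 3/9, 2/9, 1/9, 0, 0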

Example 1.18 Autocovariance of a Random Walk

For the random walk model, xt = ∑_{j=1}^{t} wj, we have

γx(s, t) = cov(xs, xt) = cov( ∑_{j=1}^{s} wj , ∑_{k=1}^{t} wk ) = min{s, t} σw²,

because the wt are uncorrelated random variables. Note that, as opposed to the previous examples, the autocovariance function of a random walk depends on the particular time values s and t, and not on the time separation or lag. Also, notice that the variance of the random walk, var(xt) = γx(t, t) = t σw², increases without bound as time t increases. The effect of this variance increase can be seen in Figure 1.10 where the processes start to move away from their mean functions δt (note that δ = 0 and .2 in that example).

3 If the random variables U = ∑_{j=1}^{m} aj Xj and V = ∑_{k=1}^{r} bk Yk are linear combinations of random variables {Xj} and {Yk}, respectively, then cov(U, V) = ∑_{j=1}^{m} ∑_{k=1}^{r} aj bk cov(Xj, Yk). Furthermore, var(U) = cov(U, U).
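The linear growth of var(xt) = t σw² is also easy to see by simulation. The sketch below (ours, not from the text) generates many independent random walks with σw = 1 and compares the sample variance across replications at a few time points with t:

set.seed(154)
nrep = 1000; n = 200
x = apply(matrix(rnorm(nrep*n), nrep, n), 1, cumsum)  # each column is a random walk
t.check = c(50, 100, 200)
rbind(empirical = apply(x[t.check, ], 1, var), theoretical = t.check)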

As in classical statistics, it is more convenient to deal with a measure of association between −1 and 1, and this leads to the following definition.

Definition 1.3 The autocorrelation function (ACF) is defined as

ρ(s, t) = γ(s, t) / √(γ(s, s)γ(t, t)). (1.14)

The ACF measures the linear predictability of the series at time t, say xt, using only the value xs. We can show easily that −1 ≤ ρ(s, t) ≤ 1 using the Cauchy–Schwarz inequality.4 If we can predict xt perfectly from xs through a linear relationship, xt = β0 + β1xs, then the correlation will be +1 when β1 > 0, and −1 when β1 < 0. Hence, we have a rough measure of the ability to forecast the series at time t from the value at time s.

Often, we would like to measure the predictability of another series yt from the series xs. Assuming both series have finite variances, we have the following definition.

Definition 1.4 The cross-covariance function between two series, xt and yt, is

γxy(s, t) = cov(xs, yt) = E[(xs − µxs)(yt − µyt)]. (1.15)

There is also a scaled version of the cross-covariance function.

Definition 1.5 The cross-correlation function (CCF) is given by

ρxy(s, t) = γxy(s, t) / √(γx(s, s)γy(t, t)). (1.16)

We may easily extend the above ideas to the case of more than two series, say, xt1, xt2, . . . , xtr; that is, multivariate time series with r components. For example, the extension of (1.10) in this case is

γjk(s, t) = E[(xsj − µsj)(xtk − µtk)] j, k = 1, 2, . . . , r. (1.17)

In the definitions above, the autocovariance and cross-covariance functions may change as one moves along the series because the values depend on both s and t, the locations of the points in time. In Example 1.17, the autocovariance function depends on the separation of xs and xt, say, h = |s − t|, and not on where the points are located in time. As long as the points are separated by h units, the location of the two points does not matter. This notion, called weak stationarity, when the mean is constant, is fundamental in allowing us to analyze sample time series data when only a single series is available.

4 The Cauchy–Schwarz inequality implies |γ(s, t)|² ≤ γ(s, s)γ(t, t).

1.5 Stationary Time Series

The preceding definitions of the mean and autocovariance functions are completely general. Although we have not made any special assumptions about the behavior of the time series, many of the preceding examples have hinted that a sort of regularity may exist over time in the behavior of a time series. We introduce the notion of regularity using a concept called stationarity.

Definition 1.6 A strictly stationary time series is one for which the probabilistic behavior of every collection of values

{xt1 , xt2 , . . . , xtk}

is identical to that of the time shifted set

{xt1+h, xt2+h, . . . , xtk+h}.

That is,

P{xt1 ≤ c1, . . . , xtk ≤ ck} = P{xt1+h ≤ c1, . . . , xtk+h ≤ ck} (1.18)

for all k = 1, 2, ..., all time points t1, t2, . . . , tk, all numbers c1, c2, . . . , ck, and all time shifts h = 0, ±1, ±2, ... .

If a time series is strictly stationary, then all of the multivariate distribution functions for subsets of variables must agree with their counterparts in the shifted set for all values of the shift parameter h. For example, when k = 1, (1.18) implies that

P{xs ≤ c} = P{xt ≤ c} (1.19)

for any time points s and t. This statement implies, for example, that the probability that the value of a time series sampled hourly is negative at 1am is the same as at 10am. In addition, if the mean function, µt, of the series xt exists, (1.19) implies that µs = µt for all s and t, and hence µt must be constant. Note, for example, that a random walk process with drift is not strictly stationary because its mean function changes with time; see Example 1.14 on page 18.

When k = 2, we can write (1.18) as


P{xs ≤ c1, xt ≤ c2} = P{xs+h ≤ c1, xt+h ≤ c2} (1.20)

for any time points s and t and shift h. Thus, if the variance function of the process exists, (1.20) implies that the autocovariance function of the series xt satisfies

γ(s, t) = γ(s + h, t + h)

for all s and t and h. We may interpret this result by saying the autocovariance function of the process depends only on the time difference between s and t, and not on the actual times.

The version of stationarity in Definition 1.6 is too strong for most applications. Moreover, it is difficult to assess strict stationarity from a single data set. Rather than imposing conditions on all possible distributions of a time series, we will use a milder version that imposes conditions only on the first two moments of the series. We now have the following definition.

Definition 1.7 A weakly stationary time series, xt, is a finite variance process such that

(i) the mean value function, µt, defined in (1.9) is constant and does not depend on time t, and

(ii) the autocovariance function, γ(s, t), defined in (1.10) depends on s and t only through their difference |s − t|.

Henceforth, we will use the term stationary to mean weakly stationary; if a process is stationary in the strict sense, we will use the term strictly stationary.

It should be clear from the discussion of strict stationarity following Definition 1.6 that a strictly stationary, finite variance, time series is also stationary. The converse is not true unless there are further conditions. One important case where stationarity implies strict stationarity is if the time series is Gaussian [meaning all finite distributions, (1.18), of the series are Gaussian]. We will make this concept more precise at the end of this section.

Because the mean function, E(xt) = µt, of a stationary time series is independent of time t, we will write

µt = µ. (1.21)

Also, because the autocovariance function, γ(s, t), of a stationary time series, xt, depends on s and t only through their difference |s − t|, we may simplify the notation. Let s = t + h, where h represents the time shift or lag. Then

γ(t+ h, t) = cov(xt+h, xt) = cov(xh, x0) = γ(h, 0)

because the time difference between times t + h and t is the same as the time difference between times h and 0. Thus, the autocovariance function of a stationary time series does not depend on the time argument t. Henceforth, for convenience, we will drop the second argument of γ(h, 0).


Fig. 1.12. Autocovariance function of a three-point moving average.

Definition 1.8 The autocovariance function of a stationary time series will be written as

γ(h) = cov(xt+h, xt) = E[(xt+h − µ)(xt − µ)]. (1.22)

Definition 1.9 The autocorrelation function (ACF) of a stationary time series will be written using (1.14) as

ρ(h) = γ(t + h, t) / √(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0). (1.23)

The Cauchy–Schwarz inequality shows again that −1 ≤ ρ(h) ≤ 1 for all h, enabling one to assess the relative importance of a given autocorrelation value by comparing with the extreme values −1 and 1.

Example 1.19 Stationarity of White Noise

The mean and autocovariance functions of the white noise series discussed in Examples 1.8 and 1.16 are easily evaluated as µwt = 0 and

γw(h) = cov(wt+h, wt) = { σw²   h = 0,
                          0     h ≠ 0.

Thus, white noise satisfies the conditions of Definition 1.7 and is weakly stationary or stationary. If the white noise variates are also normally distributed or Gaussian, the series is also strictly stationary, as can be seen by evaluating (1.18) using the fact that the noise would also be iid.

Example 1.20 Stationarity of a Moving Average

The three-point moving average process of Example 1.9 is stationary because, from Examples 1.13 and 1.17, the mean and autocovariance functions µvt = 0, and


γv(h) = { (3/9)σw²   h = 0,
          (2/9)σw²   h = ±1,
          (1/9)σw²   h = ±2,
          0          |h| > 2

are independent of time t, satisfying the conditions of Definition 1.7. Figure 1.12 shows a plot of the autocovariance as a function of lag h with σw² = 1. Interestingly, the autocovariance function is symmetric about lag zero and decays as a function of lag.
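A plot in the spirit of Figure 1.12 can be produced directly from (1.13); the lines below (a small sketch of ours, with σw² = 1) evaluate γv(h) on a grid of lags and display it:

lags = -5:5
acov = ifelse(abs(lags) > 2, 0, (3 - abs(lags))/9)   # gamma_v(h) from (1.13)
plot(lags, acov, type="h", xlab="Lag", ylab="ACovF"); points(lags, acov)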

The autocovariance function of a stationary process has several useful properties (also, see Problem 1.25). First, the value at h = 0, namely

γ(0) = E[(xt − µ)²] (1.24)

is the variance of the time series; note that the Cauchy–Schwarz inequality implies

|γ(h)| ≤ γ(0).

A final useful property, noted in the previous example, is that the autocovariance function of a stationary series is symmetric around the origin; that is,

γ(h) = γ(−h) (1.25)

for all h. This property follows because shifting the series by h means that

γ(h) = γ(t + h − t)
     = E[(xt+h − µ)(xt − µ)]
     = E[(xt − µ)(xt+h − µ)]
     = γ(t − (t + h))
     = γ(−h),

which shows how to use the notation as well as proving the result.

When several series are available, a notion of stationarity still applies with additional conditions.

Definition 1.10 Two time series, say, xt and yt, are said to be jointly stationary if they are each stationary, and the cross-covariance function

γxy(h) = cov(xt+h, yt) = E[(xt+h − µx)(yt − µy)] (1.26)

is a function only of lag h.

Definition 1.11 The cross-correlation function (CCF) of jointly stationary time series xt and yt is defined as

ρxy(h) = γxy(h) / √(γx(0)γy(0)). (1.27)


Again, we have the result −1 ≤ ρxy(h) ≤ 1 which enables comparison with the extreme values −1 and 1 when looking at the relation between xt+h and yt. The cross-correlation function is not generally symmetric about zero [i.e., typically ρxy(h) ≠ ρxy(−h)]; however, it is the case that

ρxy(h) = ρyx(−h), (1.28)

which can be shown by manipulations similar to those used to show (1.25).

Example 1.21 Joint Stationarity

Consider the two series, xt and yt, formed from the sum and difference of two successive values of a white noise process, say,

xt = wt + wt−1

and

yt = wt − wt−1,

where wt are independent random variables with zero means and variance σw². It is easy to show that γx(0) = γy(0) = 2σw² and γx(1) = γx(−1) = σw², γy(1) = γy(−1) = −σw². Also,

γxy(1) = cov(xt+1, yt) = cov(wt+1 + wt, wt − wt−1) = σw²

because only one term is nonzero (recall footnote 3 on page 20). Similarly, γxy(0) = 0, γxy(−1) = −σw². We obtain, using (1.27),

ρxy(h) = {  0     h = 0,
            1/2   h = 1,
           −1/2   h = −1,
            0     |h| ≥ 2.

Clearly, the autocovariance and cross-covariance functions depend only on the lag separation, h, so the series are jointly stationary.
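These values are easy to confirm by simulation. The short sketch below (ours, not from the text) estimates the CCF from a long realization of the two series; the estimates at lags 0, 1, and −1 should be near 0, 1/2, and −1/2:

set.seed(90210)
w = rnorm(50000)
x = w[-1] + w[-length(w)]          # x_t = w_t + w_{t-1}
y = w[-1] - w[-length(w)]          # y_t = w_t - w_{t-1}
ccf(x, y, lag.max=3, plot=FALSE)   # cor(x_{t+h}, y_t) for h = -3,...,3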

Example 1.22 Prediction Using Cross-Correlation

As a simple example of cross-correlation, consider the problem of determining possible leading or lagging relations between two series xt and yt. If the model

yt = Axt−ℓ + wt

holds, the series xt is said to lead yt for ℓ > 0 and is said to lag yt for ℓ < 0. Hence, the analysis of leading and lagging relations might be important in predicting the value of yt from xt. Assuming, for convenience, that xt and yt have zero means, and the noise wt is uncorrelated with the xt series, the cross-covariance function can be computed as


γyx(h) = cov(yt+h, xt) = cov(Axt+h−ℓ + wt+h, xt)
        = cov(Axt+h−ℓ, xt) = Aγx(h − ℓ).

The cross-covariance function will look like the autocovariance of the input series xt, with a peak on the positive side if xt leads yt and a peak on the negative side if xt lags yt.
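A small simulation illustrates the idea. In the sketch below (ours; the lead ℓ = 5 and amplitude A = 2 are arbitrary choices), yt = 2xt−5 + wt, so xt leads yt, and the sample CCF of y and x, which estimates cor(yt+h, xt), should peak on the positive side near h = 5:

set.seed(666)
n = 1000; l = 5; A = 2
x.full = rnorm(n + l)
y = A*x.full[1:n] + rnorm(n)     # y_t depends on x five units back
x = x.full[(l+1):(n+l)]          # align x_t with y_t so that y_t = A x_{t-5} + w_t
ccf(y, x, lag.max=10)            # peak near lag +5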

The concept of weak stationarity forms the basis for much of the analysis performed with time series. The fundamental properties of the mean and autocovariance functions (1.21) and (1.22) are satisfied by many theoretical models that appear to generate plausible sample realizations. In Examples 1.9 and 1.10, two series were generated that produced stationary looking realizations, and in Example 1.20, we showed that the series in Example 1.9 was, in fact, weakly stationary. Both examples are special cases of the so-called linear process.

Definition 1.12 A linear process, xt, is defined to be a linear combination of white noise variates wt, and is given by

xt = µ + ∑_{j=−∞}^{∞} ψj wt−j ,    ∑_{j=−∞}^{∞} |ψj| < ∞. (1.29)

For the linear process (see Problem 1.11), we may show that the autocovariance function is given by

γ(h) = σw² ∑_{j=−∞}^{∞} ψj+h ψj (1.30)

for h ≥ 0; recall that γ(−h) = γ(h). This method exhibits the autocovariance function of the process in terms of the lagged products of the coefficients. Note that, for Example 1.9, we have ψ0 = ψ−1 = ψ1 = 1/3 and the result in Example 1.20 comes out immediately. The autoregressive series in Example 1.10 can also be put in this form, as can the general autoregressive moving average processes considered in Chapter 3.
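Formula (1.30) is simple to evaluate numerically whenever only finitely many ψ-weights are nonzero. The function below is a sketch of ours (not from the text); applied to ψ0 = ψ−1 = ψ1 = 1/3 it returns the values 3/9, 2/9, 1/9, 0 found in Example 1.20:

linproc.acov = function(psi, h, sigw2 = 1) {   # gamma(h) of (1.30), finitely many psi's
  npsi = length(psi)
  if (h >= npsi) return(0)
  sigw2 * sum(psi[(1+h):npsi] * psi[1:(npsi-h)])
}
sapply(0:3, function(h) linproc.acov(rep(1/3, 3), h))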

Finally, as previously mentioned, an important case in which a weakly stationary series is also strictly stationary is the normal or Gaussian series.

Definition 1.13 A process, {xt}, is said to be a Gaussian process if the n-dimensional vectors x = (xt1, xt2, . . . , xtn)′, for every collection of time points t1, t2, . . . , tn, and every positive integer n, have a multivariate normal distribution.

Defining the n × 1 mean vector E(x) ≡ µ = (µt1, µt2, . . . , µtn)′ and the n × n covariance matrix as var(x) ≡ Γ = {γ(ti, tj); i, j = 1, . . . , n}, which is assumed to be positive definite, the multivariate normal density function can be written as

f(x) = (2π)^{−n/2} |Γ|^{−1/2} exp{ −(1/2)(x − µ)′ Γ⁻¹ (x − µ) }, (1.31)

where | · | denotes the determinant. This distribution forms the basis for solving problems involving statistical inference for time series. If a Gaussian time series, {xt}, is weakly stationary, then µt = µ and γ(ti, tj) = γ(|ti − tj|), so that the vector µ and the matrix Γ are independent of time. These facts imply that all the finite distributions, (1.31), of the series {xt} depend only on time lag and not on the actual times, and hence the series must be strictly stationary.

1.6 Estimation of Correlation

Although the theoretical autocorrelation and cross-correlation functions are useful for describing the properties of certain hypothesized models, most of the analyses must be performed using sampled data. This limitation means the sampled points x1, x2, . . . , xn only are available for estimating the mean, autocovariance, and autocorrelation functions. From the point of view of classical statistics, this poses a problem because we will typically not have iid copies of xt that are available for estimating the covariance and correlation functions. In the usual situation with only one realization, however, the assumption of stationarity becomes critical. Somehow, we must use averages over this single realization to estimate the population means and covariance functions.

Accordingly, if a time series is stationary, the mean function (1.21) µt = µ is constant so that we can estimate it by the sample mean,

x̄ = (1/n) ∑_{t=1}^{n} xt. (1.32)

The standard error of the estimate is the square root of var(x̄), which can be computed using first principles (recall footnote 3 on page 20), and is given by

var(x̄) = var( (1/n) ∑_{t=1}^{n} xt ) = (1/n²) cov( ∑_{t=1}^{n} xt , ∑_{s=1}^{n} xs )
        = (1/n²) ( nγx(0) + (n − 1)γx(1) + (n − 2)γx(2) + · · · + γx(n − 1)
                   + (n − 1)γx(−1) + (n − 2)γx(−2) + · · · + γx(1 − n) )
        = (1/n) ∑_{h=−n}^{n} (1 − |h|/n) γx(h). (1.33)


If the process is white noise, (1.33) reduces to the familiar σx²/n, recalling that γx(0) = σx². Note that, in the case of dependence, the standard error of x̄ may be smaller or larger than in the white noise case, depending on the nature of the correlation structure (see Problem 1.19).
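For instance, for the three-point moving average of Example 1.9 with σw² = 1 and n = 100, (1.33) can be evaluated directly; the sketch below (ours) compares the result with the value γv(0)/n that would apply if the observations were uncorrelated:

n = 100
gamma.v = function(h) ifelse(abs(h) > 2, 0, (3 - abs(h))/9)   # from (1.13)
h = -(n-1):(n-1)
var.xbar = sum((1 - abs(h)/n)*gamma.v(h))/n
c(dependent = var.xbar, uncorrelated = gamma.v(0)/n)   # larger under this dependence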

The theoretical autocovariance function, (1.22), is estimated by the sample autocovariance function defined as follows.

Definition 1.14 The sample autocovariance function is defined as

γ̂(h) = n⁻¹ ∑_{t=1}^{n−h} (xt+h − x̄)(xt − x̄), (1.34)

with γ̂(−h) = γ̂(h) for h = 0, 1, . . . , n − 1.

The sum in (1.34) runs over a restricted range because xt+h is not available for t + h > n. The estimator in (1.34) is preferred to the one that would be obtained by dividing by n − h because (1.34) is a non-negative definite function. The autocovariance function, γ(h), of a stationary process is non-negative definite (see Problem 1.25) ensuring that variances of linear combinations of the variates xt will never be negative. And, because var(a1xt1 + · · · + anxtn) is never negative, the estimate of that variance should also be non-negative. The estimator in (1.34) guarantees this result, but no such guarantee exists if we divide by n − h; this is explored further in Problem 1.25. Note that neither dividing by n nor n − h in (1.34) yields an unbiased estimator of γ(h).

Definition 1.15 The sample autocorrelation function is defined, analogously to (1.23), as

ρ̂(h) = γ̂(h) / γ̂(0). (1.35)

The sample autocorrelation function has a sampling distribution that allows us to assess whether the data comes from a completely random or white series or whether correlations are statistically significant at some lags.

Property 1.1 Large-Sample Distribution of the ACF
Under general conditions,5 if xt is white noise, then for n large, the sample ACF, ρ̂x(h), for h = 1, 2, . . . , H, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by

σρ̂x(h) = 1/√n. (1.36)

5 The general conditions are that xt is iid with finite fourth moment. A sufficient condition for this to hold is that xt is white Gaussian noise. Precise details are given in Theorem A.7 in Appendix A.


Based on the previous result, we obtain a rough method of assessing whether peaks in ρ̂(h) are significant by determining whether the observed peak is outside the interval ±2/√n (or plus/minus two standard errors); for a white noise sequence, approximately 95% of the sample ACFs should be within these limits. The applications of this property develop because many statistical modeling procedures depend on reducing a time series to a white noise series using various kinds of transformations. After such a procedure is applied, the plotted ACFs of the residuals should then lie roughly within the limits given above.
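As a quick check of Property 1.1 (a sketch of ours, not from the text), one can simulate Gaussian white noise, compute the sample ACF, and count how many of the first 20 lags fall outside ±2/√n; a proportion of roughly .05 is expected:

set.seed(101)
n = 1000
w = rnorm(n)
rho = acf(w, lag.max=20, plot=FALSE)$acf[-1]   # drop lag 0
mean(abs(rho) > 2/sqrt(n))                     # proportion outside the limits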

Definition 1.16 The estimators for the cross-covariance function, γxy(h), as given in (1.26) and the cross-correlation, ρxy(h), in (1.27) are given, respectively, by the sample cross-covariance function

γ̂xy(h) = n⁻¹ ∑_{t=1}^{n−h} (xt+h − x̄)(yt − ȳ), (1.37)

where γ̂xy(−h) = γ̂yx(h) determines the function for negative lags, and the sample cross-correlation function

ρ̂xy(h) = γ̂xy(h) / √(γ̂x(0)γ̂y(0)). (1.38)

The sample cross-correlation function can be examined graphically as a function of lag h to search for leading or lagging relations in the data using the property mentioned in Example 1.22 for the theoretical cross-covariance function. Because −1 ≤ ρ̂xy(h) ≤ 1, the practical importance of peaks can be assessed by comparing their magnitudes with their theoretical maximum values. Furthermore, for xt and yt independent linear processes of the form (1.29), we have the following property.

Property 1.2 Large-Sample Distribution of Cross-Correlation Under Independence
The large sample distribution of ρ̂xy(h) is normal with mean zero and

σρ̂xy = 1/√n (1.39)

if at least one of the processes is independent white noise (see Theorem A.8 in Appendix A).

Example 1.23 A Simulated Time Series

To give an example of the procedure for calculating numerically the autocovariance and cross-covariance functions, consider a contrived set of data generated by tossing a fair coin, letting xt = 1 when a head is obtained and xt = −1 when a tail is obtained.

Table 1.1. Sample Realization of the Contrived Series yt

t         1     2     3     4     5     6     7     8     9    10
Coin      H     H     T     H     T     T     T     H     T     H
xt        1     1    −1     1    −1    −1    −1     1    −1     1
yt       6.7   5.3   3.3   6.7   3.3   4.7   4.7   6.7   3.3   6.7
yt − ȳ  1.56   .16 −1.84  1.56 −1.84  −.44  −.44  1.56 −1.84  1.56

Construct yt as

yt = 5 + xt − .7xt−1. (1.40)

Table 1.1 shows sample realizations of the appropriate processes with x0 = −1 and n = 10.

The sample autocorrelation for the series yt can be calculated using (1.34) and (1.35) for h = 0, 1, 2, . . .. It is not necessary to calculate for negative values because of the symmetry. For example, for h = 3, the autocorrelation becomes the ratio of

γ̂y(3) = (1/10) ∑_{t=1}^{7} (yt+3 − ȳ)(yt − ȳ)
       = (1/10)[(1.56)(1.56) + (−1.84)(.16) + (−.44)(−1.84) + (−.44)(1.56)
                + (1.56)(−1.84) + (−1.84)(−.44) + (1.56)(−.44)]
       = −.048

to

γ̂y(0) = (1/10)[(1.56)² + (.16)² + · · · + (1.56)²] = 2.030

so that

ρ̂y(3) = −.048/2.030 = −.024.

The theoretical ACF can be obtained from the model (1.40) using the fact that the mean of xt is zero and the variance of xt is one. It can be shown that

ρy(1) = −.7/(1 + .7²) = −.47

and ρy(h) = 0 for |h| > 1 (Problem 1.24). Table 1.2 compares the theoretical ACF with sample ACFs for a realization where n = 10 and another realization where n = 100; we note the increased variability in the smaller size sample.


Table 1.2. Theoretical and Sample ACFs for n = 10 and n = 100

  h     ρy(h)    ρ̂y(h), n = 10    ρ̂y(h), n = 100
  0      1.00         1.00              1.00
 ±1      −.47         −.55              −.45
 ±2       .00          .17              −.12
 ±3       .00         −.02               .14
 ±4       .00          .15               .01
 ±5       .00         −.46              −.01
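The simulation behind Table 1.2 is easy to repeat. The sketch below (ours; sample() stands in for the coin tosses) generates the series (1.40) and prints its sample ACF; rerunning it with n = 10 and n = 100 shows the difference in variability noted above:

set.seed(1)
n = 100                                   # try n = 10 as well
x = sample(c(-1, 1), n+1, replace=TRUE)   # coin tosses: heads = 1, tails = -1
y = 5 + x[-1] - .7*x[-(n+1)]              # y_t = 5 + x_t - .7 x_{t-1}
round(acf(y, lag.max=5, plot=FALSE)$acf, 2)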

Example 1.24 ACF of a Speech Signal

Computing the sample ACF as in the previous example can be thought of as matching the time series h units in the future, say, xt+h, against itself, xt. Figure 1.13 shows the ACF of the speech series of Figure 1.3. The original series appears to contain a sequence of repeating short signals. The ACF confirms this behavior, showing repeating peaks spaced at about 106-109 points. Autocorrelation functions of the short signals appear, spaced at the intervals mentioned above. The distance between the repeating signals is known as the pitch period and is a fundamental parameter of interest in systems that encode and decipher speech. Because the series is sampled at 10,000 points per second, the pitch period appears to be between .0106 and .0109 seconds.

To put the data into speech as a time series object (if it is not there already from Example 1.3) and compute the sample ACF in R, use

acf(speech, 250)

Example 1.25 SOI and Recruitment Correlation Analysis

The autocorrelation and cross-correlation functions are also useful for analyzing the joint behavior of two stationary series whose behavior may be related in some unspecified way. In Example 1.5 (see Figure 1.5), we have considered simultaneous monthly readings of the SOI and the number of new fish (Recruitment) computed from a model. Figure 1.14 shows the autocorrelation and cross-correlation functions (ACFs and CCF) for these two series. Both of the ACFs exhibit periodicities corresponding to the correlation between values separated by 12 units. Observations 12 months or one year apart are strongly positively correlated, as are observations at multiples such as 24, 36, 48, . . . Observations separated by six months are negatively correlated, showing that positive excursions tend to be associated with negative excursions six months removed. This appearance is rather characteristic of the pattern that would be produced by a sinusoidal component with a period of 12 months. The cross-correlation function peaks at h = −6, showing that the SOI measured at time t − 6 months is associated with the Recruitment series at time t. We could say the SOI leads the Recruitment series by


Fig. 1.13. ACF of the speech series.

six months. The sign of the CCF is negative, leading to the conclusion that the two series move in different directions; that is, increases in SOI lead to decreases in Recruitment and vice versa. Again, note the periodicity of 12 months in the CCF. The flat lines shown on the plots indicate ±2/√453, so that upper values would be exceeded about 2.5% of the time if the noise were white [see (1.36) and (1.39)].

To reproduce Figure 1.14 in R, use the following commands:

par(mfrow=c(3,1))
acf(soi, 48, main="Southern Oscillation Index")
acf(rec, 48, main="Recruitment")
ccf(soi, rec, 48, main="SOI vs Recruitment", ylab="CCF")

1.7 Vector-Valued and Multidimensional Series

We frequently encounter situations in which the relationships between a number of jointly measured time series are of interest. For example, in the previous sections, we considered discovering the relationships between the SOI and Recruitment series. Hence, it will be useful to consider the notion of a vector time series xt = (xt1, xt2, . . . , xtp)′, which contains as its components p univariate time series. We denote the p × 1 column vector of the observed series as xt. The row vector x′t is its transpose. For the stationary case, the p × 1 mean vector

µ = E(xt) (1.41)

of the form µ = (µt1, µt2, . . . , µtp)′ and the p × p autocovariance matrix


Fig. 1.14. Sample ACFs of the SOI series (top) and of the Recruitment series (middle), and the sample CCF of the two series (bottom); negative lags indicate SOI leads Recruitment. The lag axes are in terms of seasons (12 months).

Γ(h) = E[(xt+h − µ)(xt − µ)′] (1.42)

can be defined, where the elements of the matrix Γ(h) are the cross-covariance functions

γij(h) = E[(xt+h,i − µi)(xtj − µj)] (1.43)

for i, j = 1, . . . , p. Because γij(h) = γji(−h), it follows that

Γ (−h) = Γ ′(h). (1.44)

Now, the sample autocovariance matrix of the vector series xt is the p × p matrix of sample cross-covariances, defined as

Γ̂(h) = n⁻¹ ∑_{t=1}^{n−h} (xt+h − x̄)(xt − x̄)′, (1.45)


Fig. 1.15. Two-dimensional time series of temperature measurements taken on a rectangular field (64 × 36 with 17-foot spacing). Data are from Bazza et al. (1988).

where

x̄ = n⁻¹ ∑_{t=1}^{n} xt (1.46)

denotes the p × 1 sample mean vector. The symmetry property of the theoretical autocovariance (1.44) extends to the sample autocovariance (1.45), which is defined for negative values by taking

Γ̂(−h) = Γ̂(h)′. (1.47)

In many applied problems, an observed series may be indexed by more than time alone. For example, the position in space of an experimental unit might be described by two coordinates, say, s1 and s2. We may proceed in these cases by defining a multidimensional process xs as a function of the r × 1 vector s = (s1, s2, . . . , sr)′, where si denotes the coordinate of the ith index.

Example 1.26 Soil Surface Temperatures

As an example, the two-dimensional (r = 2) temperature series xs1,s2 in Figure 1.15 is indexed by a row number s1 and a column number s2 that represent positions on a 64 × 36 spatial grid set out on an agricultural field. The value of the temperature measured at row s1 and column s2 is denoted by xs = xs1,s2. We can note from the two-dimensional plot that a distinct change occurs in the character of the two-dimensional surface starting at about row 40, where the oscillations along the row axis become fairly stable and periodic. For example, averaging over the 36 columns, we may compute an average value for each s1 as in Figure 1.16. It is clear that the noise present in the first part of the two-dimensional series is nicely averaged out, and we see a clear and consistent temperature signal.

To generate Figures 1.15 and 1.16 in R, use the following commands:

persp(1:64, 1:36, soiltemp, phi=30, theta=30, scale=FALSE, expand=4,
      ticktype="detailed", xlab="rows", ylab="cols", zlab="temperature")
plot.ts(rowMeans(soiltemp), xlab="row", ylab="Average Temperature")

The autocovariance function of a stationary multidimensional process, xs, can be defined as a function of the multidimensional lag vector, say, h = (h1, h2, . . . , hr)′, as

γ(h) = E[(xs+h − µ)(xs − µ)], (1.48)

where

µ = E(xs) (1.49)

does not depend on the spatial coordinate s. For the two-dimensional temperature process, (1.48) becomes

γ(h1, h2) = E[(xs1+h1,s2+h2 − µ)(xs1,s2 − µ)], (1.50)

which is a function of lag, both in the row (h1) and column (h2) directions. The multidimensional sample autocovariance function is defined as

γ̂(h) = (S1S2 · · · Sr)⁻¹ ∑_{s1} ∑_{s2} · · · ∑_{sr} (xs+h − x̄)(xs − x̄), (1.51)

where s = (s1, s2, . . . , sr)′ and the range of summation for each argument is 1 ≤ si ≤ Si − hi, for i = 1, . . . , r. The mean is computed over the r-dimensional array, that is,

x̄ = (S1S2 · · · Sr)⁻¹ ∑_{s1} ∑_{s2} · · · ∑_{sr} xs1,s2,...,sr , (1.52)

where the arguments si are summed over 1 ≤ si ≤ Si. The multidimensional sample autocorrelation function follows, as usual, by taking the scaled ratio

ρ̂(h) = γ̂(h)/γ̂(0). (1.53)


Fig. 1.16. Row averages of the two-dimensional soil temperature profile, x̄s1 = ∑_{s2} xs1,s2/36.

Example 1.27 Sample ACF of the Soil Temperature Series

The autocorrelation function of the two-dimensional (2d) temperature process can be written in the form

ρ̂(h1, h2) = γ̂(h1, h2)/γ̂(0, 0),

where

γ̂(h1, h2) = (S1S2)⁻¹ ∑_{s1} ∑_{s2} (xs1+h1,s2+h2 − x̄)(xs1,s2 − x̄).

Figure 1.17 shows the autocorrelation function for the temperature data, and we note the systematic periodic variation that appears along the rows. The autocovariance over columns seems to be strongest for h1 = 0, implying columns may form replicates of some underlying process that has a periodicity over the rows. This idea can be investigated by examining the mean series over columns as shown in Figure 1.16.

The easiest way (that we know of) to calculate a 2d ACF in R is by using the fast Fourier transform (FFT) as shown below. Unfortunately, the material needed to understand this approach is given in Chapter 4, §4.4. The 2d autocovariance function is obtained in two steps and is contained in cs below; γ̂(0, 0) is the (1,1) element so that ρ̂(h1, h2) is obtained by dividing each element by that value. The 2d ACF is contained in rs below, and the rest of the code is simply to arrange the results to yield a nice display.


Fig. 1.17. Two-dimensional autocorrelation function for the soil temperature data.

fs = abs(fft(soiltemp-mean(soiltemp)))^2/(64*36)
cs = Re(fft(fs, inverse=TRUE)/sqrt(64*36))  # ACovF
rs = cs/cs[1,1]                             # ACF
rs2 = cbind(rs[1:41,21:2], rs[1:41,1:21])
rs3 = rbind(rs2[41:2,], rs2)
par(mar = c(1,2.5,0,0)+.1)
persp(-40:40, -20:20, rs3, phi=30, theta=30, expand=30,
      scale=FALSE, ticktype="detailed", xlab="row lags",
      ylab="column lags", zlab="ACF")

The sampling requirements for multidimensional processes are rather severe because values must be available over some uniform grid in order to compute the ACF. In some areas of application, such as in soil science, we may prefer to sample a limited number of rows or transects and hope these are essentially replicates of the basic underlying phenomenon of interest. One-dimensional methods can then be applied. When observations are irregular in time space, modifications to the estimators need to be made. Systematic approaches to the problems introduced by irregularly spaced observations have been developed by Journel and Huijbregts (1978) or Cressie (1993). We shall not pursue such methods in detail here, but it is worth noting that the introduction of the variogram


2Vx(h) = var{xs+h − xs} (1.54)

and its sample estimator

2V̂x(h) = (1/N(h)) ∑_{s} (xs+h − xs)² (1.55)

play key roles, where N(h) denotes both the number of points located within h, and the sum runs over the points in the neighborhood. Clearly, substantial indexing difficulties will develop from estimators of the kind, and often it will be difficult to find non-negative definite estimators for the covariance function. Problem 1.27 investigates the relation between the variogram and the autocovariance function in the stationary case.
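For a one-dimensional stationary series the sample variogram (1.55) is simple to compute, and under stationarity it should be close to γ̂(0) − γ̂(h), the relation explored in Problem 1.27. A brief sketch of ours (using a simulated AR(1) series purely for illustration):

set.seed(2)
x = filter(rnorm(2000), .6, method="recursive")   # a stationary series for illustration
vario = function(x, h) mean((x[(1+h):length(x)] - x[1:(length(x)-h)])^2)/2
g = acf(x, lag.max=5, type="covariance", plot=FALSE)$acf
cbind(variogram = sapply(1:5, function(h) vario(x, h)), gamma.diff = g[1] - g[2:6])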

Problems

Section 1.2

1.1 To compare the earthquake and explosion signals, plot the data displayed in Figure 1.7 on the same graph using different colors or different line types and comment on the results. (The R code in Example 1.11 may be of help on how to add lines to existing plots.)

1.2 Consider a signal-plus-noise model of the general form xt = st + wt, where wt is Gaussian white noise with σw² = 1. Simulate and plot n = 200 observations from each of the following two models (save the data or your code for use in Problem 1.22):

(a) xt = st + wt, for t = 1, ..., 200, where

st = { 0,                                  t = 1, . . . , 100
       10 exp{−(t − 100)/20} cos(2πt/4),   t = 101, . . . , 200.

Hint:
s = c(rep(0,100), 10*exp(-(1:100)/20)*cos(2*pi*1:100/4))
x = ts(s + rnorm(200, 0, 1))
plot(x)

(b) xt = st + wt, for t = 1, . . . , 200, where

st = { 0,                                   t = 1, . . . , 100
       10 exp{−(t − 100)/200} cos(2πt/4),   t = 101, . . . , 200.

(c) Compare the general appearance of the series (a) and (b) with the earthquake series and the explosion series shown in Figure 1.7. In addition, plot (or sketch) and compare the signal modulators (a) exp{−t/20} and (b) exp{−t/200}, for t = 1, 2, . . . , 100.


Section 1.3

1.3 (a) Generate n = 100 observations from the autoregression

xt = −.9xt−2 + wt

with σw = 1, using the method described in Example 1.10, page 13. Next, apply the moving average filter

vt = (xt + xt−1 + xt−2 + xt−3)/4

to xt, the data you generated. Now plot xt as a line and superimpose vt as a dashed line. Comment on the behavior of xt and how applying the moving average filter changes that behavior. [Hints: Use v = filter(x, rep(1/4, 4), sides = 1) for the filter and note that the R code in Example 1.11 may be of help on how to add lines to existing plots.]

(b) Repeat (a) but with xt = cos(2πt/4).

(c) Repeat (b) but with added N(0, 1) noise,

xt = cos(2πt/4) + wt.

(d) Compare and contrast (a)–(c).

Section 1.4

1.4 Show that the autocovariance function can be written as

γ(s, t) = E[(xs − µs)(xt − µt)] = E(xsxt)− µsµt,

where E[xt] = µt.

1.5 For the two series, xt, in Problem 1.2 (a) and (b):

(a) Compute and plot the mean functions µx(t), for t = 1, . . . , 200.
(b) Calculate the autocovariance functions, γx(s, t), for s, t = 1, . . . , 200.

Section 1.5

1.6 Consider the time series

xt = β1 + β2t+ wt,

where β1 and β2 are known constants and wt is a white noise process with variance σw².

(a) Determine whether xt is stationary.
(b) Show that the process yt = xt − xt−1 is stationary.


(c) Show that the mean of the moving average

vt = (1/(2q + 1)) ∑_{j=−q}^{q} xt−j

is β1 + β2t, and give a simplified expression for the autocovariance function.

1.7 For a moving average process of the form

xt = wt−1 + 2wt + wt+1,

where wt are independent with zero means and variance σw², determine the autocovariance and autocorrelation functions as a function of lag h = s − t and plot the ACF as a function of h.

1.8 Consider the random walk with drift model

xt = δ + xt−1 + wt,

for t = 1, 2, . . . , with x0 = 0, where wt is white noise with variance σw².

(a) Show that the model can be written as xt = δt + ∑_{k=1}^{t} wk.
(b) Find the mean function and the autocovariance function of xt.
(c) Argue that xt is not stationary.
(d) Show ρx(t − 1, t) = √((t − 1)/t) → 1 as t → ∞. What is the implication of this result?
(e) Suggest a transformation to make the series stationary, and prove that the transformed series is stationary. (Hint: See Problem 1.6b.)

1.9 A time series with a periodic component can be constructed from

xt = U1 sin(2πω0t) + U2 cos(2πω0t),

where U1 and U2 are independent random variables with zero means and E(U1²) = E(U2²) = σ². The constant ω0 determines the period or time it takes the process to make one complete cycle. Show that this series is weakly stationary with autocovariance function

γ(h) = σ2 cos(2πω0h).

1.10 Suppose we would like to predict a single stationary series xt with zero mean and autocovariance function γ(h) at some time in the future, say, t + ℓ, for ℓ > 0.

(a) If we predict using only xt and some scale multiplier A, show that the mean-square prediction error

MSE(A) = E[(xt+ℓ − Axt)²]

is minimized by the value A = ρ(ℓ).


(b) Show that the minimum mean-square prediction error is

MSE(A) = γ(0)[1 − ρ²(ℓ)].

(c) Show that if xt+ℓ = Axt, then ρ(ℓ) = 1 if A > 0, and ρ(ℓ) = −1 if A < 0.

1.11 Consider the linear process defined in (1.29).

(a) Verify that the autocovariance function of the process is given by (1.30). Use the result to verify your answer to Problem 1.7.

(b) Show that xt exists as a limit in mean square (see Appendix A).

1.12 For two weakly stationary series xt and yt, verify (1.28).

1.13 Consider the two series

xt = wt
yt = wt − θwt−1 + ut,

where wt and ut are independent white noise series with variances σw² and σu², respectively, and θ is an unspecified constant.

(a) Express the ACF, ρy(h), for h = 0, ±1, ±2, . . . of the series yt as a function of σw², σu², and θ.
(b) Determine the CCF, ρxy(h), relating xt and yt.
(c) Show that xt and yt are jointly stationary.

1.14 Let xt be a stationary normal process with mean µx and autocovariance function γ(h). Define the nonlinear time series

yt = exp{xt}.

(a) Express the mean function E(yt) in terms of µx and γ(0). The moment generating function of a normal random variable x with mean µ and variance σ² is

Mx(λ) = E[exp{λx}] = exp{µλ + (1/2)σ²λ²}.

(b) Determine the autocovariance function of yt. The sum of the two normal random variables xt+h + xt is still a normal random variable.

1.15 Let wt, for t = 0, ±1, ±2, . . . be a normal white noise process, and consider the series

xt = wt wt−1.

Determine the mean and autocovariance function of xt, and state whether it is stationary.


1.16 Consider the series

xt = sin(2πUt),

t = 1, 2, . . ., where U has a uniform distribution on the interval (0, 1).

(a) Prove xt is weakly stationary.
(b) Prove xt is not strictly stationary. [Hint: consider the joint bivariate cdf (1.18) at the points t = 1, s = 2 with h = 1, and find values of ct, cs where strict stationarity does not hold.]

1.17 Suppose we have the linear process xt generated by

xt = wt − θwt−1,

t = 0, 1, 2, . . ., where {wt} is independent and identically distributed with characteristic function φw(·), and θ is a fixed constant. [Replace “characteristic function” with “moment generating function” if instructed to do so.]

(a) Express the joint characteristic function of x1, x2, . . . , xn, say,

φx1,x2,...,xn(λ1, λ2, . . . , λn),

in terms of φw(·).
(b) Deduce from (a) that xt is strictly stationary.

1.18 Suppose that xt is a linear process of the form (1.29). Prove

∑_{h=−∞}^{∞} |γ(h)| < ∞.

Section 1.6

1.19 Suppose x1, . . . , xn is a sample from the process xt = µ + wt − .8wt−1, where wt ∼ wn(0, σw²).

(a) Show that the mean function is E(xt) = µ.
(b) Use (1.33) to calculate the standard error of x̄ for estimating µ.
(c) Compare (b) to the case where xt is white noise and show that (b) is smaller. Explain the result.

1.20 (a) Simulate a series of n = 500 Gaussian white noise observations as in Example 1.8 and compute the sample ACF, ρ̂(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, ρ(h). [Recall Example 1.19.]

(b) Repeat part (a) using only n = 50. How does changing n affect the results?

1.21 (a) Simulate a series of n = 500 moving average observations as in Example 1.9 and compute the sample ACF, ρ̂(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, ρ(h). [Recall Example 1.20.]

(b) Repeat part (a) using only n = 50. How does changing n affect the results?


1.22 Although the model in Problem 1.2(a) is not stationary (Why?), the sample ACF can be informative. For the data you generated in that problem, calculate and plot the sample ACF, and then comment.

1.23 Simulate a series of n = 500 observations from the signal-plus-noise model presented in Example 1.12 with σw² = 1. Compute the sample ACF to lag 100 of the data you generated and comment.

1.24 For the time series yt described in Example 1.23, verify the stated result that ρy(1) = −.47 and ρy(h) = 0 for h > 1.

1.25 A real-valued function g(t), defined on the integers, is non-negative definite if and only if

∑_{i=1}^{n} ∑_{j=1}^{n} ai g(ti − tj) aj ≥ 0

for all positive integers n and for all vectors a = (a1, a2, . . . , an)′ and t = (t1, t2, . . . , tn)′. For the matrix G = {g(ti − tj); i, j = 1, 2, . . . , n}, this implies that a′Ga ≥ 0 for all vectors a. It is called positive definite if we can replace ‘≥’ with ‘>’ for all a ≠ 0, the zero vector.

(a) Prove that γ(h), the autocovariance function of a stationary process, is a non-negative definite function.
(b) Verify that the sample autocovariance γ̂(h) is a non-negative definite function.

Section 1.7

1.26 Consider a collection of time series x1t, x2t, . . . , xNt that are observing some common signal µt observed in noise processes e1t, e2t, . . . , eNt, with a model for the j-th observed series given by

xjt = µt + ejt.

Suppose the noise series have zero means and are uncorrelated for different j. The common autocovariance functions of all series are given by γe(s, t). Define the sample mean

x̄t = (1/N) ∑_{j=1}^{N} xjt.

(a) Show that E[x̄t] = µt.
(b) Show that E[(x̄t − µt)²] = N⁻¹γe(t, t).
(c) How can we use the results in estimating the common signal?


1.27 A concept used in geostatistics, see Journel and Huijbregts (1978) or Cressie (1993), is that of the variogram, defined for a spatial process xs, s = (s1, s2), for s1, s2 = 0, ±1, ±2, ..., as

Vx(h) = (1/2) E[(xs+h − xs)²],

where h = (h1, h2), for h1, h2 = 0, ±1, ±2, ... Show that, for a stationary process, the variogram and autocovariance functions can be related through

Vx(h) = γ(0) − γ(h),

where γ(h) is the usual lag h covariance function and 0 = (0, 0). Note the easy extension to any spatial dimension.

The following problems require the material given in Appendix A

1.28 Suppose xt = β0 + β1t, where β0 and β1 are constants. Prove as n → ∞, ρ̂x(h) → 1 for fixed h, where ρ̂x(h) is the ACF (1.35).

1.29 (a) Suppose xt is a weakly stationary time series with mean zero and with absolutely summable autocovariance function, γ(h), such that

∑_{h=−∞}^{∞} γ(h) = 0.

Prove that √n x̄ →p 0, where x̄ is the sample mean (1.32).

(b) Give an example of a process that satisfies the conditions of part (a). What is special about this process?

1.30 Let xt be a linear process of the form (A.43)–(A.44). If we define

γ̃(h) = n⁻¹ ∑_{t=1}^{n} (xt+h − µx)(xt − µx),

show that

n^{1/2} ( γ̃(h) − γ̂(h) ) = op(1).

Hint: The Markov Inequality

P{|x| ≥ ε} ≤ E|x|/ε

can be helpful for the cross-product terms.

1.31 For a linear process of the form

xt = ∑_{j=0}^{∞} φ^j wt−j ,

Page 59: Time Series Analysis and Its Applications: With R Examples ... series analysis.pdf · Prof. Robert H. Shumway. To my wife, Ruth, for her support and joie de vivre, and to the memory

46 1 Characteristics of Time Series

where {wt} satisfies the conditions of Theorem A.7 and |φ| < 1, show that

√n (ρ̂x(1) − ρx(1)) / √(1 − ρx²(1)) →d N(0, 1),

and construct a 95% confidence interval for φ when ρ̂x(1) = .64 and n = 100.

1.32 Let {xt; t = 0,±1,±2, . . .} be iid(0, σ2).

(a) For h ≥ 1 and k ≥ 1, show that xt xt+h and xs xs+k are uncorrelated for all s ≠ t.

(b) For fixed h ≥ 1, show that the h× 1 vector

σ⁻² n^{−1/2} ∑_{t=1}^{n} (xt xt+1, . . . , xt xt+h)′ →d (z1, . . . , zh)′

where z1, . . . , zh are iid N(0, 1) random variables. [Note: the sequence {xt xt+h; t = 1, 2, . . .} is h-dependent and white noise (0, σ⁴). Also, recall the Cramer–Wold device.]

(c) Show, for each h ≥ 1,

n^{−1/2} [ ∑_{t=1}^{n} xt xt+h − ∑_{t=1}^{n−h} (xt − x̄)(xt+h − x̄) ] →p 0 as n → ∞,

where x̄ = n⁻¹ ∑_{t=1}^{n} xt.

(d) Noting that n⁻¹ ∑_{t=1}^{n} xt² →p σ², conclude that

n^{1/2} [ρ̂(1), . . . , ρ̂(h)]′ →d (z1, . . . , zh)′

where ρ̂(h) is the sample ACF of the data x1, . . . , xn.

2

Time Series Regression and Exploratory Data Analysis

2.1 Introduction

The linear model and its applications are at least as dominant in the time series context as in classical statistics. Regression models are important for time domain models discussed in Chapters 3, 5, and 6, and in the frequency domain models considered in Chapters 4 and 7. The primary ideas depend on being able to express a response series, say xt, as a linear combination of inputs, say zt1, zt2, . . . , ztq. Estimating the coefficients β1, β2, . . . , βq in the linear combinations by least squares provides a method for modeling xt in terms of the inputs.

In the time domain applications of Chapter 3, for example, we will express xt as a linear combination of previous values xt−1, xt−2, . . . , xt−p, of the currently observed series. The outputs xt may also depend on lagged values of another series, say yt−1, yt−2, . . . , yt−q, that have influence. It is easy to see that forecasting becomes an option when prediction models can be formulated in this form. Time series smoothing and filtering can be expressed in terms of local regression models. Polynomials and regression splines also provide important techniques for smoothing.

If one admits sines and cosines as inputs, the frequency domain ideas that lead to the periodogram and spectrum of Chapter 4 follow from a regression model. Extensions to filters of infinite extent can be handled using regression in the frequency domain. In particular, many regression problems in the frequency domain can be carried out as a function of the periodic components of the input and output series, providing useful scientific intuition into fields like acoustics, oceanographics, engineering, biomedicine, and geophysics.

The above considerations motivate us to include a separate chapter on regression and some of its applications that is written on an elementary level and is formulated in terms of time series. The assumption of linearity, stationarity, and homogeneity of variances over time is critical in the regression context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis.

2.2 Classical Regression in the Time Series Context

We begin our discussion of linear regression in the time series context by assuming some output or dependent time series, say, xt, for t = 1, . . . , n, is being influenced by a collection of possible inputs or independent series, say, zt1, zt2, . . . , ztq, where we first regard the inputs as fixed and known. This assumption, necessary for applying conventional linear regression, will be relaxed later on. We express this relation through the linear regression model

xt = β1zt1 + β2zt2 + · · · + βqztq + wt, (2.1)

where β1, β2, . . . , βq are unknown fixed regression coefficients, and {wt} is a random error or noise process consisting of independent and identically distributed (iid) normal variables with mean zero and variance σ²w; we will relax the iid assumption later. A more general setting within which to embed mean square estimation and linear regression is given in Appendix B, where we introduce Hilbert spaces and the Projection Theorem.

Example 2.1 Estimating a Linear Trend

Consider the global temperature data, say xt, shown in Figures 1.2 and 2.1. As discussed in Example 1.2, there is an apparent upward trend in the series that has been used to argue the global warming hypothesis. We might use simple linear regression to estimate that trend by fitting the model

xt = β1 + β2t + wt, t = 1880, 1881, . . . , 2009.

This is in the form of the regression model (2.1) when we make the identification q = 2, zt1 = 1 and zt2 = t. Note that we are making the assumption that the errors, wt, are an iid normal sequence, which may not be true. We will address this problem further in §2.3; the problem of autocorrelated errors is discussed in detail in §5.5. Also note that we could have used, for example, t = 1, . . . , 130, without affecting the interpretation of the slope coefficient, β2; only the intercept, β1, would be affected.

Using simple linear regression, we obtained the estimated coefficients β̂1 = −11.2, and β̂2 = .006 (with a standard error of .0003) yielding a highly significant estimated increase of .6 degrees centigrade per 100 years. We discuss the precise way in which the solution was accomplished after the example. Finally, Figure 2.1 shows the global temperature data, say xt, with the estimated trend, say x̂t = −11.2 + .006 t, superimposed. It is apparent that the estimated trend line obtained via simple linear regression does not quite capture the trend of the data and better models will be needed.

To perform this analysis in R, use the following commands:


Fig. 2.1. Global temperature deviations shown in Figure 1.2 with fitted linear trend line.

summary(fit <- lm(gtemp~time(gtemp)))  # regress gtemp on time
plot(gtemp, type="o", ylab="Global Temperature Deviation")
abline(fit)  # add regression line to the plot

The linear model described by (2.1) above can be conveniently written in a more general notation by defining the column vectors zt = (zt1, zt2, . . . , ztq)′ and β = (β1, β2, . . . , βq)′, where ′ denotes transpose, so (2.1) can be written in the alternate form

xt = β′zt + wt. (2.2)

where wt ∼ iid N(0, σ²w). It is natural to consider estimating the unknown coefficient vector β by minimizing the error sum of squares

Q = ∑_{t=1}^n w²t = ∑_{t=1}^n (xt − β′zt)², (2.3)

with respect to β1, β2, . . . , βq. Minimizing Q yields the ordinary least squares estimator of β. This minimization can be accomplished by differentiating (2.3) with respect to the vector β or by using the properties of projections. In the notation above, this procedure gives the normal equations

( ∑_{t=1}^n zt zt′ ) β̂ = ∑_{t=1}^n zt xt. (2.4)

The notation can be simplified by defining Z = [z1 | z2 | · · · | zn]′ as the n × q matrix composed of the n samples of the input variables, the observed n × 1 vector x = (x1, x2, . . . , xn)′ and the n × 1 vector of errors w = (w1, w2, . . . , wn)′. In this case, model (2.2) may be written as

x = Zβ + w. (2.5)

The normal equations, (2.4), can now be written as

(Z′Z) β̂ = Z′x (2.6)

and the solution

β̂ = (Z′Z)^{-1} Z′x (2.7)

when the matrix Z′Z is nonsingular. The minimized error sum of squares (2.3), denoted SSE, can be written as

SSE = ∑_{t=1}^n (xt − β̂′zt)²
    = (x − Zβ̂)′(x − Zβ̂)
    = x′x − β̂′Z′x
    = x′x − x′Z(Z′Z)^{-1}Z′x, (2.8)

to give some useful versions for later reference. The ordinary least squares estimators are unbiased, i.e., E(β̂) = β, and have the smallest variance within the class of linear unbiased estimators.

If the errors wt are normally distributed, β̂ is also the maximum likelihood estimator for β and is normally distributed with

cov(β̂) = σ²w ( ∑_{t=1}^n zt zt′ )^{-1} = σ²w (Z′Z)^{-1} = σ²w C, (2.9)

where

C = (Z′Z)^{-1} (2.10)

is a convenient notation for later equations. An unbiased estimator for the variance σ²w is

s²w = MSE = SSE/(n − q), (2.11)

where MSE denotes the mean squared error, which is contrasted with the maximum likelihood estimator σ̂²w = SSE/n. Under the normal assumption, s²w is distributed proportionally to a chi-squared random variable with n − q degrees of freedom, denoted by χ²n−q, and independently of β̂. It follows that

tn−q = (β̂i − βi) / (sw √cii) (2.12)

has the t-distribution with n − q degrees of freedom; cii denotes the i-th diagonal element of C, as defined in (2.10).


Table 2.1. Analysis of Variance for Regression

Source                     df      Sum of Squares        Mean Square
zt,r+1, . . . , zt,q       q − r   SSR = SSEr − SSE      MSR = SSR/(q − r)
Error                      n − q   SSE                   MSE = SSE/(n − q)
Total                      n − r   SSEr

Various competing models are of interest to isolate or select the best subset of independent variables. Suppose a proposed model specifies that only a subset r < q independent variables, say, zt:r = (zt1, zt2, . . . , ztr)′ is influencing the dependent variable xt. The reduced model is

x = Zr βr + w (2.13)

where βr = (β1, β2, . . . , βr)′ is a subset of coefficients of the original q variables and Zr = [z1:r | · · · | zn:r]′ is the n × r matrix of inputs. The null hypothesis in this case is H0: βr+1 = · · · = βq = 0. We can test the reduced model (2.13) against the full model (2.2) by comparing the error sums of squares under the two models using the F-statistic

Fq−r,n−q = [(SSEr − SSE)/(q − r)] / [SSE/(n − q)], (2.14)

which has the central F-distribution with q − r and n − q degrees of freedom when (2.13) is the correct model. Note that SSEr is the error sum of squares under the reduced model (2.13) and it can be computed by replacing Z with Zr in (2.8). The statistic, which follows from applying the likelihood ratio criterion, has the improvement per number of parameters added in the numerator compared with the error sum of squares under the full model in the denominator. The information involved in the test procedure is often summarized in an Analysis of Variance (ANOVA) table as given in Table 2.1 for this particular case. The difference in the numerator is often called the regression sum of squares.

In terms of Table 2.1, it is conventional to write the F-statistic (2.14) as the ratio of the two mean squares, obtaining

Fq−r,n−q = MSR/MSE, (2.15)

where MSR, the mean squared regression, is the numerator of (2.14). A special case of interest is r = 1 and zt1 ≡ 1, when the model in (2.13) becomes

xt = β1 + wt,

and we may measure the proportion of variation accounted for by the other variables using


R² = (SSE1 − SSE) / SSE1, (2.16)

where the residual sum of squares under the reduced model,

SSE1 = ∑_{t=1}^n (xt − x̄)², (2.17)

in this case is just the sum of squared deviations from the mean, x̄. The measure R² is also the squared multiple correlation between xt and the variables zt2, zt3, . . . , ztq.

The techniques discussed in the previous paragraph can be used to test various models against one another using the F-test given in (2.14), (2.15), and the ANOVA table. These tests have been used in the past in a stepwise manner, where variables are added or deleted when the values from the F-test either exceed or fail to exceed some predetermined levels. The procedure, called stepwise multiple regression, is useful in arriving at a set of useful variables. An alternative is to focus on a procedure for model selection that does not proceed sequentially, but simply evaluates each model on its own merits. Suppose we consider a normal regression model with k coefficients and denote the maximum likelihood estimator for the variance as

σ̂²k = SSEk / n, (2.18)

where SSEk denotes the residual sum of squares under the model with k regression coefficients. Then, Akaike (1969, 1973, 1974) suggested measuring the goodness of fit for this particular model by balancing the error of the fit against the number of parameters in the model; we define the following.1

Definition 2.1 Akaike’s Information Criterion (AIC)

AIC = log σ̂²k + (n + 2k)/n, (2.19)

where σ̂²k is given by (2.18) and k is the number of parameters in the model.

The value of k yielding the minimum AIC specifies the best model. The idea is roughly that minimizing σ̂²k would be a reasonable objective, except that it decreases monotonically as k increases. Therefore, we ought to penalize the error variance by a term proportional to the number of parameters. The choice for the penalty term given by (2.19) is not the only one, and a considerable literature is available advocating different penalty terms.

1 Formally, AIC is defined as −2 log Lk + 2k where Lk is the maximized log-likelihood and k is the number of parameters in the model. For the normal regression problem, AIC can be reduced to the form given by (2.19). AIC is an estimate of the Kullback-Leibler discrepancy between a true model and a candidate model; see Problems 2.4 and 2.5 for further details.


A corrected form, suggested by Sugiura (1978), and expanded by Hurvich and Tsai (1989), can be based on small-sample distributional results for the linear regression model (details are provided in Problems 2.4 and 2.5). The corrected form is defined as follows.

Definition 2.2 AIC, Bias Corrected (AICc)

AICc = log σ̂²k + (n + k)/(n − k − 2), (2.20)

where σ̂²k is given by (2.18), k is the number of parameters in the model, and n is the sample size.

We may also derive a correction term based on Bayesian arguments, as in Schwarz (1978), which leads to the following.

Definition 2.3 Bayesian Information Criterion (BIC)

BIC = log σ̂²k + (k log n)/n, (2.21)

using the same notation as in Definition 2.2.

BIC is also called the Schwarz Information Criterion (SIC); see also Rissanen (1978) for an approach yielding the same statistic based on a minimum description length argument. Various simulation studies have tended to verify that BIC does well at getting the correct order in large samples, whereas AICc tends to be superior in smaller samples where the relative number of parameters is large; see McQuarrie and Tsai (1998) for detailed comparisons. In fitting regression models, two measures that have been used in the past are adjusted R-squared, which is essentially s²w, and Mallows Cp, Mallows (1973), which we do not consider in this context.
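The three criteria in (2.19)–(2.21) are easy to compute for any lm() fit. The helper below is a minimal sketch written for this note, not a function from the text; it takes k to be the number of regression coefficients, as in Table 2.2, so its values differ from R's AIC() only by terms that are constant across models.

ic = function(fit){
  n = length(resid(fit))
  k = length(coef(fit))                       # number of regression coefficients
  sig2 = sum(resid(fit)^2)/n                  # maximum likelihood variance (2.18)
  c(AIC  = log(sig2) + (n + 2*k)/n,           # (2.19)
    AICc = log(sig2) + (n + k)/(n - k - 2),   # (2.20)
    BIC  = log(sig2) + k*log(n)/n)            # (2.21)
}
# e.g., ic(lm(gtemp ~ time(gtemp))) evaluates the criteria for the trend fit of Example 2.1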

Example 2.2 Pollution, Temperature and Mortality

The data shown in Figure 2.2 are extracted series from a study by Shumway et al. (1988) of the possible effects of temperature and pollution on weekly mortality in Los Angeles County. Note the strong seasonal components in all of the series, corresponding to winter-summer variations and the downward trend in the cardiovascular mortality over the 10-year period.

A scatterplot matrix, shown in Figure 2.3, indicates a possible linear relation between mortality and the pollutant particulates and a possible relation to temperature. Note the curvilinear shape of the temperature mortality curve, indicating that higher temperatures as well as lower temperatures are associated with increases in cardiovascular mortality.

Based on the scatterplot matrix, we entertain, tentatively, four models where Mt denotes cardiovascular mortality, Tt denotes temperature and Pt denotes the particulate levels. They are


Fig. 2.2. Average weekly cardiovascular mortality (top), temperature (middle) and particulate pollution (bottom) in Los Angeles County. There are 508 six-day smoothed averages obtained by filtering daily values over the 10 year period 1970-1979.

Mt = β1 + β2t + wt (2.22)
Mt = β1 + β2t + β3(Tt − T·) + wt (2.23)
Mt = β1 + β2t + β3(Tt − T·) + β4(Tt − T·)² + wt (2.24)
Mt = β1 + β2t + β3(Tt − T·) + β4(Tt − T·)² + β5Pt + wt (2.25)

where we adjust temperature for its mean, T· = 74.6, to avoid scaling problems. It is clear that (2.22) is a trend only model, (2.23) is linear temperature, (2.24) is curvilinear temperature and (2.25) is curvilinear temperature and pollution. We summarize some of the statistics given for this particular case in Table 2.2. The values of R² were computed by noting that SSE1 = 50,687 using (2.17).

We note that each model does substantially better than the one before it and that the model including temperature, temperature squared, and particulates does the best, accounting for some 60% of the variability and with the best value for AIC and BIC (because of the large sample size, AIC and AICc are nearly the same).



Fig. 2.3. Scatterplot matrix showing plausible relations between mortality, temperature, and pollution.

Table 2.2. Summary Statistics for Mortality Models

Model    k   SSE      df    MSE    R²    AIC    BIC
(2.22)   2   40,020   506   79.0   .21   5.38   5.40
(2.23)   3   31,413   505   62.2   .38   5.14   5.17
(2.24)   4   27,985   504   55.5   .45   5.03   5.07
(2.25)   5   20,508   503   40.8   .60   4.72   4.77

Note that one can compare any two models using the residual sums of squares and (2.14). Hence, a model with only trend could be compared to the full model using q = 5, r = 2, n = 508, so

F3,503 = [(40,020 − 20,508)/3] / [20,508/503] = 160,


which exceeds F3,503(.001) = 5.51. We obtain the best prediction model,

M̂t = 81.59 − .027(.002) t − .473(.032)(Tt − 74.6) + .023(.003)(Tt − 74.6)² + .255(.019) Pt,

for mortality, where the standard errors, computed from (2.9)-(2.11), are given in parentheses. As expected, a negative trend is present in time as well as a negative coefficient for adjusted temperature. The quadratic effect of temperature can clearly be seen in the scatterplots of Figure 2.3. Pollution weights positively and can be interpreted as the incremental contribution to daily deaths per unit of particulate pollution. It would still be essential to check the residuals ŵt = Mt − M̂t for autocorrelation (of which there is a substantial amount), but we defer this question to §5.6 when we discuss regression with correlated errors.

Below is the R code to plot the series, display the scatterplot matrix, fit the final regression model (2.25), and compute the corresponding values of AIC, AICc and BIC.2 Finally, the use of na.action in lm() is to retain the time series attributes for the residuals and fitted values.

par(mfrow=c(3,1))
plot(cmort, main="Cardiovascular Mortality", xlab="", ylab="")
plot(tempr, main="Temperature", xlab="", ylab="")
plot(part, main="Particulates", xlab="", ylab="")
dev.new()  # open a new graphic device for the scatterplot matrix
pairs(cbind(Mortality=cmort, Temperature=tempr, Particulates=part))
temp = tempr - mean(tempr)  # center temperature
temp2 = temp^2
trend = time(cmort)  # time
fit = lm(cmort~ trend + temp + temp2 + part, na.action=NULL)
summary(fit)  # regression results
summary(aov(fit))  # ANOVA table (compare to next line)
summary(aov(lm(cmort~cbind(trend, temp, temp2, part))))  # Table 2.1
num = length(cmort)  # sample size
AIC(fit)/num - log(2*pi)  # AIC
AIC(fit, k=log(num))/num - log(2*pi)  # BIC
(AICc = log(sum(resid(fit)^2)/num) + (num+5)/(num-5-2))  # AICc
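As a check on the F-statistic computed above, the reduced (trend only) and full models can also be compared directly with anova(); this brief sketch assumes the objects trend, temp, temp2, and part created in the code above are still in the workspace.

fit.trend = lm(cmort ~ trend, na.action=NULL)                        # model (2.22)
fit.full = lm(cmort ~ trend + temp + temp2 + part, na.action=NULL)   # model (2.25)
anova(fit.trend, fit.full)   # F on 3 and 503 df, approximately 160 as reported above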

As previously mentioned, it is possible to include lagged variables in time series regression models and we will continue to discuss this type of problem throughout the text. This concept is explored further in Problems 2.2 and 2.11. The following is a simple example of lagged regression.

2 The easiest way to extract AIC and BIC from an lm() run in R is to use the command AIC(). Our definitions differ from R by terms that do not change from model to model. In the example, we show how to obtain (2.19) and (2.21) from the R output. It is more difficult to obtain AICc.


Example 2.3 Regression With Lagged Variables

In Example 1.25, we discovered that the Southern Oscillation Index (SOI) measured at time t − 6 months is associated with the Recruitment series at time t, indicating that the SOI leads the Recruitment series by six months. Although there is evidence that the relationship is not linear (this is discussed further in Example 2.7), we may consider the following regression,

Rt = β1 + β2St−6 + wt, (2.26)

where Rt denotes Recruitment for month t and St−6 denotes SOI six months prior. Assuming the wt sequence is white, the fitted model is

R̂t = 65.79 − 44.28(2.78) St−6 (2.27)

with σ̂w = 22.5 on 445 degrees of freedom. This result indicates the strong predictive ability of SOI for Recruitment six months in advance. Of course, it is still essential to check the model assumptions, but again we defer this until later.

Performing lagged regression in R is a little difficult because the series must be aligned prior to running the regression. The easiest way to do this is to create a data frame that we call fish using ts.intersect, which aligns the lagged series.

fish = ts.intersect(rec, soiL6=lag(soi,-6), dframe=TRUE)
summary(lm(rec~soiL6, data=fish, na.action=NULL))

2.3 Exploratory Data Analysis

In general, it is necessary for time series data to be stationary, so averaging lagged products over time, as in the previous section, will be a sensible thing to do. With time series data, it is the dependence between the values of the series that is important to measure; we must, at least, be able to estimate autocorrelations with precision. It would be difficult to measure that dependence if the dependence structure is not regular or is changing at every time point. Hence, to achieve any meaningful statistical analysis of time series data, it will be crucial that, if nothing else, the mean and the autocovariance functions satisfy the conditions of stationarity (for at least some reasonable stretch of time) stated in Definition 1.7. Often, this is not the case, and we will mention some methods in this section for playing down the effects of nonstationarity so the stationary properties of the series may be studied.

A number of our examples came from clearly nonstationary series. The Johnson & Johnson series in Figure 1.1 has a mean that increases exponentially over time, and the increase in the magnitude of the fluctuations around this trend causes changes in the covariance function; the variance of the process, for example, clearly increases as one progresses over the length of the series. Also, the global temperature series shown in Figure 1.2 contains some evidence of a trend over time; human-induced global warming advocates seize on this as empirical evidence to advance their hypothesis that temperatures are increasing.

Perhaps the easiest form of nonstationarity to work with is the trend stationary model wherein the process has stationary behavior around a trend. We may write this type of model as

xt = µt + yt (2.28)

where xt are the observations, µt denotes the trend, and yt is a stationary process. Quite often, strong trend, µt, will obscure the behavior of the stationary process, yt, as we shall see in numerous examples. Hence, there is some advantage to removing the trend as a first step in an exploratory analysis of such time series. The steps involved are to obtain a reasonable estimate of the trend component, say µ̂t, and then work with the residuals

ŷt = xt − µ̂t. (2.29)

Consider the following example.

Example 2.4 Detrending Global Temperature

Here we suppose the model is of the form of (2.28),

xt = µt + yt,

where, as we suggested in the analysis of the global temperature data presented in Example 2.1, a straight line might be a reasonable model for the trend, i.e.,

µt = β1 + β2 t.

In that example, we estimated the trend using ordinary least squares3 and found

µ̂t = −11.2 + .006 t.

Figure 2.1 shows the data with the estimated trend line superimposed. To obtain the detrended series we simply subtract µ̂t from the observations, xt, to obtain the detrended series

ŷt = xt + 11.2 − .006 t.

The top graph of Figure 2.4 shows the detrended series. Figure 2.5 shows the ACF of the original data (top panel) as well as the ACF of the detrended data (middle panel).

3 Because the error term, yt, is not assumed to be iid, the reader may feel that weighted least squares is called for in this case. The problem is, we do not know the behavior of yt and that is precisely what we are trying to assess at this stage. A notable result by Grenander and Rosenblatt (1957, Ch 7), however, is that under mild conditions on yt, for polynomial regression or periodic regression, asymptotically, ordinary least squares is equivalent to weighted least squares.


Fig. 2.4. Detrended (top) and differenced (bottom) global temperature series. The original data are shown in Figures 1.2 and 2.1.

To detrend the series in R, use the following commands. We also show how to difference and plot the differenced data; we discuss differencing after this example. In addition, we show how to generate the sample ACFs displayed in Figure 2.5.

fit = lm(gtemp~time(gtemp), na.action=NULL)  # regress gtemp on time
par(mfrow=c(2,1))
plot(resid(fit), type="o", main="detrended")
plot(diff(gtemp), type="o", main="first difference")
par(mfrow=c(3,1))  # plot ACFs
acf(gtemp, 48, main="gtemp")
acf(resid(fit), 48, main="detrended")
acf(diff(gtemp), 48, main="first difference")

In Example 1.11 and the corresponding Figure 1.10 we saw that a random walk might also be a good model for trend. That is, rather than modeling trend as fixed (as in Example 2.4), we might model trend as a stochastic component using the random walk with drift model,

µt = δ + µt−1 + wt, (2.30)


Fig. 2.5. Sample ACFs of the global temperature (top), and of the detrended (middle) and the differenced (bottom) series.

where wt is white noise and is independent of yt. If the appropriate model is (2.28), then differencing the data, xt, yields a stationary process; that is,

xt − xt−1 = (µt + yt)− (µt−1 + yt−1) (2.31)

= δ + wt + yt − yt−1.

It is easy to show zt = yt − yt−1 is stationary using footnote 3 of Chapter 1 on page 20. That is, because yt is stationary,

γz(h) = cov(zt+h, zt) = cov(yt+h − yt+h−1, yt − yt−1)
      = 2γy(h) − γy(h + 1) − γy(h − 1)

is independent of time; we leave it as an exercise (Problem 2.7) to show that xt − xt−1 in (2.31) is stationary.


One advantage of differencing over detrending to remove trend is that no parameters are estimated in the differencing operation. One disadvantage, however, is that differencing does not yield an estimate of the stationary process yt as can be seen in (2.31). If an estimate of yt is essential, then detrending may be more appropriate. If the goal is to coerce the data to stationarity, then differencing may be more appropriate. Differencing is also a viable tool if the trend is fixed, as in Example 2.4. That is, e.g., if µt = β1 + β2 t in the model (2.28), differencing the data produces stationarity (see Problem 2.6):

xt − xt−1 = (µt + yt)− (µt−1 + yt−1) = β2 + yt − yt−1.
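A brief simulated illustration of the display above may help; the seed, the AR(1) choice for yt, and the trend coefficients below are assumptions made only for this sketch.

set.seed(90210)                      # arbitrary seed
tt = 1:200
y = arima.sim(list(ar=.5), n=200)    # a stationary series playing the role of y_t
x = 1 + .02*tt + y                   # trend stationary series as in (2.28)
plot(diff(x), type="o")              # fluctuates around beta_2 = .02, no visible trend
acf(diff(x), 20)                     # resembles the ACF of y_t - y_{t-1}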

Because differencing plays a central role in time series analysis, it receives its own notation. The first difference is denoted as

∇xt = xt − xt−1. (2.32)

As we have seen, the first difference eliminates a linear trend. A second difference, that is, the difference of (2.32), can eliminate a quadratic trend, and so on. In order to define higher differences, we need a variation in notation that we will use often in our discussion of ARIMA models in Chapter 3.

Definition 2.4 We define the backshift operator by

Bxt = xt−1

and extend it to powers B²xt = B(Bxt) = Bxt−1 = xt−2, and so on. Thus,

B^k xt = xt−k. (2.33)

It is clear that we may then rewrite (2.32) as

∇xt = (1−B)xt, (2.34)

and we may extend the notion further. For example, the second difference becomes

∇²xt = (1 − B)²xt = (1 − 2B + B²)xt = xt − 2xt−1 + xt−2

by the linearity of the operator. To check, just take the difference of the first difference ∇(∇xt) = ∇(xt − xt−1) = (xt − xt−1) − (xt−1 − xt−2).

Definition 2.5 Differences of order d are defined as

∇^d = (1 − B)^d, (2.35)

where we may expand the operator (1 − B)^d algebraically to evaluate for higher integer values of d. When d = 1, we drop it from the notation.
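In R, differences of any order are available through diff(); the sketch below, a minimal check using the gtemp series from the earlier examples, verifies that the second difference agrees with the expanded operator (1 − 2B + B²).

x = as.numeric(gtemp)                         # global temperature series as a plain vector
d2 = diff(x, differences=2)                   # the second difference, (1-B)^2 x_t
n = length(x)
byhand = x[3:n] - 2*x[2:(n-1)] + x[1:(n-2)]   # x_t - 2x_{t-1} + x_{t-2}
all.equal(d2, byhand)                         # TRUE: the two versions coincide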


The first difference (2.32) is an example of a linear filter applied to eliminate a trend. Other filters, formed by averaging values near xt, can produce adjusted series that eliminate other kinds of unwanted fluctuations, as in Chapter 3. The differencing technique is an important component of the ARIMA model of Box and Jenkins (1970) (see also Box et al., 1994), to be discussed in Chapter 3.

Example 2.5 Differencing Global Temperature

The first difference of the global temperature series, also shown in Figure 2.4, produces different results than removing trend by detrending via regression. For example, the differenced series does not contain the long middle cycle we observe in the detrended series. The ACF of this series is also shown in Figure 2.5. In this case it appears that the differenced process shows minimal autocorrelation, which may imply the global temperature series is nearly a random walk with drift. It is interesting to note that if the series is a random walk with drift, the mean of the differenced series, which is an estimate of the drift, is about .0066 (but with a large standard error):

mean(diff(gtemp))  # = 0.00659 (drift)
sd(diff(gtemp))/sqrt(length(diff(gtemp)))  # = 0.00966 (SE)

An alternative to differencing is a less-severe operation that still assumes stationarity of the underlying time series. This alternative, called fractional differencing, extends the notion of the difference operator (2.35) to fractional powers −.5 < d < .5, which still define stationary processes. Granger and Joyeux (1980) and Hosking (1981) introduced long memory time series, which corresponds to the case when 0 < d < .5. This model is often used for environmental time series arising in hydrology. We will discuss long memory processes in more detail in §5.2.

Often, obvious aberrations are present that can contribute nonstationary as well as nonlinear behavior in observed time series. In such cases, transformations may be useful to equalize the variability over the length of a single series. A particularly useful transformation is

yt = log xt, (2.36)

which tends to suppress larger fluctuations that occur over portions of the series where the underlying values are larger. Other possibilities are power transformations in the Box–Cox family of the form

yt = (xt^λ − 1)/λ   if λ ≠ 0,
yt = log xt          if λ = 0. (2.37)

Methods for choosing the power λ are available (see Johnson and Wichern, 1992, §4.7) but we do not pursue them here. Often, transformations are also used to improve the approximation to normality or to improve linearity in predicting the value of one series from another.
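A small sketch of (2.37) follows; boxcox_tr is an illustrative helper written for this note, not a function from the text, and the varve series it is applied to is the one examined in Example 2.6 below.

boxcox_tr = function(x, lambda){     # Box-Cox transformation (2.37)
  if (lambda == 0) log(x) else (x^lambda - 1)/lambda
}
par(mfrow=c(2,1))
plot(boxcox_tr(varve, 0), type="l", ylab="", main="lambda = 0 (log)")
plot(boxcox_tr(varve, .5), type="l", ylab="", main="lambda = .5")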


Fig. 2.6. Glacial varve thicknesses (top) from Massachusetts for n = 634 years compared with log transformed thicknesses (bottom).

Example 2.6 Paleoclimatic Glacial Varves

Melting glaciers deposit yearly layers of sand and silt during the spring melting seasons, which can be reconstructed yearly over a period ranging from the time deglaciation began in New England (about 12,600 years ago) to the time it ended (about 6,000 years ago). Such sedimentary deposits, called varves, can be used as proxies for paleoclimatic parameters, such as temperature, because, in a warm year, more sand and silt are deposited from the receding glacier. Figure 2.6 shows the thicknesses of the yearly varves collected from one location in Massachusetts for 634 years, beginning 11,834 years ago. For further information, see Shumway and Verosub (1992). Because the variation in thicknesses increases in proportion to the amount deposited, a logarithmic transformation could remove the nonstationarity observable in the variance as a function of time. Figure 2.6 shows the original and transformed varves, and it is clear that this improvement has occurred. We may also plot the histogram of the original and transformed data, as in Problem 2.8, to argue that the approximation to normality is improved. The ordinary first differences (2.34) are also computed in Problem 2.8, and we note that the first differences have a significant negative correlation at lag h = 1. Later, in Chapter 5, we will show that perhaps the varve series has long memory and will propose using fractional differencing.

Figure 2.6 was generated in R as follows:

par(mfrow=c(2,1))
plot(varve, main="varve", ylab="")
plot(log(varve), main="log(varve)", ylab="")

Next, we consider another preliminary data processing technique that is used for the purpose of visualizing the relations between series at different lags, namely, scatterplot matrices. In the definition of the ACF, we are essentially interested in relations between xt and xt−h; the autocorrelation function tells us whether a substantial linear relation exists between the series and its own lagged values. The ACF gives a profile of the linear correlation at all possible lags and shows which values of h lead to the best predictability. The restriction of this idea to linear predictability, however, may mask a possible nonlinear relation between current values, xt, and past values, xt−h. This idea extends to two series where one may be interested in examining scatterplots of yt versus xt−h.

Example 2.7 Scatterplot Matrices, SOI and Recruitment

To check for nonlinear relations of this form, it is convenient to display a lagged scatterplot matrix, as in Figure 2.7, that displays values of the SOI, St, on the vertical axis plotted against St−h on the horizontal axis. The sample autocorrelations are displayed in the upper right-hand corner and superimposed on the scatterplots are locally weighted scatterplot smoothing (lowess) lines that can be used to help discover any nonlinearities. We discuss smoothing in the next section, but for now, think of lowess as a robust method for fitting nonlinear regression.

In Figure 2.7, we notice that the lowess fits are approximately linear, so that the sample autocorrelations are meaningful. Also, we see strong positive linear relations at lags h = 1, 2, 11, 12, that is, between St and St−1, St−2, St−11, St−12, and a negative linear relation at lags h = 6, 7. These results match up well with peaks noticed in the ACF in Figure 1.14.

Similarly, we might want to look at values of one series, say Recruitment, denoted Rt plotted against another series at various lags, say the SOI, St−h, to look for possible nonlinear relations between the two series. Because, for example, we might wish to predict the Recruitment series, Rt, from current or past values of the SOI series, St−h, for h = 0, 1, 2, ... it would be worthwhile to examine the scatterplot matrix. Figure 2.8 shows the lagged scatterplot of the Recruitment series Rt on the vertical axis plotted against the SOI index St−h on the horizontal axis. In addition, the figure exhibits the sample cross-correlations as well as lowess fits.

Figure 2.8 shows a fairly strong nonlinear relationship between Recruitment, Rt, and the SOI series at St−5, St−6, St−7, St−8, indicating the SOI series tends to lead the Recruitment series and the coefficients are negative, implying that increases in the SOI lead to decreases in the Recruitment.


Fig. 2.7. Scatterplot matrix relating current SOI values, St, to past SOI values, St−h, at lags h = 1, 2, ..., 12. The values in the upper right corner are the sample autocorrelations and the lines are a lowess fit.

The nonlinearity observed in the scatterplots (with the help of the superimposed lowess fits) indicates that the behavior between Recruitment and the SOI is different for positive values of SOI than for negative values of SOI.

Simple scatterplot matrices for one series can be obtained in R using the lag.plot command. Figures 2.7 and 2.8 may be reproduced using the following scripts provided with the text (see Appendix R for details):

lag.plot1(soi, 12)  # Fig 2.7
lag.plot2(soi, rec, 8)  # Fig 2.8
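For a single series, base R's lag.plot gives a similar, though unsmoothed, display; the call below is a sketch of one reasonable set of arguments rather than the layout used for Figure 2.7.

lag.plot(soi, lags=12, layout=c(3,4), diag=FALSE)   # lagged scatterplots of SOI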

As a final exploratory tool, we discuss assessing periodic behavior in time series data using regression analysis and the periodogram; this material may be thought of as an introduction to spectral analysis, which we discuss in detail in Chapter 4.


Fig. 2.8. Scatterplot matrix of the Recruitment series, Rt, on the vertical axis plotted against the SOI series, St−h, on the horizontal axis at lags h = 0, 1, . . . , 8. The values in the upper right corner are the sample cross-correlations and the lines are a lowess fit.

In Example 1.12, we briefly discussed the problem of identifying cyclic or periodic signals in time series. A number of the time series we have seen so far exhibit periodic behavior. For example, the data from the pollution study example shown in Figure 2.2 exhibit strong yearly cycles. Also, the Johnson & Johnson data shown in Figure 1.1 make one cycle every year (four quarters) on top of an increasing trend and the speech data in Figure 1.3 is highly repetitive. The monthly SOI and Recruitment series in Figure 1.6 show strong yearly cycles, but hidden in the series are clues to the El Niño cycle.


Example 2.8 Using Regression to Discover a Signal in Noise

In Example 1.12, we generated n = 500 observations from the model

xt = A cos(2πωt+ φ) + wt, (2.38)

where ω = 1/50, A = 2, φ = .6π, and σw = 5; the data are shown on the bottom panel of Figure 1.11 on page 16. At this point we assume the frequency of oscillation ω = 1/50 is known, but A and φ are unknown parameters. In this case the parameters appear in (2.38) in a nonlinear way, so we use a trigonometric identity4 and write

A cos(2πωt+ φ) = β1 cos(2πωt) + β2 sin(2πωt),

where β1 = A cos(φ) and β2 = −A sin(φ). Now the model (2.38) can be written in the usual linear regression form given by (no intercept term is needed here)

xt = β1 cos(2πt/50) + β2 sin(2πt/50) + wt. (2.39)

Using linear regression on the generated data, the fitted model is

x̂t = −.71(.30) cos(2πt/50) − 2.55(.30) sin(2πt/50) (2.40)

with σ̂w = 4.68, where the values in parentheses are the standard errors. We note the actual values of the coefficients for this example are β1 = 2 cos(.6π) = −.62 and β2 = −2 sin(.6π) = −1.90. Because the parameter estimates are significant and close to the actual values, it is clear that we are able to detect the signal in the noise using regression, even though the signal appears to be obscured by the noise in the bottom panel of Figure 1.11. Figure 2.9 shows data generated by (2.38) with the fitted line, (2.40), superimposed.

To reproduce the analysis and Figure 2.9 in R, use the following commands:

set.seed(1000)  # so you can reproduce these results
x = 2*cos(2*pi*1:500/50 + .6*pi) + rnorm(500,0,5)
z1 = cos(2*pi*1:500/50); z2 = sin(2*pi*1:500/50)
summary(fit <- lm(x~0+z1+z2))  # zero to exclude the intercept
plot.ts(x, lty="dashed")
lines(fitted(fit), lwd=2)

Example 2.9 Using the Periodogram to Discover a Signal in Noise

The analysis in Example 2.8 may seem like cheating because we assumed we knew the value of the frequency parameter ω. If we do not know ω, we could try to fit the model (2.38) using nonlinear regression with ω as a parameter. Another method is to try various values of ω in a systematic way.

4 cos(α± β) = cos(α) cos(β)∓ sin(α) sin(β).


Fig. 2.9. Data generated by (2.38) [dashed line] with the fitted [solid] line, (2.40), superimposed.

Using the regression results of §2.2, we can show the estimated regression coefficients in Example 2.8 take on the special form given by

β̂1 = ∑_{t=1}^n xt cos(2πt/50) / ∑_{t=1}^n cos²(2πt/50) = (2/n) ∑_{t=1}^n xt cos(2πt/50); (2.41)

β̂2 = ∑_{t=1}^n xt sin(2πt/50) / ∑_{t=1}^n sin²(2πt/50) = (2/n) ∑_{t=1}^n xt sin(2πt/50). (2.42)

This suggests looking at all possible regression parameter estimates,5 say

β̂1(j/n) = (2/n) ∑_{t=1}^n xt cos(2πt j/n); (2.43)

β̂2(j/n) = (2/n) ∑_{t=1}^n xt sin(2πt j/n), (2.44)

where n = 500 and j = 1, . . . , n/2 − 1, and inspecting the results for large values. For the endpoints, j = 0 and j = n/2, we have β̂1(0) = n^{-1} ∑_{t=1}^n xt and β̂1(1/2) = n^{-1} ∑_{t=1}^n (−1)^t xt, and β̂2(0) = β̂2(1/2) = 0. For this particular example, the values calculated in (2.41) and (2.42) are

β̂1(10/500) and β̂2(10/500).

5 In the notation of §2.2, the estimates are of the form ∑_{t=1}^n xt zt / ∑_{t=1}^n z²t where zt = cos(2πtj/n) or zt = sin(2πtj/n). In this setup, unless j = 0 or j = n/2 if n is even, ∑_{t=1}^n z²t = n/2; see Problem 2.10.


Fig. 2.10. The scaled periodogram, (2.45), of the 500 observations generated by (2.38); the data are displayed in Figures 1.11 and 2.9.

By doing this, we have regressed a series, xt, of length n using n regression parameters, so that we will have a perfect fit. The point, however, is that if the data contain any cyclic behavior we are likely to catch it by performing these saturated regressions.

Next, note that the regression coefficients β̂1(j/n) and β̂2(j/n), for each j, are essentially measuring the correlation of the data with a sinusoid oscillating at j cycles in n time points.6 Hence, an appropriate measure of the presence of a frequency of oscillation of j cycles in n time points in the data would be

P(j/n) = β̂²1(j/n) + β̂²2(j/n), (2.45)

which is basically a measure of squared correlation. The quantity (2.45) is sometimes called the periodogram, but we will call P(j/n) the scaled periodogram and we will investigate its properties in Chapter 4. Figure 2.10 shows the scaled periodogram for the data generated by (2.38), and it easily discovers the periodic component with frequency ω = .02 = 10/500 even though it is difficult to visually notice that component in Figure 1.11 due to the noise.
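As a hedged check of (2.43)–(2.45), the sketch below computes the scaled periodogram by brute force from the regression sums, assuming the series x generated in Example 2.8 is still in the workspace; the peak should sit near the frequency 10/500 = .02 used to generate the data.

n = 500
scaledP = sapply(1:(n/2 - 1), function(j){
  b1 = (2/n)*sum(x*cos(2*pi*(1:n)*j/n))   # (2.43)
  b2 = (2/n)*sum(x*sin(2*pi*(1:n)*j/n))   # (2.44)
  b1^2 + b2^2                             # scaled periodogram ordinate (2.45)
})
plot((1:(n/2 - 1))/n, scaledP, type="l", xlab="Frequency", ylab="Scaled Periodogram")
(1:(n/2 - 1))[which.max(scaledP)]/n       # frequency of the largest ordinate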

Finally, we mention that it is not necessary to run a large regression

xt = ∑_{j=0}^{n/2} [ β̂1(j/n) cos(2πtj/n) + β̂2(j/n) sin(2πtj/n) ] (2.46)

to obtain the values of β̂1(j/n) and β̂2(j/n) [with β̂2(0) = β̂2(1/2) = 0] because they can be computed quickly if n (assumed even here) is a highly composite integer.

6 Sample correlations are of the form ∑_t xt zt / ( ∑_t x²t ∑_t z²t )^{1/2}.


There is no error in (2.46) because there are n observations and n parameters; the regression fit will be perfect. The discrete Fourier transform (DFT) is a complex-valued weighted average of the data given by

d(j/n) = n^{-1/2} ∑_{t=1}^n xt exp(−2πitj/n)
       = n^{-1/2} ( ∑_{t=1}^n xt cos(2πtj/n) − i ∑_{t=1}^n xt sin(2πtj/n) ) (2.47)

where the frequencies j/n are called the Fourier or fundamental frequencies. Because of a large number of redundancies in the calculation, (2.47) may be computed quickly using the fast Fourier transform (FFT)7, which is available in many computing packages such as Matlab®, S-PLUS® and R. Note that8

|d(j/n)|² = (1/n) ( ∑_{t=1}^n xt cos(2πtj/n) )² + (1/n) ( ∑_{t=1}^n xt sin(2πtj/n) )² (2.48)

and it is this quantity that is called the periodogram; we will write

I(j/n) = |d(j/n)|².

We may calculate the scaled periodogram, (2.45), using the periodogram as

P(j/n) = (4/n) I(j/n). (2.49)

We will discuss this approach in more detail and provide examples with data in Chapter 4.

Figure 2.10 can be created in R using the following commands (and the data already generated in x):

I = abs(fft(x))^2/500  # the periodogram
P = (4/500)*I[1:250]  # the scaled periodogram
f = 0:249/500  # frequencies
plot(f, P, type="l", xlab="Frequency", ylab="Scaled Periodogram")

2.4 Smoothing in the Time Series Context

In §1.4, we introduced the concept of smoothing a time series, and in Example 1.9, we discussed using a moving average to smooth white noise. This method is useful in discovering certain traits in a time series, such as long-term trend and seasonal components.

7 Different packages scale the FFT differently; consult the documentation. R calculates (2.47) without scaling by n^{-1/2}.

8 If z = a − ib is complex, then |z|² = z z̄ = (a − ib)(a + ib) = a² + b².


Fig. 2.11. The weekly cardiovascular mortality series discussed in Example 2.2 smoothed using a five-week moving average and a 53-week moving average.

In particular, if xt represents the observations, then

mt = ∑_{j=−k}^{k} aj xt−j, (2.50)

where aj = a−j ≥ 0 and ∑_{j=−k}^{k} aj = 1, is a symmetric moving average of the data.

Example 2.10 Moving Average Smoother

For example, Figure 2.11 shows the weekly mortality series discussed in Example 2.2, a five-point moving average (which is essentially a monthly average with k = 2) that helps bring out the seasonal component and a 53-point moving average (which is essentially a yearly average with k = 26) that helps bring out the (negative) trend in cardiovascular mortality. In both cases, the weights, a−k, . . . , a0, . . . , ak, we used were all the same, and equal to 1/(2k + 1).9

To reproduce Figure 2.11 in R:

ma5 = filter(cmort, sides=2, rep(1,5)/5)
ma53 = filter(cmort, sides=2, rep(1,53)/53)
plot(cmort, type="p", ylab="mortality")
lines(ma5); lines(ma53)

9 Sometimes, the end weights, a−k and ak, are set equal to half the value of the other weights.
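A sketch of the halved end-weight variant mentioned in the footnote, here for a 53-point average (k = 26); the object names are illustrative.

k = 26
wgts = c(.5, rep(1, 2*k - 1), .5)/(2*k)   # weights sum to one, ends carry half weight
ma53h = filter(cmort, sides=2, wgts)
plot(cmort, type="p", ylab="mortality")
lines(ma53h)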


Fig. 2.12. The weekly cardiovascular mortality series with a cubic trend and cubic trend plus periodic regression.

Many other techniques are available for smoothing time series data based on methods from scatterplot smoothers. The general setup for a time plot is

xt = ft + yt, (2.51)

where ft is some smooth function of time, and yt is a stationary process. We may think of the moving average smoother mt, given in (2.50), as an estimator of ft. An obvious choice for ft in (2.51) is polynomial regression

ft = β0 + β1t + · · · + βp t^p. (2.52)

We have seen the results of a linear fit on the global temperature data in Example 2.1. For periodic data, one might employ periodic regression

ft = α0 + α1 cos(2πω1t) + β1 sin(2πω1t)

+ · · ·+ αp cos(2πωpt) + βp sin(2πωpt), (2.53)

where ω1, . . . , ωp are distinct, specified frequencies. In addition, one might consider combining (2.52) and (2.53). These smoothers can be applied using classical linear regression.

Example 2.11 Polynomial and Periodic Regression Smoothers

Figure 2.12 shows the weekly mortality series with an estimated (via ordinary least squares) cubic smoother

f̂t = β̂0 + β̂1t + β̂2t² + β̂3t³


superimposed to emphasize the trend, and an estimated (via ordinary least squares) cubic smoother plus a periodic regression

f̂t = β̂0 + β̂1t + β̂2t² + β̂3t³ + α̂1 cos(2πt/52) + α̂2 sin(2πt/52)

superimposed to emphasize trend and seasonality.

The R commands for this example are as follows (we note that the sampling rate is 1/52, so that wk below is essentially t/52).

wk = time(cmort) - mean(time(cmort))
wk2 = wk^2; wk3 = wk^3
cs = cos(2*pi*wk); sn = sin(2*pi*wk)
reg1 = lm(cmort~wk + wk2 + wk3, na.action=NULL)
reg2 = lm(cmort~wk + wk2 + wk3 + cs + sn, na.action=NULL)
plot(cmort, type="p", ylab="mortality")
lines(fitted(reg1)); lines(fitted(reg2))

Modern regression techniques can be used to fit general smoothers to the pairs of points (t, xt) where the estimate of ft is smooth. Many of the techniques can easily be applied to time series data using the R or S-PLUS statistical packages; see Venables and Ripley (1994, Chapter 10) for details on applying these methods in S-PLUS (R is similar). A problem with the techniques used in Example 2.11 is that they assume ft is the same function over the range of time, t; we might say that the technique is global. The moving average smoothers in Example 2.10 fit the data better because the technique is local; that is, moving average smoothers allow for the possibility that ft is a different function over time. We describe some other local methods in the following examples.

Example 2.12 Kernel Smoothing

Kernel smoothing is a moving average smoother that uses a weight function, or kernel, to average the observations. Figure 2.13 shows kernel smoothing of the mortality series, where ft in (2.51) is estimated by

f̂t = ∑_{i=1}^n wi(t) xi, (2.54)

where

wi(t) = K((t − i)/b) / ∑_{j=1}^n K((t − j)/b) (2.55)

are the weights and K(·) is a kernel function. This estimator, which was originally explored by Parzen (1962) and Rosenblatt (1956b), is often called the Nadaraya–Watson estimator (Watson, 1966); typically, the normal kernel, K(z) = (1/√(2π)) exp(−z²/2), is used. To implement this in R, use the ksmooth function. The wider the bandwidth, b, the smoother the result.


Fig. 2.13. Kernel smoothers of the mortality data.

In Figure 2.13, the values of b for this example were b = 5/52 (roughly weighted two to three week averages because b/2 is the inner quartile range of the kernel) for the seasonal component, and b = 104/52 = 2 (roughly weighted yearly averages) for the trend component.

Figure 2.13 can be reproduced in R (or S-PLUS) as follows.

plot(cmort, type="p", ylab="mortality")
lines(ksmooth(time(cmort), cmort, "normal", bandwidth=5/52))
lines(ksmooth(time(cmort), cmort, "normal", bandwidth=2))
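To see what ksmooth is doing, the hedged sketch below computes (2.54)–(2.55) directly with a normal kernel; nw is an illustrative helper, and its bandwidth b is on the observation-index scale rather than in years, so it is not numerically identical to the ksmooth bandwidths above.

nw = function(x, b){                  # Nadaraya-Watson estimate of f_t, (2.54)-(2.55)
  n = length(x)
  sapply(1:n, function(t){
    w = dnorm((t - 1:n)/b)            # kernel weights K((t-i)/b)
    sum(w*x)/sum(w)
  })
}
mhat = nw(as.numeric(cmort), 5)       # roughly comparable to the seasonal smoother above
plot(cmort, type="p", ylab="mortality")
lines(ts(mhat, start=start(cmort), frequency=frequency(cmort)))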

Example 2.13 Lowess and Nearest Neighbor Regression

Another approach to smoothing a time plot is nearest neighbor regression. The technique is based on k-nearest neighbors linear regression, wherein one uses the data {xt−k/2, . . . , xt, . . . , xt+k/2} to predict xt using linear regression; the result is f̂t. For example, Figure 2.14 shows cardiovascular mortality and the nearest neighbor method using the R (or S-PLUS) smoother supsmu. We used k = n/2 to estimate the trend and k = n/100 to estimate the seasonal component. In general, supsmu uses a variable window for smoothing (see Friedman, 1984), but it can be used for correlated data by fixing the smoothing window, as was done here.

Lowess is a method of smoothing that is rather complex, but the basic idea is close to nearest neighbor regression. Figure 2.14 shows smoothing of mortality using the R or S-PLUS function lowess (see Cleveland, 1979). First, a certain proportion of nearest neighbors to xt are included in a weighting scheme; values closer to xt in time get more weight. Then, a robust weighted regression is used to predict xt and obtain the smoothed estimate of ft. The larger the fraction of nearest neighbors included, the smoother the estimate f̂t will be.


Fig. 2.14. Nearest neighbor (supsmu) and locally weighted regression (lowess) smoothers of the mortality data.

In Figure 2.14, the smoother uses about two-thirds of the data to obtain an estimate of the trend component, and the seasonal component uses 2% of the data.

Figure 2.14 can be reproduced in R or S-PLUS as follows.

par(mfrow=c(2,1))
plot(cmort, type="p", ylab="mortality", main="nearest neighbor")
lines(supsmu(time(cmort), cmort, span=.5))
lines(supsmu(time(cmort), cmort, span=.01))
plot(cmort, type="p", ylab="mortality", main="lowess")
lines(lowess(cmort, f=.02)); lines(lowess(cmort, f=2/3))

Example 2.14 Smoothing Splines

An extension of polynomial regression is to first divide time t = 1, . . . , n,into k intervals, [t0 = 1, t1], [t1 + 1, t2] , . . . , [tk−1 + 1, tk = n]. The valuest0, t1, . . . , tk are called knots. Then, in each interval, one fits a regression ofthe form (2.52); typically, p = 3, and this is called cubic splines.

A related method is smoothing splines, which minimizes a compromise between the fit and the degree of smoothness given by
$$\sum_{t=1}^{n}\,[x_t - f_t]^2 + \lambda\int\bigl(f''_t\bigr)^2\,dt, \qquad (2.56)$$
where ft is a cubic spline with a knot at each t. The degree of smoothness is controlled by λ > 0. There is a relationship between smoothing splines and state space models, which is investigated in Problem 6.7.

Fig. 2.15. Smoothing splines fit to the mortality data.

In R, the smoothing parameter is called spar and it is monotonically related to λ; type ?smooth.spline to view the help file for details. Figure 2.15 shows smoothing spline fits on the mortality data using generalized cross-validation, which uses the data to “optimally” assess the smoothing parameter, for the seasonal component, and spar=1 for the trend. The figure can be reproduced in R as follows.

1 plot(cmort, type="p", ylab="mortality")

2 lines(smooth.spline(time(cmort), cmort))

3 lines(smooth.spline(time(cmort), cmort, spar=1))

Example 2.15 Smoothing One Series as a Function of Another

In addition to smoothing time plots, smoothing techniques can be applied to smoothing a time series as a function of another time series. In this example, we smooth the scatterplot of two contemporaneously measured time series, mortality as a function of temperature. In Example 2.2, we discovered a nonlinear relationship between mortality and temperature. Continuing along these lines, Figure 2.16 shows scatterplots of mortality, Mt, and temperature, Tt, along with Mt smoothed as a function of Tt using lowess and using smoothing splines. In both cases, mortality increases at extreme temperatures, but in an asymmetric way; mortality is higher at colder temperatures than at hotter temperatures. The minimum mortality rate seems to occur at approximately 80°F.

Fig. 2.16. Smoothers of mortality as a function of temperature using lowess and smoothing splines.

Figure 2.16 can be reproduced in R as follows.
1 par(mfrow=c(2,1), mar=c(3,2,1,0)+.5, mgp=c(1.6,.6,0))

2 plot(tempr, cmort, main="lowess", xlab="Temperature",

ylab="Mortality")

3 lines(lowess(tempr,cmort))

4 plot(tempr, cmort, main="smoothing splines", xlab="Temperature",

ylab="Mortality")

5 lines(smooth.spline(tempr, cmort))

As a final word of caution, the methods mentioned in this section may not take into account the fact that the data are serially correlated, and most of the techniques have been designed for independent observations. That is, for example, the smoothers shown in Figure 2.16 are calculated under the false assumption that the pairs (Mt, Tt) are iid pairs of observations. In addition, the degrees of smoothness used in the previous examples were chosen arbitrarily to bring out what might be considered obvious features in the data set.

Problems

Section 2.2

2.1 For the Johnson & Johnson data, say yt, shown in Figure 1.1, let xt = log(yt).

(a) Fit the regression model

xt = βt + α1Q1(t) + α2Q2(t) + α3Q3(t) + α4Q4(t) + wt

where Qi(t) = 1 if time t corresponds to quarter i = 1, 2, 3, 4, and zero otherwise. The Qi(t)'s are called indicator variables. We will assume for now that wt is a Gaussian white noise sequence. What is the interpretation of the parameters β, α1, α2, α3, and α4? (Detailed code is given in Appendix R on page 574.)
(b) What happens if you include an intercept term in the model in (a)?
(c) Graph the data, xt, and superimpose the fitted values, say x̂t, on the graph. Examine the residuals, xt − x̂t, and state your conclusions. Does it appear that the model fits the data well (do the residuals look white)?

2.2 For the mortality data examined in Example 2.2:

(a) Add another component to the regression in (2.25) that accounts for the particulate count four weeks prior; that is, add Pt−4 to the regression in (2.25). State your conclusion.
(b) Draw a scatterplot matrix of Mt, Tt, Pt and Pt−4 and then calculate the pairwise correlations between the series. Compare the relationship between Mt and Pt versus Mt and Pt−4.

2.3 Repeat the following exercise six times and then discuss the results. Generate a random walk with drift, (1.4), of length n = 100 with δ = .01 and σw = 1. Call the data xt for t = 1, . . . , 100. Fit the regression xt = βt + wt using least squares. Plot the data, the mean function (i.e., µt = .01 t) and the fitted line, x̂t = β̂ t, on the same graph. Discuss your results.

The following R code may be useful:

1 par(mfcol = c(3,2)) # set up graphics

2 for (i in 1:6){

3 x = ts(cumsum(rnorm(100,.01,1))) # the data

4 reg = lm(x~0+time(x), na.action=NULL) # the regression

5 plot(x) # plot data

6 lines(.01*time(x), col="red", lty="dashed") # plot mean

7 abline(reg, col="blue") } # plot regression line


2.4 Kullback-Leibler Information. Given the random vector y, we define the information for discriminating between two densities in the same family, indexed by a parameter θ, say f(y; θ1) and f(y; θ2), as
$$I(\theta_1; \theta_2) = \frac{1}{n}\,E_1\log\frac{f(y;\theta_1)}{f(y;\theta_2)}, \qquad (2.57)$$
where E1 denotes expectation with respect to the density determined by θ1. For the Gaussian regression model, the parameters are θ = (β′, σ²)′. Show that we obtain
$$I(\theta_1; \theta_2) = \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - \log\frac{\sigma_1^2}{\sigma_2^2} - 1\right) + \frac{1}{2}\,\frac{(\beta_1-\beta_2)'Z'Z(\beta_1-\beta_2)}{n\sigma_2^2} \qquad (2.58)$$
in that case.

2.5 Model Selection. Both selection criteria (2.19) and (2.20) are derived from information theoretic arguments, based on the well-known Kullback-Leibler discrimination information numbers (see Kullback and Leibler, 1951, Kullback, 1958). We give an argument due to Hurvich and Tsai (1989). We think of the measure (2.58) as measuring the discrepancy between the two densities, characterized by the parameter values θ′1 = (β′1, σ²1)′ and θ′2 = (β′2, σ²2)′. Now, if the true value of the parameter vector is θ1, we argue that the best model would be one that minimizes the discrepancy between the theoretical value and the sample, say I(θ1; θ̂). Because θ1 will not be known, Hurvich and Tsai (1989) considered finding an unbiased estimator for E1[I(β1, σ²1; β̂, σ̂²)], where
$$I(\beta_1, \sigma_1^2; \hat\beta, \hat\sigma^2) = \frac{1}{2}\left(\frac{\sigma_1^2}{\hat\sigma^2} - \log\frac{\sigma_1^2}{\hat\sigma^2} - 1\right) + \frac{1}{2}\,\frac{(\beta_1-\hat\beta)'Z'Z(\beta_1-\hat\beta)}{n\hat\sigma^2}$$
and β is a k × 1 regression vector. Show that
$$E_1[I(\beta_1, \sigma_1^2; \hat\beta, \hat\sigma^2)] = \frac{1}{2}\left(-\log\sigma_1^2 + E_1\log\hat\sigma^2 + \frac{n+k}{n-k-2} - 1\right), \qquad (2.59)$$
using the distributional properties of the regression coefficients and error variance. An unbiased estimator for E1 log σ̂² is log σ̂². Hence, we have shown that the expectation of the above discrimination information is as claimed. As models with differing dimensions k are considered, only the second and third terms in (2.59) will vary and we only need unbiased estimators for those two terms. This gives the form of AICc quoted in (2.20) in the chapter. You will need the two distributional results
$$\frac{n\hat\sigma^2}{\sigma_1^2} \sim \chi^2_{n-k} \quad\text{and}\quad \frac{(\hat\beta-\beta_1)'Z'Z(\hat\beta-\beta_1)}{\sigma_1^2} \sim \chi^2_{k}.$$
The two quantities are distributed independently as chi-squared distributions with the indicated degrees of freedom. If x ∼ χ²n, E(1/x) = 1/(n − 2).


Section 2.3

2.6 Consider a process consisting of a linear trend with an additive noise term consisting of independent random variables wt with zero means and variances σ²w, that is,

xt = β0 + β1t + wt,

where β0, β1 are fixed constants.

(a) Prove xt is nonstationary.
(b) Prove that the first difference series ∇xt = xt − xt−1 is stationary by finding its mean and autocovariance function.
(c) Repeat part (b) if wt is replaced by a general stationary process, say yt, with mean function µy and autocovariance function γy(h).

2.7 Show (2.31) is stationary.

2.8 The glacial varve record plotted in Figure 2.6 exhibits some nonstationarity that can be improved by transforming to logarithms and some additional nonstationarity that can be corrected by differencing the logarithms.

(a) Argue that the glacial varves series, say xt, exhibits heteroscedasticity by computing the sample variance over the first half and the second half of the data. Argue that the transformation yt = log xt stabilizes the variance over the series. Plot the histograms of xt and yt to see whether the approximation to normality is improved by transforming the data.
(b) Plot the series yt. Do any time intervals, of the order 100 years, exist where one can observe behavior comparable to that observed in the global temperature records in Figure 1.2?
(c) Examine the sample ACF of yt and comment.
(d) Compute the difference ut = yt − yt−1, examine its time plot and sample ACF, and argue that differencing the logged varve data produces a reasonably stationary series. Can you think of a practical interpretation for ut? Hint: For |p| close to zero, log(1 + p) ≈ p; let p = (yt − yt−1)/yt−1.
(e) Based on the sample ACF of the differenced transformed series computed in (c), argue that a generalization of the model given by Example 1.23 might be reasonable. Assume

ut = µ + wt − θwt−1

is stationary when the inputs wt are assumed independent with mean 0 and variance σ²w. Show that
$$\gamma_u(h) = \begin{cases}\sigma_w^2(1+\theta^2) & \text{if } h = 0,\\ -\theta\,\sigma_w^2 & \text{if } h = \pm 1,\\ 0 & \text{if } |h| > 1.\end{cases}$$


(f) Based on part (e), use ρ̂u(1) and the estimate of the variance of ut, γ̂u(0), to derive estimates of θ and σ²w. This is an application of the method of moments from classical statistics, where estimators of the parameters are derived by equating sample moments to theoretical moments.

2.9 In this problem, we will explore the periodic nature of St, the SOI series displayed in Figure 1.5.

(a) Detrend the series by fitting a regression of St on time t. Is there a significant trend in the sea surface temperature? Comment.
(b) Calculate the periodogram for the detrended series obtained in part (a). Identify the frequencies of the two main peaks (with an obvious one at the frequency of one cycle every 12 months). What is the probable El Niño cycle indicated by the minor peak?

2.10 Consider the model (2.46) used in Example 2.9,
$$x_t = \sum_{j=0}^{n}\;\beta_1(j/n)\cos(2\pi t j/n) + \beta_2(j/n)\sin(2\pi t j/n).$$

(a) Display the model design matrix Z [see (2.5)] for n = 4.
(b) Show numerically that the columns of Z in part (a) satisfy part (d) and then display (Z′Z)⁻¹ for this case.
(c) If x1, x2, x3, x4 are four observations, write the estimates of the four betas, β1(0), β1(1/4), β2(1/4), β1(1/2), in terms of the observations.
(d) Verify that for any positive integer n and j, k = 0, 1, . . . , [[n/2]], where [[·]] denotes the greatest integer function:¹⁰

(i) Except for j = 0 or j = n/2,
$$\sum_{t=1}^{n}\cos^2(2\pi t j/n) = \sum_{t=1}^{n}\sin^2(2\pi t j/n) = n/2.$$
(ii) When j = 0 or j = n/2,
$$\sum_{t=1}^{n}\cos^2(2\pi t j/n) = n \quad\text{but}\quad \sum_{t=1}^{n}\sin^2(2\pi t j/n) = 0.$$
(iii) For j ≠ k,
$$\sum_{t=1}^{n}\cos(2\pi t j/n)\cos(2\pi t k/n) = \sum_{t=1}^{n}\sin(2\pi t j/n)\sin(2\pi t k/n) = 0.$$
Also, for any j and k,
$$\sum_{t=1}^{n}\cos(2\pi t j/n)\sin(2\pi t k/n) = 0.$$

¹⁰ Some useful facts: 2 cos(α) = e^{iα} + e^{−iα}, 2i sin(α) = e^{iα} − e^{−iα}, and $\sum_{t=1}^{n} z^t = z(1-z^n)/(1-z)$ for z ≠ 1.


Section 2.4

2.11 Consider the two weekly time series oil and gas. The oil series is in dollars per barrel, while the gas series is in cents per gallon; see Appendix R for details.

(a) Plot the data on the same graph. Which of the simulated series displayed in §1.3 do these series most resemble? Do you believe the series are stationary (explain your answer)?
(b) In economics, it is often the percentage change in price (termed growth rate or return), rather than the absolute price change, that is important. Argue that a transformation of the form yt = ∇ log xt might be applied to the data, where xt is the oil or gas price series [see the hint in Problem 2.8(d)].
(c) Transform the data as described in part (b), plot the data on the same graph, look at the sample ACFs of the transformed data, and comment. [Hint: poil = diff(log(oil)) and pgas = diff(log(gas)).]
(d) Plot the CCF of the transformed data and comment. The small, but significant values when gas leads oil might be considered as feedback. [Hint: ccf(poil, pgas) will have poil leading for negative lag values.]
(e) Exhibit scatterplots of the oil and gas growth rate series for up to three weeks of lead time of oil prices; include a nonparametric smoother in each plot and comment on the results (e.g., Are there outliers? Are the relationships linear?). [Hint: lag.plot2(poil, pgas, 3).]
(f) There have been a number of studies questioning whether gasoline prices respond more quickly when oil prices are rising than when oil prices are falling (“asymmetry”). We will attempt to explore this question here with simple lagged regression; we will ignore some obvious problems such as outliers and autocorrelated errors, so this will not be a definitive analysis. Let Gt and Ot denote the gas and oil growth rates.

(i) Fit the regression (and comment on the results)

Gt = α1 + α2It + β1Ot + β2Ot−1 + wt,

where It = 1 if Ot ≥ 0 and 0 otherwise (It is the indicator of no growth or positive growth in oil price). Hint:

1 indi = ifelse(poil < 0, 0, 1)

2 mess = ts.intersect(pgas, poil, poilL = lag(poil,-1), indi)

3 summary(fit <- lm(pgas~ poil + poilL + indi, data=mess))

(ii) What is the fitted model when there is negative growth in oil price at time t? What is the fitted model when there is no or positive growth in oil price? Do these results support the asymmetry hypothesis?

(iii) Analyze the residuals from the fit and comment.

2.12 Use two different smoothing techniques described in §2.4 to estimate the trend in the global temperature series displayed in Figure 1.2. Comment.


3

ARIMA Models

3.1 Introduction

In Chapters 1 and 2, we introduced autocorrelation and cross-correlation functions (ACFs and CCFs) as tools for clarifying relations that may occur within and between time series at various lags. In addition, we explained how to build linear models based on classical regression theory for exploiting the associations indicated by large values of the ACF or CCF. The time domain, or regression, methods of this chapter are appropriate when we are dealing with possibly nonstationary, shorter time series; these series are the rule rather than the exception in many applications. In addition, if the emphasis is on forecasting future values, then the problem is easily treated as a regression problem. This chapter develops a number of regression techniques for time series that are all related to classical ordinary and weighted or correlated least squares.

Classical regression is often insufficient for explaining all of the interesting dynamics of a time series. For example, the ACF of the residuals of the simple linear regression fit to the global temperature data (see Example 2.4 of Chapter 2) reveals additional structure in the data that the regression did not capture. Instead, the introduction of correlation as a phenomenon that may be generated through lagged linear relations leads to proposing the autoregressive (AR) and autoregressive moving average (ARMA) models. Adding nonstationary models to the mix leads to the autoregressive integrated moving average (ARIMA) model popularized in the landmark work by Box and Jenkins (1970). The Box–Jenkins method for identifying a plausible ARIMA model is given in this chapter along with techniques for parameter estimation and forecasting for these models. A partial theoretical justification of the use of ARMA models is discussed in Appendix B, §B.4.


3.2 Autoregressive Moving Average Models

The classical regression model of Chapter 2 was developed for the static case, namely, we only allow the dependent variable to be influenced by current values of the independent variables. In the time series case, it is desirable to allow the dependent variable to be influenced by the past values of the independent variables and possibly by its own past values. If the present can be plausibly modeled in terms of only the past values of the independent inputs, we have the enticing prospect that forecasting will be possible.

Introduction to Autoregressive Models

Autoregressive models are based on the idea that the current value of the series, xt, can be explained as a function of p past values, xt−1, xt−2, . . . , xt−p, where p determines the number of steps into the past needed to forecast the current value. As a typical case, recall Example 1.10 in which data were generated using the model

xt = xt−1 − .90xt−2 + wt,

where wt is white Gaussian noise with σ²w = 1. We have now assumed the current value is a particular linear function of past values. The regularity that persists in Figure 1.9 gives an indication that forecasting for such a model might be a distinct possibility, say, through some version such as
$$x^n_{n+1} = x_n - .90\,x_{n-1},$$
where the quantity on the left-hand side denotes the forecast at the next period n + 1 based on the observed data, x1, x2, . . . , xn. We will make this notion more precise in our discussion of forecasting (§3.5).

The extent to which it might be possible to forecast a real data series fromits own past values can be assessed by looking at the autocorrelation functionand the lagged scatterplot matrices discussed in Chapter 2. For example, thelagged scatterplot matrix for the Southern Oscillation Index (SOI), shownin Figure 2.7, gives a distinct indication that lags 1 and 2, for example, arelinearly associated with the current value. The ACF shown in Figure 1.14shows relatively large positive values at lags 1, 2, 12, 24, and 36 and largenegative values at 18, 30, and 42. We note also the possible relation betweenthe SOI and Recruitment series indicated in the scatterplot matrix shown inFigure 2.8. We will indicate in later sections on transfer function and vectorAR modeling how to handle the dependence on values taken by other series.

The preceding discussion motivates the following definition.

Definition 3.1 An autoregressive model of order p, abbreviated AR(p),is of the form

xt = φ1xt−1 + φ2xt−2 + · · ·+ φpxt−p + wt, (3.1)


where xt is stationary, and φ1, φ2, . . . , φp are constants (φp ≠ 0). Although it is not necessary yet, we assume that wt is a Gaussian white noise series with mean zero and variance σ²w, unless otherwise stated. The mean of xt in (3.1) is zero. If the mean, µ, of xt is not zero, replace xt by xt − µ in (3.1),

xt − µ = φ1(xt−1 − µ) + φ2(xt−2 − µ) + · · · + φp(xt−p − µ) + wt,

or write

xt = α + φ1xt−1 + φ2xt−2 + · · · + φpxt−p + wt, (3.2)

where α = µ(1− φ1 − · · · − φp).

We note that (3.2) is similar to the regression model of §2.2, and hence the term auto (or self) regression. Some technical difficulties, however, develop from applying that model because the regressors, xt−1, . . . , xt−p, are random components, whereas zt was assumed to be fixed. A useful form follows by using the backshift operator (2.33) to write the AR(p) model, (3.1), as

(1− φ1B − φ2B2 − · · · − φpBp)xt = wt, (3.3)

or even more concisely as

φ(B)xt = wt. (3.4)

The properties of φ(B) are important in solving (3.4) for xt. This leads to the following definition.

Definition 3.2 The autoregressive operator is defined to be

φ(B) = 1− φ1B − φ2B2 − · · · − φpBp. (3.5)

We initiate the investigation of AR models by considering the first-order model, AR(1), given by xt = φxt−1 + wt. Iterating backwards k times, we get
$$x_t = \phi x_{t-1} + w_t = \phi(\phi x_{t-2} + w_{t-1}) + w_t = \phi^2 x_{t-2} + \phi w_{t-1} + w_t = \cdots = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j}.$$
This method suggests that, by continuing to iterate backward, and provided that |φ| < 1 and xt is stationary, we can represent an AR(1) model as a linear process given by¹
$$x_t = \sum_{j=0}^{\infty}\phi^j w_{t-j}. \qquad (3.6)$$

¹ Note that $\lim_{k\to\infty} E\bigl(x_t - \sum_{j=0}^{k-1}\phi^j w_{t-j}\bigr)^2 = \lim_{k\to\infty}\phi^{2k}E\bigl(x^2_{t-k}\bigr) = 0$, so (3.6) exists in the mean square sense (see Appendix A for a definition).


The AR(1) process defined by (3.6) is stationary with mean
$$E(x_t) = \sum_{j=0}^{\infty}\phi^j E(w_{t-j}) = 0,$$
and autocovariance function,
$$\begin{aligned}
\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) &= E\left[\Bigl(\sum_{j=0}^{\infty}\phi^j w_{t+h-j}\Bigr)\Bigl(\sum_{k=0}^{\infty}\phi^k w_{t-k}\Bigr)\right]\\
&= E\bigl[(w_{t+h} + \cdots + \phi^h w_t + \phi^{h+1}w_{t-1} + \cdots)(w_t + \phi w_{t-1} + \cdots)\bigr]\\
&= \sigma_w^2\sum_{j=0}^{\infty}\phi^{h+j}\phi^j = \sigma_w^2\,\phi^h\sum_{j=0}^{\infty}\phi^{2j} = \frac{\sigma_w^2\,\phi^h}{1-\phi^2}, \quad h \ge 0. \qquad (3.7)
\end{aligned}$$

Recall that γ(h) = γ(−h), so we will only exhibit the autocovariance function for h ≥ 0. From (3.7), the ACF of an AR(1) is
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h, \quad h \ge 0, \qquad (3.8)$$

and ρ(h) satisfies the recursion

ρ(h) = φ ρ(h− 1), h = 1, 2, . . . . (3.9)

We will discuss the ACF of a general AR(p) model in §3.4.
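As a quick numerical check of (3.8), the theoretical ACF of an ARMA model can be computed in R with ARMAacf; a minimal sketch, using the value φ = .9 that appears in Example 3.1 below:

phi <- .9
rbind(ARMAacf = ARMAacf(ar = phi, lag.max = 5),  # theoretical ACF at lags 0,...,5
      closed  = phi^(0:5))                       # closed form (3.8); the two rows agree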

Example 3.1 The Sample Path of an AR(1) Process

Figure 3.1 shows a time plot of two AR(1) processes, one with φ = .9 and one with φ = −.9; in both cases, σ²w = 1. In the first case, ρ(h) = .9^h, for h ≥ 0, so observations close together in time are positively correlated with each other. This result means that observations at contiguous time points will tend to be close in value to each other; this fact shows up in the top of Figure 3.1 as a very smooth sample path for xt. Now, contrast this with the case in which φ = −.9, so that ρ(h) = (−.9)^h, for h ≥ 0. This result means that observations at contiguous time points are negatively correlated but observations two time points apart are positively correlated. This fact shows up in the bottom of Figure 3.1, where, for example, if an observation, xt, is positive, the next observation, xt+1, is typically negative, and the next observation, xt+2, is typically positive. Thus, in this case, the sample path is very choppy.

The following R code can be used to obtain a figure similar to Figure 3.1:
1 par(mfrow=c(2,1))

2 plot(arima.sim(list(order=c(1,0,0), ar=.9), n=100), ylab="x",

main=(expression(AR(1)~~~phi==+.9)))

3 plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=100), ylab="x",

main=(expression(AR(1)~~~phi==-.9)))


Fig. 3.1. Simulated AR(1) models: φ = .9 (top); φ = −.9 (bottom).

Example 3.2 Explosive AR Models and Causality

In Example 1.18, it was discovered that the random walk xt = xt−1 + wt is not stationary. We might wonder whether there is a stationary AR(1) process with |φ| > 1. Such processes are called explosive because the values of the time series quickly become large in magnitude. Clearly, because |φ|^j increases without bound as j → ∞, $\sum_{j=0}^{k-1}\phi^j w_{t-j}$ will not converge (in mean square) as k → ∞, so the intuition used to get (3.6) will not work directly. We can, however, modify that argument to obtain a stationary model as follows. Write xt+1 = φxt + wt+1, in which case,
$$x_t = \phi^{-1}x_{t+1} - \phi^{-1}w_{t+1} = \phi^{-1}\bigl(\phi^{-1}x_{t+2} - \phi^{-1}w_{t+2}\bigr) - \phi^{-1}w_{t+1} = \cdots = \phi^{-k}x_{t+k} - \sum_{j=1}^{k-1}\phi^{-j}w_{t+j}, \qquad (3.10)$$
by iterating forward k steps. Because |φ|⁻¹ < 1, this result suggests the stationary future dependent AR(1) model


$$x_t = -\sum_{j=1}^{\infty}\phi^{-j}w_{t+j}. \qquad (3.11)$$
The reader can verify that this is stationary and of the AR(1) form xt = φxt−1 + wt. Unfortunately, this model is useless because it requires us to know the future to be able to predict the future. When a process does not depend on the future, such as the AR(1) when |φ| < 1, we will say the process is causal. In the explosive case of this example, the process is stationary, but it is also future dependent, and not causal.

Example 3.3 Every Explosion Has a Cause

Excluding explosive models from consideration is not a problem because the models have causal counterparts. For example, if
$$x_t = \phi x_{t-1} + w_t \quad\text{with}\quad |\phi| > 1$$
and wt ∼ iid N(0, σ²w), then using (3.11), {xt} is a non-causal stationary Gaussian process with E(xt) = 0 and
$$\gamma_x(h) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}\Bigl(-\sum_{j=1}^{\infty}\phi^{-j}w_{t+h+j},\; -\sum_{k=1}^{\infty}\phi^{-k}w_{t+k}\Bigr) = \sigma_w^2\,\phi^{-2}\,\phi^{-h}\big/(1-\phi^{-2}).$$
Thus, using (3.7), the causal process defined by
$$y_t = \phi^{-1}y_{t-1} + v_t,$$
where vt ∼ iid N(0, σ²wφ⁻²), is stochastically equal to the xt process (i.e., all finite distributions of the processes are the same). For example, if xt = 2xt−1 + wt with σ²w = 1, then yt = ½yt−1 + vt with σ²v = 1/4 is an equivalent causal process (see Problem 3.3). This concept generalizes to higher orders, but it is easier to show using Chapter 4 techniques; see Example 4.7.
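For a small numerical sanity check (an illustration, not part of the original text), one can verify in R that the two autocovariance expressions coincide for the values φ = 2 and σ²w = 1 used above:

phi <- 2; sig2w <- 1; h <- 0:5
gx <- sig2w * phi^(-2) * phi^(-h) / (1 - phi^(-2))      # gamma_x(h) from the display above
gy <- (sig2w * phi^(-2)) * (1/phi)^h / (1 - (1/phi)^2)  # gamma_y(h) from (3.7) with phi^(-1) and sigma_v^2
rbind(gx, gy)                                           # the two rows are identical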

The technique of iterating backward to get an idea of the stationary solution of AR models works well when p = 1, but not for larger orders. A general technique is that of matching coefficients. Consider the AR(1) model in operator form

φ(B)xt = wt, (3.12)

where φ(B) = 1 − φB, and |φ| < 1. Also, write the model in equation (3.6) using operator form as
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B)w_t, \qquad (3.13)$$


where $\psi(B) = \sum_{j=0}^{\infty}\psi_j B^j$ and ψj = φ^j. Suppose we did not know that ψj = φ^j. We could substitute ψ(B)wt from (3.13) for xt in (3.12) to obtain

φ(B)ψ(B)wt = wt. (3.14)

The coefficients of B on the left-hand side of (3.14) must be equal to those on the right-hand side of (3.14), which means
$$(1-\phi B)(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_j B^j + \cdots) = 1. \qquad (3.15)$$

Reorganizing the coefficients in (3.15),

1 + (ψ1 − φ)B + (ψ2 − ψ1φ)B2 + · · ·+ (ψj − ψj−1φ)Bj + · · · = 1,

we see that for each j = 1, 2, . . ., the coefficient of B^j on the left must be zero because it is zero on the right. The coefficient of B on the left is (ψ1 − φ), and equating this to zero, ψ1 − φ = 0, leads to ψ1 = φ. Continuing, the coefficient of B² is (ψ2 − ψ1φ), so ψ2 = φ². In general,

ψj = ψj−1φ,

with ψ0 = 1, which leads to the solution ψj = φ^j.

Another way to think about the operations we just performed is to consider the AR(1) model in operator form, φ(B)xt = wt. Now multiply both sides by φ⁻¹(B) (assuming the inverse operator exists) to get

φ−1(B)φ(B)xt = φ−1(B)wt,

or

xt = φ⁻¹(B)wt.

We know already that

φ−1(B) = 1 + φB + φ2B2 + · · ·+ φjBj + · · · ,

that is, φ⁻¹(B) is ψ(B) in (3.13). Thus, we notice that working with operators is like working with polynomials. That is, consider the polynomial φ(z) = 1 − φz, where z is a complex number and |φ| < 1. Then,
$$\phi^{-1}(z) = \frac{1}{1-\phi z} = 1 + \phi z + \phi^2 z^2 + \cdots + \phi^j z^j + \cdots, \quad |z| \le 1,$$
and the coefficients of B^j in φ⁻¹(B) are the same as the coefficients of z^j in φ⁻¹(z). In other words, we may treat the backshift operator, B, as a complex number, z. These results will be generalized in our discussion of ARMA models. We will find the polynomials corresponding to the operators useful in exploring the general properties of ARMA models.
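A one-line numerical check (an illustration, assuming φ = .5) that the ψ-weights of an AR(1) are indeed φ^j can be done with the R function ARMAtoMA, which reappears in Example 3.11:

rbind(ARMAtoMA = ARMAtoMA(ar = .5, ma = 0, lag.max = 8),  # psi_1, ..., psi_8
      closed   = .5^(1:8))                                # the two rows agree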


Introduction to Moving Average Models

As an alternative to the autoregressive representation in which the xt on the left-hand side of the equation are assumed to be combined linearly, the moving average model of order q, abbreviated as MA(q), assumes the white noise wt on the right-hand side of the defining equation are combined linearly to form the observed data.

Definition 3.3 The moving average model of order q, or MA(q) model, is defined to be

xt = wt + θ1wt−1 + θ2wt−2 + · · · + θqwt−q, (3.16)

where there are q lags in the moving average and θ1, θ2, . . . , θq (θq ≠ 0) are parameters.² Although it is not necessary yet, we assume that wt is a Gaussian white noise series with mean zero and variance σ²w, unless otherwise stated.

The system is the same as the infinite moving average defined as the linear process (3.13), where ψ0 = 1, ψj = θj, for j = 1, . . . , q, and ψj = 0 for other values. We may also write the MA(q) process in the equivalent form

xt = θ(B)wt, (3.17)

using the following definition.

Definition 3.4 The moving average operator is

θ(B) = 1 + θ1B + θ2B² + · · · + θqB^q. (3.18)

Unlike the autoregressive process, the moving average process is stationary for any values of the parameters θ1, . . . , θq; details of this result are provided in §3.4.

Example 3.4 The MA(1) Process

Consider the MA(1) model xt = wt + θwt−1. Then, E(xt) = 0,
$$\gamma(h) = \begin{cases}(1+\theta^2)\sigma_w^2 & h = 0,\\ \theta\,\sigma_w^2 & h = 1,\\ 0 & h > 1,\end{cases}$$
and the ACF is
$$\rho(h) = \begin{cases}\dfrac{\theta}{1+\theta^2} & h = 1,\\[4pt] 0 & h > 1.\end{cases}$$
Note |ρ(1)| ≤ 1/2 for all values of θ (Problem 3.1). Also, xt is correlated with xt−1, but not with xt−2, xt−3, . . . . Contrast this with the case of the AR(1) model in which the correlation between xt and xt−k is never zero. When θ = .5, for example, xt and xt−1 are positively correlated, and ρ(1) = .4. When θ = −.5, xt and xt−1 are negatively correlated, ρ(1) = −.4. Figure 3.2 shows a time plot of these two processes with σ²w = 1. The series in Figure 3.2 where θ = .5 is smoother than the series in Figure 3.2, where θ = −.5.

² Some texts and software packages write the MA model with negative coefficients; that is, xt = wt − θ1wt−1 − θ2wt−2 − · · · − θqwt−q.

Fig. 3.2. Simulated MA(1) models: θ = .5 (top); θ = −.5 (bottom).

A figure similar to Figure 3.2 can be created in R as follows:
1 par(mfrow = c(2,1))

2 plot(arima.sim(list(order=c(0,0,1), ma=.5), n=100), ylab="x",

main=(expression(MA(1)~~~theta==+.5)))

3 plot(arima.sim(list(order=c(0,0,1), ma=-.5), n=100), ylab="x",

main=(expression(MA(1)~~~theta==-.5)))
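As a brief check of the ACF values quoted above (an illustration, not part of the original text), ARMAacf gives ρ(1) = ±.4 and zero thereafter:

ARMAacf(ma =  .5, lag.max = 3)   # rho(1) =  .4; rho(h) = 0 for h > 1
ARMAacf(ma = -.5, lag.max = 3)   # rho(1) = -.4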

Example 3.5 Non-uniqueness of MA Models and Invertibility

Using Example 3.4, we note that for an MA(1) model, ρ(h) is the same for θ and 1/θ; try 5 and 1/5, for example. In addition, the pair σ²w = 1 and θ = 5 yield the same autocovariance function as the pair σ²w = 25 and θ = 1/5, namely,
$$\gamma(h) = \begin{cases}26 & h = 0,\\ 5 & h = 1,\\ 0 & h > 1.\end{cases}$$
Thus, the MA(1) processes
$$x_t = w_t + \tfrac{1}{5}w_{t-1}, \quad w_t \sim \text{iid } N(0, 25)$$
and
$$y_t = v_t + 5v_{t-1}, \quad v_t \sim \text{iid } N(0, 1)$$
are the same because of normality (i.e., all finite distributions are the same). We can only observe the time series, xt or yt, and not the noise, wt or vt, so we cannot distinguish between the models. Hence, we will have to choose only one of them. For convenience, by mimicking the criterion of causality for AR models, we will choose the model with an infinite AR representation. Such a process is called an invertible process.

To discover which model is the invertible model, we can reverse the roles of xt and wt (because we are mimicking the AR case) and write the MA(1) model as wt = −θwt−1 + xt. Following the steps that led to (3.6), if |θ| < 1, then $w_t = \sum_{j=0}^{\infty}(-\theta)^j x_{t-j}$, which is the desired infinite AR representation of the model. Hence, given a choice, we will choose the model with σ²w = 25 and θ = 1/5 because it is invertible.
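A tiny numerical confirmation of the shared autocovariances (using the values from this example) is:

sig2 <- c(1, 25); theta <- c(5, 1/5)       # (sigma_w^2, theta) for the two parameterizations
rbind(gamma0 = sig2 * (1 + theta^2),       # both columns give 26
      gamma1 = sig2 * theta)               # both columns give 5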

As in the AR case, the polynomial, θ(z), corresponding to the moving average operators, θ(B), will be useful in exploring general properties of MA processes. For example, following the steps of equations (3.12)–(3.15), we can write the MA(1) model as xt = θ(B)wt, where θ(B) = 1 + θB. If |θ| < 1, then we can write the model as π(B)xt = wt, where π(B) = θ⁻¹(B). Let θ(z) = 1 + θz, for |z| ≤ 1, then $\pi(z) = \theta^{-1}(z) = 1/(1+\theta z) = \sum_{j=0}^{\infty}(-\theta)^j z^j$, and we determine that $\pi(B) = \sum_{j=0}^{\infty}(-\theta)^j B^j$.

Autoregressive Moving Average Models

We now proceed with the general development of autoregressive, moving average, and mixed autoregressive moving average (ARMA) models for stationary time series.

Definition 3.5 A time series {xt; t = 0, ±1, ±2, . . .} is ARMA(p, q) if it is stationary and

xt = φ1xt−1 + · · ·+ φpxt−p + wt + θ1wt−1 + · · ·+ θqwt−q, (3.19)

with φp ≠ 0, θq ≠ 0, and σ²w > 0. The parameters p and q are called the autoregressive and the moving average orders, respectively. If xt has a nonzero mean µ, we set α = µ(1 − φ1 − · · · − φp) and write the model as

xt = α+ φ1xt−1 + · · ·+ φpxt−p + wt + θ1wt−1 + · · ·+ θqwt−q. (3.20)


Although it is not necessary yet, we assume that wt is a Gaussian white noise series with mean zero and variance σ²w, unless otherwise stated.

As previously noted, when q = 0, the model is called an autoregressive model of order p, AR(p), and when p = 0, the model is called a moving average model of order q, MA(q). To aid in the investigation of ARMA models, it will be useful to write them using the AR operator, (3.5), and the MA operator, (3.18). In particular, the ARMA(p, q) model in (3.19) can then be written in concise form as

φ(B)xt = θ(B)wt. (3.21)

Before we discuss the conditions under which (3.19) is causal and invertible, we point out a potential problem with the ARMA model.

Example 3.6 Parameter Redundancy

Consider a white noise process xt = wt. Equivalently, we can write this as .5xt−1 = .5wt−1 by shifting back one unit of time and multiplying by .5. Now, subtract the two representations to obtain

xt − .5xt−1 = wt − .5wt−1,

or

xt = .5xt−1 − .5wt−1 + wt, (3.22)

which looks like an ARMA(1, 1) model. Of course, xt is still white noise; nothing has changed in this regard [i.e., xt = wt is the solution to (3.22)], but we have hidden the fact that xt is white noise because of the parameter redundancy or over-parameterization. Write the parameter redundant model in operator form as φ(B)xt = θ(B)wt, or

(1 − .5B)xt = (1 − .5B)wt.

Apply the operator φ(B)⁻¹ = (1 − .5B)⁻¹ to both sides to obtain

xt = (1 − .5B)⁻¹(1 − .5B)xt = (1 − .5B)⁻¹(1 − .5B)wt = wt,

which is the original model. We can easily detect the problem of over-parameterization with the use of the operators or their associated polynomials. That is, write the AR polynomial φ(z) = (1 − .5z), the MA polynomial θ(z) = (1 − .5z), and note that both polynomials have a common factor, namely (1 − .5z). This common factor immediately identifies the parameter redundancy. Discarding the common factor in each leaves φ(z) = 1 and θ(z) = 1, from which we conclude φ(B) = 1 and θ(B) = 1, and we deduce that the model is actually white noise. The consideration of parameter redundancy will be crucial when we discuss estimation for general ARMA models. As this example points out, we might fit an ARMA(1, 1) model to white noise data and find that the parameter estimates are significant. If we were unaware of parameter redundancy, we might claim the data are correlated when in fact they are not (Problem 3.20).
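The last point can be illustrated with a short simulation (a sketch with an arbitrary, assumed seed; the exact estimates will vary): fitting an ARMA(1,1) to simulated white noise can produce seemingly meaningful, nearly canceling AR and MA estimates.

set.seed(666)                                        # hypothetical seed for reproducibility
x <- rnorm(200)                                      # white noise
arima(x, order = c(1, 0, 1), include.mean = FALSE)   # the AR and MA estimates tend to nearly cancel (common factor)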


Examples 3.2, 3.5, and 3.6 point to a number of problems with the general definition of ARMA(p, q) models, as given by (3.19), or, equivalently, by (3.21). To summarize, we have seen the following problems:

(i) parameter redundant models,
(ii) stationary AR models that depend on the future, and
(iii) MA models that are not unique.

To overcome these problems, we will require some additional restrictions on the model parameters. First, we make the following definitions.

Definition 3.6 The AR and MA polynomials are defined as

φ(z) = 1 − φ1z − · · · − φpz^p, φp ≠ 0, (3.23)

and

θ(z) = 1 + θ1z + · · · + θqz^q, θq ≠ 0, (3.24)

respectively, where z is a complex number.

To address the first problem, we will henceforth refer to an ARMA(p, q) model to mean that it is in its simplest form. That is, in addition to the original definition given in equation (3.19), we will also require that φ(z) and θ(z) have no common factors. So, the process xt = .5xt−1 − .5wt−1 + wt, discussed in Example 3.6, is not referred to as an ARMA(1, 1) process because, in its reduced form, xt is white noise.

To address the problem of future-dependent models, we formally introduce the concept of causality.

Definition 3.7 An ARMA(p, q) model is said to be causal, if the time series {xt; t = 0, ±1, ±2, . . .} can be written as a one-sided linear process:
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B)w_t, \qquad (3.25)$$
where $\psi(B) = \sum_{j=0}^{\infty}\psi_j B^j$, and $\sum_{j=0}^{\infty}|\psi_j| < \infty$; we set ψ0 = 1.

In Example 3.2, the AR(1) process, xt = φxt−1 + wt, is causal only when |φ| < 1. Equivalently, the process is causal only when the root of φ(z) = 1 − φz is bigger than one in absolute value. That is, the root, say, z0, of φ(z) is z0 = 1/φ (because φ(z0) = 0) and |z0| > 1 because |φ| < 1. In general, we have the following property.

Property 3.1 Causality of an ARMA(p, q) Process
An ARMA(p, q) model is causal if and only if φ(z) ≠ 0 for |z| ≤ 1. The coefficients of the linear process given in (3.25) can be determined by solving
$$\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = \frac{\theta(z)}{\phi(z)}, \quad |z| \le 1.$$


Another way to phrase Property 3.1 is that an ARMA process is causal only when the roots of φ(z) lie outside the unit circle; that is, φ(z) = 0 only when |z| > 1. Finally, to address the problem of uniqueness discussed in Example 3.5, we choose the model that allows an infinite autoregressive representation.

Definition 3.8 An ARMA(p, q) model is said to be invertible, if the time series {xt; t = 0, ±1, ±2, . . .} can be written as
$$\pi(B)x_t = \sum_{j=0}^{\infty}\pi_j x_{t-j} = w_t, \qquad (3.26)$$
where $\pi(B) = \sum_{j=0}^{\infty}\pi_j B^j$, and $\sum_{j=0}^{\infty}|\pi_j| < \infty$; we set π0 = 1.

Analogous to Property 3.1, we have the following property.

Property 3.2 Invertibility of an ARMA(p, q) Process
An ARMA(p, q) model is invertible if and only if θ(z) ≠ 0 for |z| ≤ 1. The coefficients πj of π(B) given in (3.26) can be determined by solving
$$\pi(z) = \sum_{j=0}^{\infty}\pi_j z^j = \frac{\phi(z)}{\theta(z)}, \quad |z| \le 1.$$

Another way to phrase Property 3.2 is that an ARMA process is invertible only when the roots of θ(z) lie outside the unit circle; that is, θ(z) = 0 only when |z| > 1. The proof of Property 3.1 is given in Appendix B (the proof of Property 3.2 is similar and, hence, is not provided). The following examples illustrate these concepts.

Example 3.7 Parameter Redundancy, Causality, Invertibility

Consider the process

xt = .4xt−1 + .45xt−2 + wt + wt−1 + .25wt−2,

or, in operator form,

(1− .4B − .45B2)xt = (1 +B + .25B2)wt.

At first, xt appears to be an ARMA(2, 2) process. But, the associated polynomials

φ(z) = 1 − .4z − .45z² = (1 + .5z)(1 − .9z)
θ(z) = (1 + z + .25z²) = (1 + .5z)²

have a common factor that can be canceled. After cancellation, the polynomials become φ(z) = (1 − .9z) and θ(z) = (1 + .5z), so the model is an ARMA(1, 1) model, (1 − .9B)xt = (1 + .5B)wt, or


xt = .9xt−1 + .5wt−1 + wt. (3.27)

The model is causal because φ(z) = (1 − .9z) = 0 when z = 10/9, which is outside the unit circle. The model is also invertible because the root of θ(z) = (1 + .5z) is z = −2, which is outside the unit circle.

To write the model as a linear process, we can obtain the ψ-weights using Property 3.1, φ(z)ψ(z) = θ(z), or
$$(1-.9z)(\psi_0 + \psi_1 z + \psi_2 z^2 + \cdots) = (1+.5z).$$
Matching coefficients we get ψ0 = 1, ψ1 = .5 + .9 = 1.4, and ψj = .9ψj−1 for j > 1. Thus, ψj = 1.4(.9)^{j−1} for j ≥ 1 and (3.27) can be written as
$$x_t = w_t + 1.4\sum_{j=1}^{\infty}.9^{\,j-1}w_{t-j}.$$
Similarly, the invertible representation using Property 3.2 is
$$x_t = 1.4\sum_{j=1}^{\infty}(-.5)^{\,j-1}x_{t-j} + w_t.$$

Example 3.8 Causal Conditions for an AR(2) Process

For an AR(1) model, (1 − φB)xt = wt, to be causal, the root of φ(z) = 1 − φz must lie outside of the unit circle. In this case, the root (or zero) occurs at z0 = 1/φ [i.e., φ(z0) = 0], so it is easy to go from the causal requirement on the root, |1/φ| > 1, to a requirement on the parameter, |φ| < 1. It is not so easy to establish this relationship for higher order models.

For example, the AR(2) model, (1 − φ1B − φ2B²)xt = wt, is causal when the two roots of φ(z) = 1 − φ1z − φ2z² lie outside of the unit circle. Using the quadratic formula, this requirement can be written as
$$\left|\frac{\phi_1 \pm \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}\right| > 1.$$
The roots of φ(z) may be real and distinct, real and equal, or a complex conjugate pair. If we denote those roots by z1 and z2, we can write $\phi(z) = (1 - z_1^{-1}z)(1 - z_2^{-1}z)$; note that φ(z1) = φ(z2) = 0. The model can be written in operator form as $(1 - z_1^{-1}B)(1 - z_2^{-1}B)x_t = w_t$. From this representation, it follows that $\phi_1 = (z_1^{-1} + z_2^{-1})$ and $\phi_2 = -(z_1 z_2)^{-1}$. This relationship and the fact that |z1| > 1 and |z2| > 1 can be used to establish the following equivalent condition for causality:

φ1 + φ2 < 1, φ2 − φ1 < 1, and |φ2| < 1. (3.28)

This causality condition specifies a triangular region in the parameter space. We leave the details of the equivalence to the reader (Problem 3.5).
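Both forms of the causal condition can be checked numerically in R; a minimal sketch using φ1 = 1.5 and φ2 = −.75 (the values of Example 3.10):

phi1 <- 1.5; phi2 <- -.75
abs(polyroot(c(1, -phi1, -phi2)))                   # moduli of the roots of phi(z); both exceed 1
c(phi1 + phi2 < 1, phi2 - phi1 < 1, abs(phi2) < 1)  # the equivalent condition (3.28); all TRUE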


3.3 Difference Equations

The study of the behavior of ARMA processes and their ACFs is greatly enhanced by a basic knowledge of difference equations, simply because they are difference equations. This topic is also useful in the study of time domain models and stochastic processes in general. We will give a brief and heuristic account of the topic along with some examples of the usefulness of the theory. For details, the reader is referred to Mickens (1990).

Suppose we have a sequence of numbers u0, u1, u2, . . . such that

un − αun−1 = 0, α ≠ 0, n = 1, 2, . . . . (3.29)

For example, recall (3.9) in which we showed that the ACF of an AR(1) process is a sequence, ρ(h), satisfying

ρ(h) − φρ(h − 1) = 0, h = 1, 2, . . . .

Equation (3.29) represents a homogeneous difference equation of order 1. To solve the equation, we write:

u1 = αu0
u2 = αu1 = α²u0
⋮
un = αun−1 = αⁿu0.

Given an initial condition u0 = c, we may solve (3.29), namely, un = αⁿc.

In operator notation, (3.29) can be written as (1 − αB)un = 0. The polynomial associated with (3.29) is α(z) = 1 − αz, and the root, say, z0, of this polynomial is z0 = 1/α; that is α(z0) = 0. We know a solution (in fact, the solution) to (3.29), with initial condition u0 = c, is
$$u_n = \alpha^n c = \bigl(z_0^{-1}\bigr)^n c. \qquad (3.30)$$
That is, the solution to the difference equation (3.29) depends only on the initial condition and the inverse of the root to the associated polynomial α(z).

Now suppose that the sequence satisfies

un − α1un−1 − α2un−2 = 0, α2 ≠ 0, n = 2, 3, . . . (3.31)

This equation is a homogeneous difference equation of order 2. The corresponding polynomial is

α(z) = 1 − α1z − α2z²,

which has two roots, say, z1 and z2; that is, α(z1) = α(z2) = 0. We will consider two cases. First suppose z1 ≠ z2. Then the general solution to (3.31) is


$$u_n = c_1 z_1^{-n} + c_2 z_2^{-n}, \qquad (3.32)$$
where c1 and c2 depend on the initial conditions. The claim that (3.32) is a solution can be verified by direct substitution into (3.31):
$$\begin{aligned}
\bigl(c_1 z_1^{-n} + c_2 z_2^{-n}\bigr) &- \alpha_1\bigl(c_1 z_1^{-(n-1)} + c_2 z_2^{-(n-1)}\bigr) - \alpha_2\bigl(c_1 z_1^{-(n-2)} + c_2 z_2^{-(n-2)}\bigr)\\
&= c_1 z_1^{-n}\bigl(1 - \alpha_1 z_1 - \alpha_2 z_1^2\bigr) + c_2 z_2^{-n}\bigl(1 - \alpha_1 z_2 - \alpha_2 z_2^2\bigr)\\
&= c_1 z_1^{-n}\alpha(z_1) + c_2 z_2^{-n}\alpha(z_2) = 0.
\end{aligned}$$
Given two initial conditions u0 and u1, we may solve for c1 and c2:
$$u_0 = c_1 + c_2 \quad\text{and}\quad u_1 = c_1 z_1^{-1} + c_2 z_2^{-1},$$
where z1 and z2 can be solved for in terms of α1 and α2 using the quadratic formula, for example.

When the roots are equal, z1 = z2 (= z0), a general solution to (3.31) is

$$u_n = z_0^{-n}(c_1 + c_2 n). \qquad (3.33)$$

This claim can also be verified by direct substitution of (3.33) into (3.31):
$$\begin{aligned}
z_0^{-n}(c_1 + c_2 n) &- \alpha_1\bigl(z_0^{-(n-1)}[c_1 + c_2(n-1)]\bigr) - \alpha_2\bigl(z_0^{-(n-2)}[c_1 + c_2(n-2)]\bigr)\\
&= z_0^{-n}(c_1 + c_2 n)\bigl(1 - \alpha_1 z_0 - \alpha_2 z_0^2\bigr) + c_2 z_0^{-n+1}(\alpha_1 + 2\alpha_2 z_0)\\
&= c_2 z_0^{-n+1}(\alpha_1 + 2\alpha_2 z_0).
\end{aligned}$$
To show that (α1 + 2α2z0) = 0, write $1 - \alpha_1 z - \alpha_2 z^2 = (1 - z_0^{-1}z)^2$, and take derivatives with respect to z on both sides of the equation to obtain $(\alpha_1 + 2\alpha_2 z) = 2z_0^{-1}(1 - z_0^{-1}z)$. Thus, $(\alpha_1 + 2\alpha_2 z_0) = 2z_0^{-1}(1 - z_0^{-1}z_0) = 0$, as was to be shown. Finally, given two initial conditions, u0 and u1, we can solve for c1 and c2:
$$u_0 = c_1 \quad\text{and}\quad u_1 = (c_1 + c_2)z_0^{-1}.$$

It can also be shown that these solutions are unique.

To summarize these results, in the case of distinct roots, the solution to the homogeneous difference equation of degree two was
$$u_n = z_1^{-n}\times(\text{a polynomial in } n \text{ of degree } m_1 - 1) + z_2^{-n}\times(\text{a polynomial in } n \text{ of degree } m_2 - 1), \qquad (3.34)$$
where m1 is the multiplicity of the root z1 and m2 is the multiplicity of the root z2. In this example, of course, m1 = m2 = 1, and we called the polynomials of degree zero c1 and c2, respectively. In the case of the repeated root, the solution was
$$u_n = z_0^{-n}\times(\text{a polynomial in } n \text{ of degree } m_0 - 1), \qquad (3.35)$$
where m0 is the multiplicity of the root z0; that is, m0 = 2. In this case, we wrote the polynomial of degree one as c1 + c2n. In both cases, we solved for c1 and c2 given two initial conditions, u0 and u1.
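A short numerical illustration (with arbitrary, assumed values α1 = .5, α2 = .3 and u0 = 1, u1 = .7, chosen so the roots are real and distinct) confirms that the closed form (3.32) reproduces the recursion (3.31):

a1 <- .5; a2 <- .3                               # illustrative coefficients, not from the text
z  <- Re(polyroot(c(1, -a1, -a2)))               # roots of alpha(z) = 1 - a1*z - a2*z^2 (real here)
u0 <- 1; u1 <- .7                                # arbitrary initial conditions
cs <- solve(rbind(c(1, 1), 1/z), c(u0, u1))      # solve u0 = c1 + c2, u1 = c1/z1 + c2/z2
n  <- 0:10
u_closed <- cs[1]*z[1]^(-n) + cs[2]*z[2]^(-n)    # closed-form solution (3.32)
u_rec <- numeric(11); u_rec[1:2] <- c(u0, u1)
for (k in 3:11) u_rec[k] <- a1*u_rec[k-1] + a2*u_rec[k-2]   # recursion (3.31)
round(cbind(u_closed, u_rec), 6)                 # the two columns agree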


Example 3.9 The ACF of an AR(2) Process

Suppose xt = φ1xt−1 + φ2xt−2 + wt is a causal AR(2) process. Multiply each side of the model by xt−h for h > 0, and take expectation:

E(xtxt−h) = φ1E(xt−1xt−h) + φ2E(xt−2xt−h) + E(wtxt−h).

The result is

γ(h) = φ1γ(h− 1) + φ2γ(h− 2), h = 1, 2, . . . . (3.36)

In (3.36), we used the fact that E(xt) = 0 and for h > 0,
$$E(w_t x_{t-h}) = E\Bigl(w_t\sum_{j=0}^{\infty}\psi_j w_{t-h-j}\Bigr) = 0.$$
Divide (3.36) through by γ(0) to obtain the difference equation for the ACF of the process:

ρ(h)− φ1ρ(h− 1)− φ2ρ(h− 2) = 0, h = 1, 2, . . . . (3.37)

The initial conditions are ρ(0) = 1 and ρ(−1) = φ1/(1 − φ2), which is obtained by evaluating (3.37) for h = 1 and noting that ρ(1) = ρ(−1).

Using the results for the homogeneous difference equation of order two, let z1 and z2 be the roots of the associated polynomial, φ(z) = 1 − φ1z − φ2z². Because the model is causal, we know the roots are outside the unit circle: |z1| > 1 and |z2| > 1. Now, consider the solution for three cases:

(i) When z1 and z2 are real and distinct, then
$$\rho(h) = c_1 z_1^{-h} + c_2 z_2^{-h},$$
so ρ(h) → 0 exponentially fast as h → ∞.
(ii) When z1 = z2 (= z0) are real and equal, then
$$\rho(h) = z_0^{-h}(c_1 + c_2 h),$$
so ρ(h) → 0 exponentially fast as h → ∞.
(iii) When z1 = z̄2 are a complex conjugate pair, then c2 = c̄1 (because ρ(h) is real), and
$$\rho(h) = c_1 z_1^{-h} + \bar{c}_1\,\bar{z}_1^{-h}.$$
Write c1 and z1 in polar coordinates, for example, z1 = |z1|e^{iθ}, where θ is the angle whose tangent is the ratio of the imaginary part and the real part of z1 (sometimes called arg(z1); the range of θ is [−π, π]). Then, using the fact that e^{iα} + e^{−iα} = 2 cos(α), the solution has the form
$$\rho(h) = a\,|z_1|^{-h}\cos(h\theta + b),$$
where a and b are determined by the initial conditions. Again, ρ(h) dampens to zero exponentially fast as h → ∞, but it does so in a sinusoidal fashion. The implication of this result is shown in the next example.


Example 3.10 An AR(2) with Complex Roots

Figure 3.3 shows n = 144 observations from the AR(2) model

xt = 1.5xt−1 − .75xt−2 + wt,

with σ²w = 1, and with complex roots chosen so the process exhibits pseudo-cyclic behavior at the rate of one cycle every 12 time points. The autoregressive polynomial for this model is φ(z) = 1 − 1.5z + .75z². The roots of φ(z) are 1 ± i/√3, and θ = tan⁻¹(1/√3) = 2π/12 radians per unit time. To convert the angle to cycles per unit time, divide by 2π to get 1/12 cycles per unit time. The ACF for this model is shown in §3.4, Figure 3.4.

To calculate the roots of the polynomial and solve for arg in R:
1 z = c(1,-1.5,.75) # coefficients of the polynomial

2 (a = polyroot(z)[1]) # print one root: 1+0.57735i = 1 + i/sqrt(3)

3 arg = Arg(a)/(2*pi) # arg in cycles/pt

4 1/arg # = 12, the pseudo period

To reproduce Figure 3.3:
1 set.seed(90210)

2 ar2 = arima.sim(list(order=c(2,0,0), ar=c(1.5,-.75)), n = 144)

3 plot(1:144/12, ar2, type="l", xlab="Time (one unit = 12 points)")

4 abline(v=0:12, lty="dotted", lwd=2)

To calculate and display the ACF for this model:
1 ACF = ARMAacf(ar=c(1.5,-.75), ma=0, 50)

2 plot(ACF, type="h", xlab="lag")

3 abline(h=0)

Fig. 3.3. Simulated AR(2) model, n = 144 with φ1 = 1.5 and φ2 = −.75.

We now exhibit the solution for the general homogeneous difference equation of order p:
$$u_n - \alpha_1 u_{n-1} - \cdots - \alpha_p u_{n-p} = 0, \quad \alpha_p \neq 0, \quad n = p, p+1, \ldots. \qquad (3.38)$$
The associated polynomial is
$$\alpha(z) = 1 - \alpha_1 z - \cdots - \alpha_p z^p.$$
Suppose α(z) has r distinct roots, z1 with multiplicity m1, z2 with multiplicity m2, . . . , and zr with multiplicity mr, such that m1 + m2 + · · · + mr = p. The general solution to the difference equation (3.38) is
$$u_n = z_1^{-n}P_1(n) + z_2^{-n}P_2(n) + \cdots + z_r^{-n}P_r(n), \qquad (3.39)$$
where Pj(n), for j = 1, 2, . . . , r, is a polynomial in n, of degree mj − 1. Given p initial conditions u0, . . . , up−1, we can solve for the Pj(n) explicitly.

Example 3.11 The ψ-weights for an ARMA Model

For a causal ARMA(p, q) model, φ(B)xt = θ(B)wt, where the zeros of φ(z) are outside the unit circle, recall that we may write


$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j},$$

where the ψ-weights are determined using Property 3.1.

For the pure MA(q) model, ψ0 = 1, ψj = θj, for j = 1, . . . , q, and ψj = 0, otherwise. For the general case of ARMA(p, q) models, the task of solving for the ψ-weights is much more complicated, as was demonstrated in Example 3.7. The use of the theory of homogeneous difference equations can help here. To solve for the ψ-weights in general, we must match the coefficients in φ(z)ψ(z) = θ(z):
$$(1 - \phi_1 z - \phi_2 z^2 - \cdots)(\psi_0 + \psi_1 z + \psi_2 z^2 + \cdots) = 1 + \theta_1 z + \theta_2 z^2 + \cdots.$$

The first few values are

ψ0 = 1
ψ1 − φ1ψ0 = θ1
ψ2 − φ1ψ1 − φ2ψ0 = θ2
ψ3 − φ1ψ2 − φ2ψ1 − φ3ψ0 = θ3
⋮

where we would take φj = 0 for j > p, and θj = 0 for j > q. The ψ-weights satisfy the homogeneous difference equation given by
$$\psi_j - \sum_{k=1}^{p}\phi_k\psi_{j-k} = 0, \quad j \ge \max(p, q+1), \qquad (3.40)$$


with initial conditions

$$\psi_j - \sum_{k=1}^{j}\phi_k\psi_{j-k} = \theta_j, \quad 0 \le j < \max(p, q+1). \qquad (3.41)$$
The general solution depends on the roots of the AR polynomial φ(z) = 1 − φ1z − · · · − φpz^p, as seen from (3.40). The specific solution will, of course, depend on the initial conditions.

Consider the ARMA process given in (3.27), xt = .9xt−1 + .5wt−1 + wt. Because max(p, q + 1) = 2, using (3.41), we have ψ0 = 1 and ψ1 = .9 + .5 = 1.4. By (3.40), for j = 2, 3, . . . , the ψ-weights satisfy ψj − .9ψj−1 = 0. The general solution is ψj = c .9^j. To find the specific solution, use the initial condition ψ1 = 1.4, so 1.4 = .9c or c = 1.4/.9. Finally, ψj = 1.4(.9)^{j−1}, for j ≥ 1, as we saw in Example 3.7.

To view, for example, the first 50 ψ-weights in R, use:
1 ARMAtoMA(ar=.9, ma=.5, 50) # for a list

2 plot(ARMAtoMA(ar=.9, ma=.5, 50)) # for a graph

3.4 Autocorrelation and Partial Autocorrelation

We begin by exhibiting the ACF of an MA(q) process, xt = θ(B)wt, where θ(B) = 1 + θ1B + · · · + θqB^q. Because xt is a finite linear combination of white noise terms, the process is stationary with mean
$$E(x_t) = \sum_{j=0}^{q}\theta_j E(w_{t-j}) = 0,$$
where we have written θ0 = 1, and with autocovariance function
$$\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}\Bigl(\sum_{j=0}^{q}\theta_j w_{t+h-j},\; \sum_{k=0}^{q}\theta_k w_{t-k}\Bigr) = \begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}, & 0 \le h \le q\\ 0, & h > q.\end{cases} \qquad (3.42)$$
Recall that γ(h) = γ(−h), so we will only display the values for h ≥ 0. The cutting off of γ(h) after q lags is the signature of the MA(q) model. Dividing (3.42) by γ(0) yields the ACF of an MA(q):
$$\rho(h) = \begin{cases}\dfrac{\sum_{j=0}^{q-h}\theta_j\theta_{j+h}}{1 + \theta_1^2 + \cdots + \theta_q^2}, & 1 \le h \le q\\[6pt] 0, & h > q.\end{cases} \qquad (3.43)$$
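The cutoff property is easy to see numerically; for instance (an illustrative MA(2) with assumed coefficients, not taken from the text):

round(ARMAacf(ma = c(.7, -.3), lag.max = 5), 3)   # ACF of an MA(2): zero beyond lag 2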

For a causal ARMA(p, q) model, φ(B)xt = θ(B)wt, where the zeros of φ(z) are outside the unit circle, write
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j}. \qquad (3.44)$$
It follows immediately that E(xt) = 0. Also, the autocovariance function of xt can be written as
$$\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = \sigma_w^2\sum_{j=0}^{\infty}\psi_j\psi_{j+h}, \quad h \ge 0. \qquad (3.45)$$

We could then use (3.40) and (3.41) to solve for the ψ-weights. In turn, we could solve for γ(h), and the ACF ρ(h) = γ(h)/γ(0). As in Example 3.9, it is also possible to obtain a homogeneous difference equation directly in terms of γ(h). First, we write
$$\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}\Bigl(\sum_{j=1}^{p}\phi_j x_{t+h-j} + \sum_{j=0}^{q}\theta_j w_{t+h-j},\; x_t\Bigr) = \sum_{j=1}^{p}\phi_j\gamma(h-j) + \sigma_w^2\sum_{j=h}^{q}\theta_j\psi_{j-h}, \quad h \ge 0, \qquad (3.46)$$
where we have used the fact that, for h ≥ 0,
$$\mathrm{cov}(w_{t+h-j}, x_t) = \mathrm{cov}\Bigl(w_{t+h-j},\; \sum_{k=0}^{\infty}\psi_k w_{t-k}\Bigr) = \psi_{j-h}\,\sigma_w^2.$$

From (3.46), we can write a general homogeneous equation for the ACF of acausal ARMA process:

γ(h)− φ1γ(h− 1)− · · · − φpγ(h− p) = 0, h ≥ max(p, q + 1), (3.47)

with initial conditions

\[ \gamma(h) - \sum_{j=1}^{p}\phi_j\gamma(h-j) = \sigma_w^2\sum_{j=h}^{q}\theta_j\psi_{j-h}, \qquad 0 \le h < \max(p,\, q+1). \tag{3.48} \]

Dividing (3.47) and (3.48) through by γ(0) will allow us to solve for the ACF, ρ(h) = γ(h)/γ(0).

Example 3.12 The ACF of an AR(p)

In Example 3.9 we considered the case where p = 2. For the general case, it follows immediately from (3.47) that

\[ \rho(h) - \phi_1\rho(h-1) - \cdots - \phi_p\rho(h-p) = 0, \qquad h \ge p. \tag{3.49} \]

Let z1, . . . , zr denote the roots of φ(z), each with multiplicity m1, . . . , mr, respectively, where m1 + ⋯ + mr = p. Then, from (3.39), the general solution is

\[ \rho(h) = z_1^{-h}P_1(h) + z_2^{-h}P_2(h) + \cdots + z_r^{-h}P_r(h), \qquad h \ge p, \tag{3.50} \]

where Pj(h) is a polynomial in h of degree mj − 1.

Recall that for a causal model, all of the roots are outside the unit circle, |zi| > 1, for i = 1, . . . , r. If all the roots are real, then ρ(h) dampens exponentially fast to zero as h → ∞. If some of the roots are complex, then they will be in conjugate pairs and ρ(h) will dampen, in a sinusoidal fashion, exponentially fast to zero as h → ∞. In the case of complex roots, the time series will appear to be cyclic in nature. This, of course, is also true for ARMA models in which the AR part has complex roots.

Example 3.13 The ACF of an ARMA(1, 1)

Consider the ARMA(1, 1) process xt = φx_{t−1} + θw_{t−1} + wt, where |φ| < 1. Based on (3.47), the autocovariance function satisfies

\[ \gamma(h) - \phi\gamma(h-1) = 0, \qquad h = 2, 3, \ldots, \]

and it follows from (3.29)–(3.30) that the general solution is

\[ \gamma(h) = c\,\phi^h, \qquad h = 1, 2, \ldots. \tag{3.51} \]

To obtain the initial conditions, we use (3.48):

\[ \gamma(0) = \phi\gamma(1) + \sigma_w^2[1 + \theta\phi + \theta^2] \quad\text{and}\quad \gamma(1) = \phi\gamma(0) + \sigma_w^2\theta. \]

Solving for γ(0) and γ(1), we obtain:

\[ \gamma(0) = \sigma_w^2\,\frac{1 + 2\theta\phi + \theta^2}{1 - \phi^2} \quad\text{and}\quad \gamma(1) = \sigma_w^2\,\frac{(1 + \theta\phi)(\phi + \theta)}{1 - \phi^2}. \]

To solve for c, note that from (3.51), γ(1) = cφ or c = γ(1)/φ. Hence, the specific solution for h ≥ 1 is

\[ \gamma(h) = \frac{\gamma(1)}{\phi}\,\phi^h = \sigma_w^2\,\frac{(1 + \theta\phi)(\phi + \theta)}{1 - \phi^2}\,\phi^{h-1}. \]

Finally, dividing through by γ(0) yields the ACF

\[ \rho(h) = \frac{(1 + \theta\phi)(\phi + \theta)}{1 + 2\theta\phi + \theta^2}\,\phi^{h-1}, \qquad h \ge 1. \tag{3.52} \]

Notice that the general pattern of ρ(h) in (3.52) is not different from that of an AR(1) given in (3.8). Hence, it is unlikely that we will be able to tell the difference between an ARMA(1,1) and an AR(1) based solely on an ACF estimated from a sample. This consideration will lead us to the partial autocorrelation function.
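This similarity is easy to check numerically; for example, the following R commands compare the two ACFs side by side (the parameter values φ = .9 and θ = .5 are chosen only for illustration):

    ACF.arma11 = ARMAacf(ar=.9, ma=.5, lag.max=20)[-1]   # drop lag 0
    ACF.ar1    = ARMAacf(ar=.9, lag.max=20)[-1]
    par(mfrow=c(1,2))
    plot(ACF.arma11, type="h", xlab="lag", ylim=c(0,1), main="ARMA(1,1)"); abline(h=0)
    plot(ACF.ar1,    type="h", xlab="lag", ylim=c(0,1), main="AR(1)");     abline(h=0)

Both ACFs decay geometrically at the same rate; only the overall scale differs.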


The Partial Autocorrelation Function (PACF)

We have seen in (3.43), for MA(q) models, the ACF will be zero for lags greater than q. Moreover, because θq ≠ 0, the ACF will not be zero at lag q. Thus, the ACF provides a considerable amount of information about the order of the dependence when the process is a moving average process. If the process, however, is ARMA or AR, the ACF alone tells us little about the orders of dependence. Hence, it is worthwhile pursuing a function that will behave like the ACF of MA models, but for AR models, namely, the partial autocorrelation function (PACF).

To motivate the idea, consider a causal AR(1) model, xt = φx_{t−1} + wt. Then,

\[ \gamma_x(2) = \operatorname{cov}(x_t, x_{t-2}) = \operatorname{cov}(\phi x_{t-1} + w_t,\ x_{t-2}) = \operatorname{cov}(\phi^2 x_{t-2} + \phi w_{t-1} + w_t,\ x_{t-2}) = \phi^2\gamma_x(0). \]

This result follows from causality because x_{t−2} involves {w_{t−2}, w_{t−3}, . . .}, which are all uncorrelated with wt and w_{t−1}. The correlation between xt and x_{t−2} is not zero, as it would be for an MA(1), because xt is dependent on x_{t−2} through x_{t−1}. Suppose we break this chain of dependence by removing (or partialling out) the effect of x_{t−1}. That is, we consider the correlation between xt − φx_{t−1} and x_{t−2} − φx_{t−1}, because it is the correlation between xt and x_{t−2} with the linear dependence of each on x_{t−1} removed. In this way, we have broken the dependence chain between xt and x_{t−2}. In fact,

cov(xt − φxt−1, xt−2 − φxt−1) = cov(wt, xt−2 − φxt−1) = 0.

Hence, the tool we need is partial autocorrelation, which is the correlation between xs and xt with the linear effect of everything "in the middle" removed.

To formally define the PACF for mean-zero stationary time series, let x̂_{t+h}, for h ≥ 2, denote the regression³ of x_{t+h} on {x_{t+h−1}, x_{t+h−2}, . . . , x_{t+1}}, which we write as

\[ \hat{x}_{t+h} = \beta_1 x_{t+h-1} + \beta_2 x_{t+h-2} + \cdots + \beta_{h-1} x_{t+1}. \tag{3.53} \]

No intercept term is needed in (3.53) because the mean of xt is zero (otherwise, replace xt by xt − µx in this discussion). In addition, let x̂t denote the regression of xt on {x_{t+1}, x_{t+2}, . . . , x_{t+h−1}}, then

\[ \hat{x}_t = \beta_1 x_{t+1} + \beta_2 x_{t+2} + \cdots + \beta_{h-1} x_{t+h-1}. \tag{3.54} \]

Because of stationarity, the coefficients, β1, . . . , β_{h−1}, are the same in (3.53) and (3.54); we will explain this result in the next section.

³ The term regression here refers to regression in the population sense. That is, x̂_{t+h} is the linear combination of {x_{t+h−1}, x_{t+h−2}, . . . , x_{t+1}} that minimizes the mean squared error E(x_{t+h} − Σ_{j=1}^{h−1} α_j x_{t+j})².


Definition 3.9 The partial autocorrelation function (PACF) of a stationary process, xt, denoted φhh, for h = 1, 2, . . . , is

\[ \phi_{11} = \operatorname{corr}(x_{t+1}, x_t) = \rho(1) \tag{3.55} \]

and

\[ \phi_{hh} = \operatorname{corr}(x_{t+h} - \hat{x}_{t+h},\ x_t - \hat{x}_t), \qquad h \ge 2. \tag{3.56} \]

Both (x_{t+h} − x̂_{t+h}) and (xt − x̂t) are uncorrelated with {x_{t+1}, . . . , x_{t+h−1}}. The PACF, φhh, is the correlation between x_{t+h} and xt with the linear dependence of {x_{t+1}, . . . , x_{t+h−1}} on each, removed. If the process xt is Gaussian, then φhh = corr(x_{t+h}, xt | x_{t+1}, . . . , x_{t+h−1}); that is, φhh is the correlation coefficient between x_{t+h} and xt in the bivariate distribution of (x_{t+h}, xt) conditional on {x_{t+1}, . . . , x_{t+h−1}}.

Example 3.14 The PACF of an AR(1)

Consider the PACF of the AR(1) process given by xt = φx_{t−1} + wt, with |φ| < 1. By definition, φ11 = ρ(1) = φ. To calculate φ22, consider the regression of x_{t+2} on x_{t+1}, say, x̂_{t+2} = βx_{t+1}. We choose β to minimize

\[ E(x_{t+2} - \hat{x}_{t+2})^2 = E(x_{t+2} - \beta x_{t+1})^2 = \gamma(0) - 2\beta\gamma(1) + \beta^2\gamma(0). \]

Taking derivatives with respect to β and setting the result equal to zero, we have β = γ(1)/γ(0) = ρ(1) = φ. Next, consider the regression of xt on x_{t+1}, say x̂t = βx_{t+1}. We choose β to minimize

\[ E(x_t - \hat{x}_t)^2 = E(x_t - \beta x_{t+1})^2 = \gamma(0) - 2\beta\gamma(1) + \beta^2\gamma(0). \]

This is the same equation as before, so β = φ. Hence,

\[ \phi_{22} = \operatorname{corr}(x_{t+2} - \hat{x}_{t+2},\ x_t - \hat{x}_t) = \operatorname{corr}(x_{t+2} - \phi x_{t+1},\ x_t - \phi x_{t+1}) = \operatorname{corr}(w_{t+2},\ x_t - \phi x_{t+1}) = 0 \]

by causality. Thus, φ22 = 0. In the next example, we will see that in this case, φhh = 0 for all h > 1.
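A quick simulation makes the point; the values below (φ = .8, n = 1000) are arbitrary and serve only to illustrate the cut-off of the sample PACF after lag one:

    set.seed(1)
    x = arima.sim(list(order=c(1,0,0), ar=.8), n=1000)   # simulate an AR(1)
    pacf(x, lag.max=10, plot=FALSE)   # lag 1 near .8, higher lags near zero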

Example 3.15 The PACF of an AR(p)

The model implies x_{t+h} = Σ_{j=1}^{p} φj x_{t+h−j} + w_{t+h}, where the roots of φ(z) are outside the unit circle. When h > p, the regression of x_{t+h} on {x_{t+1}, . . . , x_{t+h−1}} is

\[ \hat{x}_{t+h} = \sum_{j=1}^{p} \phi_j x_{t+h-j}. \]

We have not proved this obvious result yet, but we will prove it in the next section. Thus, when h > p,

\[ \phi_{hh} = \operatorname{corr}(x_{t+h} - \hat{x}_{t+h},\ x_t - \hat{x}_t) = \operatorname{corr}(w_{t+h},\ x_t - \hat{x}_t) = 0, \]



because, by causality, xt − x̂t depends only on {w_{t+h−1}, w_{t+h−2}, . . .}; recall equation (3.54). When h ≤ p, φpp is not zero, and φ11, . . . , φ_{p−1,p−1} are not necessarily zero. We will see later that, in fact, φpp = φp. Figure 3.4 shows the ACF and the PACF of the AR(2) model presented in Example 3.10.

Fig. 3.4. The ACF and PACF of an AR(2) model with φ1 = 1.5 and φ2 = −.75.

To reproduce Figure 3.4 in R, use the following commands:

    ACF  = ARMAacf(ar=c(1.5,-.75), ma=0, 24)[-1]
    PACF = ARMAacf(ar=c(1.5,-.75), ma=0, 24, pacf=TRUE)
    par(mfrow=c(1,2))
    plot(ACF,  type="h", xlab="lag", ylim=c(-.8,1)); abline(h=0)
    plot(PACF, type="h", xlab="lag", ylim=c(-.8,1)); abline(h=0)

Example 3.16 The PACF of an Invertible MA(q)

For an invertible MA(q), we can write xt = −Σ_{j=1}^{∞} πj x_{t−j} + wt. Moreover, no finite representation exists. From this result, it should be apparent that the PACF will never cut off, as in the case of an AR(p).

For an MA(1), xt = wt + θw_{t−1}, with |θ| < 1, calculations similar to Example 3.14 will yield φ22 = −θ²/(1 + θ² + θ⁴). For the MA(1) in general, we can show that

\[ \phi_{hh} = -\frac{(-\theta)^h (1 - \theta^2)}{1 - \theta^{2(h+1)}}, \qquad h \ge 1. \]

In the next section, we will discuss methods of calculating the PACF. The PACF for MA models behaves much like the ACF for AR models. Also, the PACF for AR models behaves much like the ACF for MA models. Because an invertible ARMA model has an infinite AR representation, the PACF will not cut off. We may summarize these results in Table 3.1.
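As a check, the closed-form expression above can be compared with the PACF returned by ARMAacf; the value θ = .5 is illustrative:

    theta = .5
    h = 1:10
    phi.hh = -(-theta)^h * (1 - theta^2) / (1 - theta^(2*(h+1)))
    cbind(formula = phi.hh, ARMAacf = ARMAacf(ma=theta, lag.max=10, pacf=TRUE))

The two columns agree, and neither cuts off as h increases.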


Table 3.1. Behavior of the ACF and PACF for ARMA Models

            AR(p)                   MA(q)                   ARMA(p, q)
    ACF     Tails off               Cuts off after lag q    Tails off
    PACF    Cuts off after lag p    Tails off               Tails off

Example 3.17 Preliminary Analysis of the Recruitment Series

We consider the problem of modeling the Recruitment series shown in Figure 1.5. There are 453 months of observed recruitment ranging over the years 1950–1987. The ACF and the PACF given in Figure 3.5 are consistent with the behavior of an AR(2). The ACF has cycles corresponding roughly to a 12-month period, and the PACF has large values for h = 1, 2 and then is essentially zero for higher order lags. Based on Table 3.1, these results suggest that a second-order (p = 2) autoregressive model might provide a good fit. Although we will discuss estimation in detail in §3.6, we ran a regression (see §2.2) using the data triplets {(x; z1, z2) : (x3; x2, x1), (x4; x3, x2), . . . , (x453; x452, x451)} to fit a model of the form

xt = φ0 + φ1xt−1 + φ2xt−2 + wt

for t = 3, 4, . . . , 453. The values of the estimates were φ̂0 = 6.74(1.11), φ̂1 = 1.35(.04), φ̂2 = −.46(.04), and σ̂²_w = 89.72, where the estimated standard errors are in parentheses.

The following R code can be used for this analysis. We use the script acf2 to print and plot the ACF and PACF; see Appendix R for details.

    acf2(rec, 48)   # will produce values and a graphic
    (regr = ar.ols(rec, order=2, demean=FALSE, intercept=TRUE))
    regr$asy.se.coef   # standard errors of the estimates

3.5 Forecasting

In forecasting, the goal is to predict future values of a time series, x_{n+m}, m = 1, 2, . . . , based on the data collected to the present, x = {xn, x_{n−1}, . . . , x1}. Throughout this section, we will assume xt is stationary and the model parameters are known. The problem of forecasting when the model parameters are unknown will be discussed in the next section; also, see Problem 3.26. The minimum mean square error predictor of x_{n+m} is

\[ x_{n+m}^{n} = E(x_{n+m} \mid \boldsymbol{x}) \tag{3.57} \]

because the conditional expectation minimizes the mean square error



Fig. 3.5. ACF and PACF of the Recruitment series. Note that the lag axes are in terms of season (12 months in this case).

\[ E\,[x_{n+m} - g(\boldsymbol{x})]^2, \tag{3.58} \]

where g(x) is a function of the observations x; see Problem 3.14.

First, we will restrict attention to predictors that are linear functions of the data, that is, predictors of the form

\[ x_{n+m}^{n} = \alpha_0 + \sum_{k=1}^{n} \alpha_k x_k, \tag{3.59} \]

where α0, α1, . . . , αn are real numbers. Linear predictors of the form (3.59) that minimize the mean square prediction error (3.58) are called best linear predictors (BLPs). As we shall see, linear prediction depends only on the second-order moments of the process, which are easy to estimate from the data. Much of the material in this section is enhanced by the theoretical material presented in Appendix B. For example, Theorem B.3 states that if the process is Gaussian, minimum mean square error predictors and best linear predictors are the same. The following property, which is based on the Projection Theorem, Theorem B.1 of Appendix B, is a key result.

Property 3.3 Best Linear Prediction for Stationary Processes
Given data x1, . . . , xn, the best linear predictor, x^n_{n+m} = α0 + Σ_{k=1}^{n} αk xk, of x_{n+m}, for m ≥ 1, is found by solving

\[ E\bigl[(x_{n+m} - x_{n+m}^{n})\, x_k\bigr] = 0, \qquad k = 0, 1, \ldots, n, \tag{3.60} \]

where x0 = 1, for α0, α1, . . . , αn.


The equations specified in (3.60) are called the prediction equations, and they are used to solve for the coefficients {α0, α1, . . . , αn}. If E(xt) = µ, the first equation (k = 0) of (3.60) implies

\[ E(x_{n+m}^{n}) = E(x_{n+m}) = \mu. \]

Thus, taking expectation in (3.59), we have

\[ \mu = \alpha_0 + \sum_{k=1}^{n}\alpha_k\mu \quad\text{or}\quad \alpha_0 = \mu\Bigl(1 - \sum_{k=1}^{n}\alpha_k\Bigr). \]

Hence, the form of the BLP is

\[ x_{n+m}^{n} = \mu + \sum_{k=1}^{n}\alpha_k(x_k - \mu). \]

Thus, until we discuss estimation, there is no loss of generality in considering the case that µ = 0, in which case, α0 = 0.

First, consider one-step-ahead prediction. That is, given {x1, . . . , xn}, we wish to forecast the value of the time series at the next time point, x_{n+1}. The BLP of x_{n+1} is of the form

\[ x_{n+1}^{n} = \phi_{n1}x_n + \phi_{n2}x_{n-1} + \cdots + \phi_{nn}x_1, \tag{3.61} \]

where, for purposes that will become clear shortly, we have written αk in (3.59) as φ_{n,n+1−k} in (3.61), for k = 1, . . . , n. Using Property 3.3, the coefficients {φ_{n1}, φ_{n2}, . . . , φ_{nn}} satisfy

\[ E\Bigl[\Bigl(x_{n+1} - \sum_{j=1}^{n}\phi_{nj}x_{n+1-j}\Bigr)x_{n+1-k}\Bigr] = 0, \qquad k = 1, \ldots, n, \]

or

\[ \sum_{j=1}^{n}\phi_{nj}\gamma(k-j) = \gamma(k), \qquad k = 1, \ldots, n. \tag{3.62} \]

The prediction equations (3.62) can be written in matrix notation as

\[ \Gamma_n\boldsymbol{\phi}_n = \boldsymbol{\gamma}_n, \tag{3.63} \]

where Γn = {γ(k − j)}_{j,k=1}^{n} is an n × n matrix, φn = (φ_{n1}, . . . , φ_{nn})′ is an n × 1 vector, and γn = (γ(1), . . . , γ(n))′ is an n × 1 vector.

The matrix Γn is nonnegative definite. If Γn is singular, there are many solutions to (3.63), but, by the Projection Theorem (Theorem B.1), x^n_{n+1} is unique. If Γn is nonsingular, the elements of φn are unique, and are given by

\[ \boldsymbol{\phi}_n = \Gamma_n^{-1}\boldsymbol{\gamma}_n. \tag{3.64} \]


For ARMA models, the fact that σ²_w > 0 and γ(h) → 0 as h → ∞ is enough to ensure that Γn is positive definite (Problem 3.12). It is sometimes convenient to write the one-step-ahead forecast in vector notation

\[ x_{n+1}^{n} = \boldsymbol{\phi}_n'\boldsymbol{x}, \tag{3.65} \]

where x = (xn, x_{n−1}, . . . , x1)′.

The mean square one-step-ahead prediction error is

\[ P_{n+1}^{n} = E(x_{n+1} - x_{n+1}^{n})^2 = \gamma(0) - \boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{\gamma}_n. \tag{3.66} \]

To verify (3.66) using (3.64) and (3.65),

\[
\begin{aligned}
E(x_{n+1} - x_{n+1}^{n})^2 &= E(x_{n+1} - \boldsymbol{\phi}_n'\boldsymbol{x})^2 = E(x_{n+1} - \boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{x})^2\\
&= E(x_{n+1}^2 - 2\boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{x}x_{n+1} + \boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{x}\boldsymbol{x}'\Gamma_n^{-1}\boldsymbol{\gamma}_n)\\
&= \gamma(0) - 2\boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{\gamma}_n + \boldsymbol{\gamma}_n'\Gamma_n^{-1}\Gamma_n\Gamma_n^{-1}\boldsymbol{\gamma}_n\\
&= \gamma(0) - \boldsymbol{\gamma}_n'\Gamma_n^{-1}\boldsymbol{\gamma}_n.
\end{aligned}
\]

Example 3.18 Prediction for an AR(2)

Suppose we have a causal AR(2) process xt = φ1 x_{t−1} + φ2 x_{t−2} + wt, and one observation x1. Then, using equation (3.64), the one-step-ahead prediction of x2 based on x1 is

\[ x_2^1 = \phi_{11}x_1 = \frac{\gamma(1)}{\gamma(0)}x_1 = \rho(1)x_1. \]

Now, suppose we want the one-step-ahead prediction of x3 based on two observations x1 and x2; i.e., x^2_3 = φ21 x2 + φ22 x1. We could use (3.62),

\[
\begin{aligned}
\phi_{21}\gamma(0) + \phi_{22}\gamma(1) &= \gamma(1)\\
\phi_{21}\gamma(1) + \phi_{22}\gamma(0) &= \gamma(2),
\end{aligned}
\]

to solve for φ21 and φ22, or use the matrix form in (3.64) and solve

\[ \begin{pmatrix}\phi_{21}\\ \phi_{22}\end{pmatrix} = \begin{pmatrix}\gamma(0) & \gamma(1)\\ \gamma(1) & \gamma(0)\end{pmatrix}^{-1}\begin{pmatrix}\gamma(1)\\ \gamma(2)\end{pmatrix}, \]

but, it should be apparent from the model that x^2_3 = φ1 x2 + φ2 x1. Because φ1 x2 + φ2 x1 satisfies the prediction equations (3.60),

\[
\begin{aligned}
E\{[x_3 - (\phi_1 x_2 + \phi_2 x_1)]x_1\} &= E(w_3 x_1) = 0,\\
E\{[x_3 - (\phi_1 x_2 + \phi_2 x_1)]x_2\} &= E(w_3 x_2) = 0,
\end{aligned}
\]

it follows that, indeed, x^2_3 = φ1 x2 + φ2 x1, and by the uniqueness of the coefficients in this case, that φ21 = φ1 and φ22 = φ2. Continuing in this way, it is easy to verify that, for n ≥ 2,

\[ x_{n+1}^{n} = \phi_1 x_n + \phi_2 x_{n-1}. \]

That is, φ_{n1} = φ1, φ_{n2} = φ2, and φ_{nj} = 0, for j = 3, 4, . . . , n.


From Example 3.18, it should be clear (Problem 3.40) that, if the time series is a causal AR(p) process, then, for n ≥ p,

\[ x_{n+1}^{n} = \phi_1 x_n + \phi_2 x_{n-1} + \cdots + \phi_p x_{n-p+1}. \tag{3.67} \]

For ARMA models in general, the prediction equations will not be as simple as the pure AR case. In addition, for n large, the use of (3.64) is prohibitive because it requires the inversion of a large matrix. There are, however, iterative solutions that do not require any matrix inversion. In particular, we mention the recursive solution due to Levinson (1947) and Durbin (1960).

Property 3.4 The Durbin–Levinson Algorithm
Equations (3.64) and (3.66) can be solved iteratively as follows:

\[ \phi_{00} = 0, \qquad P_1^0 = \gamma(0). \tag{3.68} \]

For n ≥ 1,

\[ \phi_{nn} = \frac{\rho(n) - \sum_{k=1}^{n-1}\phi_{n-1,k}\,\rho(n-k)}{1 - \sum_{k=1}^{n-1}\phi_{n-1,k}\,\rho(k)}, \qquad P_{n+1}^{n} = P_n^{n-1}(1 - \phi_{nn}^2), \tag{3.69} \]

where, for n ≥ 2,

\[ \phi_{nk} = \phi_{n-1,k} - \phi_{nn}\phi_{n-1,n-k}, \qquad k = 1, 2, \ldots, n-1. \tag{3.70} \]

The proof of Property 3.4 is left as an exercise; see Problem 3.13.

Example 3.19 Using the Durbin–Levinson Algorithm

To use the algorithm, start with φ00 = 0, P^0_1 = γ(0). Then, for n = 1,

\[ \phi_{11} = \rho(1), \qquad P_2^1 = \gamma(0)[1 - \phi_{11}^2]. \]

For n = 2,

\[ \phi_{22} = \frac{\rho(2) - \phi_{11}\,\rho(1)}{1 - \phi_{11}\,\rho(1)}, \qquad \phi_{21} = \phi_{11} - \phi_{22}\phi_{11}, \]
\[ P_3^2 = P_2^1[1 - \phi_{22}^2] = \gamma(0)[1 - \phi_{11}^2][1 - \phi_{22}^2]. \]

For n = 3,

\[ \phi_{33} = \frac{\rho(3) - \phi_{21}\,\rho(2) - \phi_{22}\,\rho(1)}{1 - \phi_{21}\,\rho(1) - \phi_{22}\,\rho(2)}, \]
\[ \phi_{32} = \phi_{22} - \phi_{33}\phi_{21}, \qquad \phi_{31} = \phi_{21} - \phi_{33}\phi_{22}, \]
\[ P_4^3 = P_3^2[1 - \phi_{33}^2] = \gamma(0)[1 - \phi_{11}^2][1 - \phi_{22}^2][1 - \phi_{33}^2], \]

and so on. Note that, in general, the standard error of the one-step-ahead forecast is the square root of

\[ P_{n+1}^{n} = \gamma(0)\prod_{j=1}^{n}[1 - \phi_{jj}^2]. \tag{3.71} \]
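A minimal R implementation of the recursion is sketched below (the function name and its interface are ours; the autocovariances γ(0), . . . , γ(n) are assumed supplied, and the AR(2) used to exercise it is the model of Example 3.10):

    # Durbin-Levinson: returns phi_{n1},...,phi_{nn}, the PACF phi_{11},...,phi_{nn},
    # and the one-step MSE P^n_{n+1}, given gamma = (gamma(0),...,gamma(n))
    durbin.levinson <- function(gamma, n) {
      rho = gamma[-1] / gamma[1]                      # rho(1), ..., rho(n)
      phi = matrix(0, n, n)
      P   = numeric(n + 1); P[1] = gamma[1]           # P^0_1 = gamma(0)
      phi[1, 1] = rho[1]
      P[2] = P[1] * (1 - phi[1, 1]^2)
      if (n > 1) for (k in 2:n) {
        phi[k, k] = (rho[k] - sum(phi[k-1, 1:(k-1)] * rho[(k-1):1])) /
                    (1    - sum(phi[k-1, 1:(k-1)] * rho[1:(k-1)]))
        phi[k, 1:(k-1)] = phi[k-1, 1:(k-1)] - phi[k, k] * phi[k-1, (k-1):1]
        P[k+1] = P[k] * (1 - phi[k, k]^2)
      }
      list(coef = phi[n, ], pacf = diag(phi), mse = P[n+1])
    }
    # check on the AR(2) of Example 3.10; taking gamma(0) = 1 does not affect
    # the coefficients or the PACF, only the scale of the MSE
    gamma = ARMAacf(ar=c(1.5,-.75), lag.max=3)
    durbin.levinson(gamma, 3)$pacf    # approximately .857, -.75, 0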


An important consequence of the Durbin–Levinson algorithm is the following (see Problem 3.13).

Property 3.5 Iterative Solution for the PACF
The PACF of a stationary process xt can be obtained iteratively via (3.69) as φnn, for n = 1, 2, . . . .

Using Property 3.5 and putting n = p in (3.61) and (3.67), it follows that for an AR(p) model,

\[
\begin{aligned}
x_{p+1}^{p} &= \phi_{p1}\,x_p + \phi_{p2}\,x_{p-1} + \cdots + \phi_{pp}\,x_1\\
&= \phi_1\,x_p + \phi_2\,x_{p-1} + \cdots + \phi_p\,x_1.
\end{aligned}
\tag{3.72}
\]

Result (3.72) shows that for an AR(p) model, the partial autocorrelation coefficient at lag p, φpp, is also the last coefficient in the model, φp, as was claimed in Example 3.15.

Example 3.20 The PACF of an AR(2)

We will use the results of Example 3.19 and Property 3.5 to calculate the first three values, φ11, φ22, φ33, of the PACF. Recall from Example 3.9 that ρ(h) − φ1ρ(h−1) − φ2ρ(h−2) = 0 for h ≥ 1. When h = 1, 2, 3, we have ρ(1) = φ1/(1 − φ2), ρ(2) = φ1ρ(1) + φ2, and ρ(3) − φ1ρ(2) − φ2ρ(1) = 0. Thus,

\[
\begin{aligned}
\phi_{11} &= \rho(1) = \frac{\phi_1}{1-\phi_2}\\[4pt]
\phi_{22} &= \frac{\rho(2)-\rho(1)^2}{1-\rho(1)^2}
          = \frac{\bigl[\phi_1\bigl(\tfrac{\phi_1}{1-\phi_2}\bigr)+\phi_2\bigr]-\bigl(\tfrac{\phi_1}{1-\phi_2}\bigr)^2}{1-\bigl(\tfrac{\phi_1}{1-\phi_2}\bigr)^2} = \phi_2\\[4pt]
\phi_{21} &= \rho(1)[1-\phi_2] = \phi_1\\[4pt]
\phi_{33} &= \frac{\rho(3)-\phi_1\rho(2)-\phi_2\rho(1)}{1-\phi_1\rho(1)-\phi_2\rho(2)} = 0.
\end{aligned}
\]

Notice that, as shown in (3.72), φ22 = φ2 for an AR(2) model.

So far, we have concentrated on one-step-ahead prediction, but Property 3.3 allows us to calculate the BLP of x_{n+m} for any m ≥ 1. Given data, {x1, . . . , xn}, the m-step-ahead predictor is

\[ x_{n+m}^{n} = \phi_{n1}^{(m)}x_n + \phi_{n2}^{(m)}x_{n-1} + \cdots + \phi_{nn}^{(m)}x_1, \tag{3.73} \]

where {φ^{(m)}_{n1}, φ^{(m)}_{n2}, . . . , φ^{(m)}_{nn}} satisfy the prediction equations,

\[ \sum_{j=1}^{n}\phi_{nj}^{(m)}E(x_{n+1-j}x_{n+1-k}) = E(x_{n+m}x_{n+1-k}), \qquad k = 1, \ldots, n, \]

or

\[ \sum_{j=1}^{n}\phi_{nj}^{(m)}\gamma(k-j) = \gamma(m+k-1), \qquad k = 1, \ldots, n. \tag{3.74} \]

The prediction equations can again be written in matrix notation as

\[ \Gamma_n\boldsymbol{\phi}_n^{(m)} = \boldsymbol{\gamma}_n^{(m)}, \tag{3.75} \]

where γ^{(m)}_n = (γ(m), . . . , γ(m + n − 1))′ and φ^{(m)}_n = (φ^{(m)}_{n1}, . . . , φ^{(m)}_{nn})′ are n × 1 vectors.

The mean square m-step-ahead prediction error is

\[ P_{n+m}^{n} = E\bigl(x_{n+m} - x_{n+m}^{n}\bigr)^2 = \gamma(0) - \boldsymbol{\gamma}_n^{(m)\prime}\Gamma_n^{-1}\boldsymbol{\gamma}_n^{(m)}. \tag{3.76} \]

Another useful algorithm for calculating forecasts was given by Brockwell and Davis (1991, Chapter 5). This algorithm follows directly from applying the projection theorem (Theorem B.1) to the innovations, xt − x^{t−1}_t, for t = 1, . . . , n, using the fact that the innovations xt − x^{t−1}_t and xs − x^{s−1}_s are uncorrelated for s ≠ t (see Problem 3.41). We present the case in which xt is a mean-zero stationary time series.

Property 3.6 The Innovations Algorithm
The one-step-ahead predictors, x^t_{t+1}, and their mean-squared errors, P^t_{t+1}, can be calculated iteratively as

\[ x_1^0 = 0, \qquad P_1^0 = \gamma(0), \]
\[ x_{t+1}^{t} = \sum_{j=1}^{t}\theta_{tj}\bigl(x_{t+1-j} - x_{t+1-j}^{t-j}\bigr), \qquad t = 1, 2, \ldots \tag{3.77} \]
\[ P_{t+1}^{t} = \gamma(0) - \sum_{j=0}^{t-1}\theta_{t,t-j}^2 P_{j+1}^{j}, \qquad t = 1, 2, \ldots, \tag{3.78} \]

where, for j = 0, 1, . . . , t − 1,

\[ \theta_{t,t-j} = \Bigl(\gamma(t-j) - \sum_{k=0}^{j-1}\theta_{j,j-k}\theta_{t,t-k}P_{k+1}^{k}\Bigr)\Big/P_{j+1}^{j}. \tag{3.79} \]

Given data x1, . . . , xn, the innovations algorithm can be calculated successively for t = 1, then t = 2 and so on, in which case the calculation of x^n_{n+1} and P^n_{n+1} is made at the final step t = n. The m-step-ahead predictor and its mean-square error based on the innovations algorithm (Problem 3.41) are given by

\[ x_{n+m}^{n} = \sum_{j=m}^{n+m-1}\theta_{n+m-1,j}\bigl(x_{n+m-j} - x_{n+m-j}^{n+m-j-1}\bigr), \tag{3.80} \]
\[ P_{n+m}^{n} = \gamma(0) - \sum_{j=m}^{n+m-1}\theta_{n+m-1,j}^2 P_{n+m-j}^{n+m-j-1}, \tag{3.81} \]

where the θn+m−1,j are obtained by continued iteration of (3.79).


Example 3.21 Prediction for an MA(1)

The innovations algorithm lends itself well to prediction for moving average processes. Consider an MA(1) model, xt = wt + θw_{t−1}. Recall that γ(0) = (1 + θ²)σ²_w, γ(1) = θσ²_w, and γ(h) = 0 for h > 1. Then, using Property 3.6, we have

\[
\begin{aligned}
\theta_{n1} &= \theta\sigma_w^2/P_n^{n-1},\\
\theta_{nj} &= 0, \qquad j = 2, \ldots, n,\\
P_1^0 &= (1 + \theta^2)\sigma_w^2,\\
P_{n+1}^{n} &= (1 + \theta^2 - \theta\theta_{n1})\sigma_w^2.
\end{aligned}
\]

Finally, from (3.77), the one-step-ahead predictor is

\[ x_{n+1}^{n} = \theta\bigl(x_n - x_n^{n-1}\bigr)\sigma_w^2/P_n^{n-1}. \]

Forecasting ARMA Processes

The general prediction equations (3.60) provide little insight into forecasting for ARMA models in general. There are a number of different ways to express these forecasts, and each aids in understanding the special structure of ARMA prediction. Throughout, we assume xt is a causal and invertible ARMA(p, q) process, φ(B)xt = θ(B)wt, where wt ∼ iid N(0, σ²_w). In the non-zero mean case, E(xt) = µx, simply replace xt with xt − µx in the model. First, we consider two types of forecasts. We write x^n_{n+m} to mean the minimum mean square error predictor of x_{n+m} based on the data {xn, . . . , x1}, that is,

\[ x_{n+m}^{n} = E(x_{n+m} \mid x_n, \ldots, x_1). \]

For ARMA models, it is easier to calculate the predictor of x_{n+m}, assuming we have the complete history of the process {xn, x_{n−1}, . . . , x1, x0, x_{−1}, . . .}. We will denote the predictor of x_{n+m} based on the infinite past as

\[ \tilde{x}_{n+m} = E(x_{n+m} \mid x_n, x_{n-1}, \ldots, x_1, x_0, x_{-1}, \ldots). \]

In general, x^n_{n+m} and x̃_{n+m} are not the same, but the idea here is that, for large samples, x̃_{n+m} will provide a good approximation to x^n_{n+m}.

Now, write xn+m in its causal and invertible forms:

\[ x_{n+m} = \sum_{j=0}^{\infty}\psi_j w_{n+m-j}, \qquad \psi_0 = 1, \tag{3.82} \]
\[ w_{n+m} = \sum_{j=0}^{\infty}\pi_j x_{n+m-j}, \qquad \pi_0 = 1. \tag{3.83} \]


Then, taking conditional expectations in (3.82), we have

\[ \tilde{x}_{n+m} = \sum_{j=0}^{\infty}\psi_j\tilde{w}_{n+m-j} = \sum_{j=m}^{\infty}\psi_j w_{n+m-j}, \tag{3.84} \]

because, by causality and invertibility,

\[ \tilde{w}_t = E(w_t \mid x_n, x_{n-1}, \ldots, x_0, x_{-1}, \ldots) = \begin{cases} 0, & t > n\\ w_t, & t \le n. \end{cases} \]

Similarly, taking conditional expectations in (3.83), we have

\[ 0 = \tilde{x}_{n+m} + \sum_{j=1}^{\infty}\pi_j\tilde{x}_{n+m-j}, \]

or

\[ \tilde{x}_{n+m} = -\sum_{j=1}^{m-1}\pi_j\tilde{x}_{n+m-j} - \sum_{j=m}^{\infty}\pi_j x_{n+m-j}, \tag{3.85} \]

using the fact that E(xt | xn, x_{n−1}, . . . , x0, x_{−1}, . . .) = xt, for t ≤ n. Prediction is accomplished recursively using (3.85), starting with the one-step-ahead predictor, m = 1, and then continuing for m = 2, 3, . . . . Using (3.84), we can write

\[ x_{n+m} - \tilde{x}_{n+m} = \sum_{j=0}^{m-1}\psi_j w_{n+m-j}, \]

so the mean-square prediction error can be written as

\[ P_{n+m}^{n} = E(x_{n+m} - \tilde{x}_{n+m})^2 = \sigma_w^2\sum_{j=0}^{m-1}\psi_j^2. \tag{3.86} \]

Also, we note, for a fixed sample size, n, the prediction errors are correlated. That is, for k ≥ 1,

\[ E\{(x_{n+m} - \tilde{x}_{n+m})(x_{n+m+k} - \tilde{x}_{n+m+k})\} = \sigma_w^2\sum_{j=0}^{m-1}\psi_j\psi_{j+k}. \tag{3.87} \]

Example 3.22 Long-Range Forecasts

Consider forecasting an ARMA process with mean µx. Replacing x_{n+m} with x_{n+m} − µx in (3.82), and taking conditional expectation as in (3.84), we deduce that the m-step-ahead forecast can be written as

\[ \tilde{x}_{n+m} = \mu_x + \sum_{j=m}^{\infty}\psi_j w_{n+m-j}. \tag{3.88} \]


Noting that the ψ-weights dampen to zero exponentially fast, it is clear that

\[ \tilde{x}_{n+m} \to \mu_x \tag{3.89} \]

exponentially fast (in the mean square sense) as m → ∞. Moreover, by (3.86), the mean square prediction error

\[ P_{n+m}^{n} \to \sigma_w^2\sum_{j=0}^{\infty}\psi_j^2 = \gamma_x(0) = \sigma_x^2, \tag{3.90} \]

exponentially fast as m → ∞; recall (3.45).

It should be clear from (3.89) and (3.90) that ARMA forecasts quickly settle to the mean with a constant prediction error as the forecast horizon, m, grows. This effect can be seen in Figure 3.6, where the Recruitment series is forecast for 24 months; see Example 3.24.

When n is small, the general prediction equations (3.60) can be used easily. When n is large, we would use (3.85) by truncating, because we do not observe x0, x_{−1}, x_{−2}, . . . , and only the data x1, x2, . . . , xn are available. In this case, we can truncate (3.85) by setting Σ_{j=n+m}^{∞} πj x_{n+m−j} = 0. The truncated predictor is then written as

\[ \tilde{x}_{n+m}^{n} = -\sum_{j=1}^{m-1}\pi_j\tilde{x}_{n+m-j}^{n} - \sum_{j=m}^{n+m-1}\pi_j x_{n+m-j}, \tag{3.91} \]

which is also calculated recursively, m = 1, 2, . . . . The mean square prediction error, in this case, is approximated using (3.86).

For AR(p) models, and when n > p, equation (3.67) yields the exact predictor, x^n_{n+m}, of x_{n+m}, and there is no need for approximations. That is, for n > p, x^n_{n+m} = x̃_{n+m} = x̃^n_{n+m}. Also, in this case, the one-step-ahead prediction error is E(x_{n+1} − x^n_{n+1})² = σ²_w. For pure MA(q) or ARMA(p, q) models, truncated prediction has a fairly nice form.

Property 3.7 Truncated Prediction for ARMA
For ARMA(p, q) models, the truncated predictors for m = 1, 2, . . . , are

\[ \tilde{x}_{n+m}^{n} = \phi_1\tilde{x}_{n+m-1}^{n} + \cdots + \phi_p\tilde{x}_{n+m-p}^{n} + \theta_1\tilde{w}_{n+m-1}^{n} + \cdots + \theta_q\tilde{w}_{n+m-q}^{n}, \tag{3.92} \]

where x̃^n_t = xt for 1 ≤ t ≤ n and x̃^n_t = 0 for t ≤ 0. The truncated prediction errors are given by: w̃^n_t = 0 for t ≤ 0 or t > n, and

\[ \tilde{w}_t^{n} = \phi(B)\tilde{x}_t^{n} - \theta_1\tilde{w}_{t-1}^{n} - \cdots - \theta_q\tilde{w}_{t-q}^{n} \]

for 1 ≤ t ≤ n.


Example 3.23 Forecasting an ARMA(1, 1) Series

Given data x1, . . . , xn, for forecasting purposes, write the model as

xn+1 = φxn + wn+1 + θwn.

Then, based on (3.92), the one-step-ahead truncated forecast is

\[ \tilde{x}_{n+1}^{n} = \phi x_n + 0 + \theta\tilde{w}_n^{n}. \]

For m ≥ 2, we have

\[ \tilde{x}_{n+m}^{n} = \phi\tilde{x}_{n+m-1}^{n}, \]

which can be calculated recursively, m = 2, 3, . . . .

To calculate w̃^n_n, which is needed to initialize the successive forecasts, the model can be written as wt = xt − φx_{t−1} − θw_{t−1} for t = 1, . . . , n. For truncated forecasting using (3.92), put w̃^n_0 = 0, x0 = 0, and then iterate the errors forward in time

\[ \tilde{w}_t^{n} = x_t - \phi x_{t-1} - \theta\tilde{w}_{t-1}^{n}, \qquad t = 1, \ldots, n. \]

The approximate forecast variance is computed from (3.86) using the ψ-weights determined as in Example 3.11. In particular, the ψ-weights satisfy ψj = (φ + θ)φ^{j−1}, for j ≥ 1. This result gives

\[ P_{n+m}^{n} = \sigma_w^2\Bigl[1 + (\phi+\theta)^2\sum_{j=1}^{m-1}\phi^{2(j-1)}\Bigr] = \sigma_w^2\Bigl[1 + \frac{(\phi+\theta)^2\bigl(1-\phi^{2(m-1)}\bigr)}{1-\phi^2}\Bigr]. \]

To assess the precision of the forecasts, prediction intervals are typically calculated along with the forecasts. In general, (1 − α) prediction intervals are of the form

\[ x_{n+m}^{n} \pm c_{\alpha/2}\sqrt{P_{n+m}^{n}}, \tag{3.93} \]

where c_{α/2} is chosen to get the desired degree of confidence. For example, if the process is Gaussian, then choosing c_{α/2} = 2 will yield an approximate 95% prediction interval for x_{n+m}. If we are interested in establishing prediction intervals over more than one time period, then c_{α/2} should be adjusted appropriately, for example, by using Bonferroni's inequality [see (4.55) in Chapter 4 or Johnson and Wichern, 1992, Chapter 5].

Example 3.24 Forecasting the Recruitment Series

Using the parameter estimates as the actual parameter values, Figure 3.6 shows the result of forecasting the Recruitment series given in Example 3.17 over a 24-month horizon, m = 1, 2, . . . , 24. The actual forecasts are calculated as

\[ x_{n+m}^{n} = 6.74 + 1.35\,x_{n+m-1}^{n} - .46\,x_{n+m-2}^{n} \]

for n = 453 and m = 1, 2, . . . , 24. Recall that x^s_t = xt when t ≤ s. The forecast errors P^n_{n+m} are calculated using (3.86). Recall that σ̂²_w = 89.72,


Fig. 3.6. Twenty-four month forecasts for the Recruitment series. The actual data shown are from about January 1980 to September 1987, and then the forecasts plus and minus one standard error are displayed.

and using (3.40) from Example 3.11, we have ψj = 1.35ψ_{j−1} − .46ψ_{j−2} for j ≥ 2, where ψ0 = 1 and ψ1 = 1.35. Thus, for n = 453,

\[
\begin{aligned}
P_{n+1}^{n} &= 89.72,\\
P_{n+2}^{n} &= 89.72\,(1 + 1.35^2),\\
P_{n+3}^{n} &= 89.72\,(1 + 1.35^2 + [1.35^2 - .46]^2),
\end{aligned}
\]

and so on.

Note how the forecast levels off quickly and the prediction intervals are wide, even though in this case the forecast limits are only based on one standard error; that is, x^n_{n+m} ± √(P^n_{n+m}).

To reproduce the analysis and Figure 3.6, use the following commands:

    regr = ar.ols(rec, order=2, demean=FALSE, intercept=TRUE)
    fore = predict(regr, n.ahead=24)
    ts.plot(rec, fore$pred, col=1:2, xlim=c(1980,1990), ylab="Recruitment")
    lines(fore$pred, type="p", col=2)
    lines(fore$pred+fore$se, lty="dashed", col=4)
    lines(fore$pred-fore$se, lty="dashed", col=4)

We complete this section with a brief discussion of backcasting. In backcasting, we want to predict x_{1−m}, for m = 1, 2, . . . , based on the data {x1, . . . , xn}. Write the backcast as


\[ x_{1-m}^{n} = \sum_{j=1}^{n}\alpha_j x_j. \tag{3.94} \]

Analogous to (3.74), the prediction equations (assuming µx = 0) are

\[ \sum_{j=1}^{n}\alpha_j E(x_j x_k) = E(x_{1-m}x_k), \qquad k = 1, \ldots, n, \tag{3.95} \]

or

\[ \sum_{j=1}^{n}\alpha_j\gamma(k-j) = \gamma(m+k-1), \qquad k = 1, \ldots, n. \tag{3.96} \]

These equations are precisely the prediction equations for forward prediction. That is, αj ≡ φ^{(m)}_{nj}, for j = 1, . . . , n, where the φ^{(m)}_{nj} are given by (3.75). Finally, the backcasts are given by

\[ x_{1-m}^{n} = \phi_{n1}^{(m)}x_1 + \cdots + \phi_{nn}^{(m)}x_n, \qquad m = 1, 2, \ldots. \tag{3.97} \]

Example 3.25 Backcasting an ARMA(1, 1)

Consider an ARMA(1, 1) process, xt = φx_{t−1} + θw_{t−1} + wt; we will call this the forward model. We have just seen that best linear prediction backward in time is the same as best linear prediction forward in time for stationary models. Because we are assuming ARMA models are Gaussian, we also have that minimum mean square error prediction backward in time is the same as forward in time for ARMA models.⁴ Thus, the process can equivalently be generated by the backward model,

\[ x_t = \phi x_{t+1} + \theta v_{t+1} + v_t, \]

where {vt} is a Gaussian white noise process with variance σ²_w. We may write xt = Σ_{j=0}^{∞} ψj v_{t+j}, where ψ0 = 1; this means that xt is uncorrelated with {v_{t−1}, v_{t−2}, . . .}, in analogy to the forward model.

Given data {x1, . . . , xn}, truncate ṽ^n_n = E(vn | x1, . . . , xn) to zero and then iterate backward. That is, put ṽ^n_n = 0, as an initial approximation, and then generate the errors backward

\[ \tilde{v}_t^{n} = x_t - \phi x_{t+1} - \theta\tilde{v}_{t+1}^{n}, \qquad t = (n-1), (n-2), \ldots, 1. \]

Then,

\[ \tilde{x}_0^{n} = \phi x_1 + \theta\tilde{v}_1^{n} + \tilde{v}_0^{n} = \phi x_1 + \theta\tilde{v}_1^{n}, \]

because ṽ^n_t = 0 for t ≤ 0. Continuing, the general truncated backcasts are given by

\[ \tilde{x}_{1-m}^{n} = \phi\tilde{x}_{2-m}^{n}, \qquad m = 2, 3, \ldots. \]

⁴ In the stationary Gaussian case, (a) the distribution of {x_{n+1}, xn, . . . , x1} is the same as (b) the distribution of {x0, x1, . . . , xn}. In forecasting we use (a) to obtain E(x_{n+1} | xn, . . . , x1); in backcasting we use (b) to obtain E(x0 | x1, . . . , xn). Because (a) and (b) are the same, the two problems are equivalent.
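A short sketch of these truncated backcasts in R (the parameter values and the simulated series are illustrative only):

    set.seed(2)
    phi = .8; theta = .4
    x = arima.sim(list(order=c(1,0,1), ar=phi, ma=theta), n=100)
    n = length(x)
    # backward errors, iterated with the last one set to zero
    v = numeric(n)                      # v[n] = 0
    for (t in (n-1):1) v[t] = x[t] - phi*x[t+1] - theta*v[t+1]
    # truncated backcasts of x_0, x_{-1}, ..., x_{-4}
    xb = numeric(5)
    xb[1] = phi*x[1] + theta*v[1]       # backcast of x_0
    for (m in 2:5) xb[m] = phi*xb[m-1]
    xb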


3.6 Estimation

Throughout this section, we assume we have n observations, x1, . . . , xn, from a causal and invertible Gaussian ARMA(p, q) process in which, initially, the order parameters, p and q, are known. Our goal is to estimate the parameters, φ1, . . . , φp, θ1, . . . , θq, and σ²_w. We will discuss the problem of determining p and q later in this section.

We begin with method of moments estimators. The idea behind these estimators is that of equating population moments to sample moments and then solving for the parameters in terms of the sample moments. We immediately see that, if E(xt) = µ, then the method of moments estimator of µ is the sample average, x̄. Thus, while discussing method of moments, we will assume µ = 0. Although the method of moments can produce good estimators, they can sometimes lead to suboptimal estimators. We first consider the case in which the method leads to optimal (efficient) estimators, that is, AR(p) models.

When the process is AR(p),

xt = φ1xt−1 + · · ·+ φpxt−p + wt,

the first p+ 1 equations of (3.47) and (3.48) lead to the following:

Definition 3.10 The Yule–Walker equations are given by

\[ \gamma(h) = \phi_1\gamma(h-1) + \cdots + \phi_p\gamma(h-p), \qquad h = 1, 2, \ldots, p, \tag{3.98} \]
\[ \sigma_w^2 = \gamma(0) - \phi_1\gamma(1) - \cdots - \phi_p\gamma(p). \tag{3.99} \]

In matrix notation, the Yule–Walker equations are

\[ \Gamma_p\boldsymbol{\phi} = \boldsymbol{\gamma}_p, \qquad \sigma_w^2 = \gamma(0) - \boldsymbol{\phi}'\boldsymbol{\gamma}_p, \tag{3.100} \]

where Γp = {γ(k − j)}_{j,k=1}^{p} is a p × p matrix, φ = (φ1, . . . , φp)′ is a p × 1 vector, and γp = (γ(1), . . . , γ(p))′ is a p × 1 vector. Using the method of moments, we replace γ(h) in (3.100) by γ̂(h) [see equation (1.34)] and solve

\[ \hat{\boldsymbol{\phi}} = \hat{\Gamma}_p^{-1}\hat{\boldsymbol{\gamma}}_p, \qquad \hat{\sigma}_w^2 = \hat{\gamma}(0) - \hat{\boldsymbol{\gamma}}_p'\hat{\Gamma}_p^{-1}\hat{\boldsymbol{\gamma}}_p. \tag{3.101} \]

These estimators are typically called the Yule–Walker estimators. For calculation purposes, it is sometimes more convenient to work with the sample ACF. By factoring γ̂(0) in (3.101), we can write the Yule–Walker estimates as

\[ \hat{\boldsymbol{\phi}} = \hat{R}_p^{-1}\hat{\boldsymbol{\rho}}_p, \qquad \hat{\sigma}_w^2 = \hat{\gamma}(0)\bigl[1 - \hat{\boldsymbol{\rho}}_p'\hat{R}_p^{-1}\hat{\boldsymbol{\rho}}_p\bigr], \tag{3.102} \]

where R̂p = {ρ̂(k − j)}_{j,k=1}^{p} is a p × p matrix and ρ̂p = (ρ̂(1), . . . , ρ̂(p))′ is a p × 1 vector.

For AR(p) models, if the sample size is large, the Yule–Walker estimators are approximately normally distributed, and σ̂²_w is close to the true value of σ²_w. We state these results in Property 3.8; for details, see Appendix B, §B.3.


Property 3.8 Large Sample Results for Yule–Walker Estimators
The asymptotic (n → ∞) behavior of the Yule–Walker estimators in the case of causal AR(p) processes is as follows:

\[ \sqrt{n}\,\bigl(\hat{\boldsymbol{\phi}} - \boldsymbol{\phi}\bigr) \;\xrightarrow{d}\; N\bigl(\boldsymbol{0},\ \sigma_w^2\Gamma_p^{-1}\bigr), \qquad \hat{\sigma}_w^2 \;\xrightarrow{p}\; \sigma_w^2. \tag{3.103} \]

The Durbin–Levinson algorithm, (3.68)–(3.70), can be used to calculate φ̂ without inverting Γ̂p or R̂p, by replacing γ(h) by γ̂(h) in the algorithm. In running the algorithm, we will iteratively calculate the h × 1 vector, φ̂h = (φ̂_{h1}, . . . , φ̂_{hh})′, for h = 1, 2, . . . . Thus, in addition to obtaining the desired forecasts, the Durbin–Levinson algorithm yields φ̂_{hh}, the sample PACF. Using (3.103), we can show the following property.

Property 3.9 Large Sample Distribution of the PACF
For a causal AR(p) process, asymptotically (n → ∞),

\[ \sqrt{n}\,\hat{\phi}_{hh} \;\xrightarrow{d}\; N(0, 1), \qquad \text{for } h > p. \tag{3.104} \]

Example 3.26 Yule–Walker Estimation for an AR(2) Process

The data shown in Figure 3.3 were n = 144 simulated observations from the AR(2) model

\[ x_t = 1.5x_{t-1} - .75x_{t-2} + w_t, \]

where wt ∼ iid N(0, 1). For these data, γ̂(0) = 8.903, ρ̂(1) = .849, and ρ̂(2) = .519. Thus,

\[ \hat{\boldsymbol{\phi}} = \begin{pmatrix}\hat{\phi}_1\\ \hat{\phi}_2\end{pmatrix} = \begin{bmatrix}1 & .849\\ .849 & 1\end{bmatrix}^{-1}\begin{pmatrix}.849\\ .519\end{pmatrix} = \begin{pmatrix}1.463\\ -.723\end{pmatrix} \]

and

\[ \hat{\sigma}_w^2 = 8.903\Bigl[1 - (.849,\ .519)\begin{pmatrix}1.463\\ -.723\end{pmatrix}\Bigr] = 1.187. \]

By Property 3.8, the asymptotic variance–covariance matrix of φ̂,

\[ \frac{1}{144}\,\frac{1.187}{8.903}\begin{bmatrix}1 & .849\\ .849 & 1\end{bmatrix}^{-1} = \begin{bmatrix}.058^2 & -.003\\ -.003 & .058^2\end{bmatrix}, \]

can be used to get confidence regions for, or make inferences about φ̂ and its components. For example, an approximate 95% confidence interval for φ2 is −.723 ± 2(.058), or (−.838, −.608), which contains the true value of φ2 = −.75.

For these data, the first three sample partial autocorrelations are φ̂11 = ρ̂(1) = .849, φ̂22 = φ̂2 = −.721, and φ̂33 = −.085. According to Property 3.9, the asymptotic standard error of φ̂33 is 1/√144 = .083, and the observed value, −.085, is only about one standard deviation from φ33 = 0.


Example 3.27 Yule–Walker Estimation of the Recruitment Series

In Example 3.17 we fit an AR(2) model to the Recruitment series using regression. Below are the results of fitting the same model using Yule–Walker estimation in R, which are nearly identical to the values in Example 3.17.

    rec.yw = ar.yw(rec, order=2)
    rec.yw$x.mean    # = 62.26 (mean estimate)
    rec.yw$ar        # = 1.33, -.44 (parameter estimates)
    sqrt(diag(rec.yw$asy.var.coef))   # = .04, .04 (standard errors)
    rec.yw$var.pred  # = 94.80 (error variance estimate)

To obtain the 24-month-ahead predictions and their standard errors, and then plot the results as in Example 3.24, use the R commands:

    rec.pr = predict(rec.yw, n.ahead=24)
    U = rec.pr$pred + rec.pr$se
    L = rec.pr$pred - rec.pr$se
    minx = min(rec,L); maxx = max(rec,U)
    ts.plot(rec, rec.pr$pred, xlim=c(1980,1990), ylim=c(minx,maxx))
    lines(rec.pr$pred, col="red", type="o")
    lines(U, col="blue", lty="dashed")
    lines(L, col="blue", lty="dashed")

In the case of AR(p) models, the Yule–Walker estimators given in (3.102) are optimal in the sense that the asymptotic distribution, (3.103), is the best asymptotic normal distribution. This is because, given initial conditions, AR(p) models are linear models, and the Yule–Walker estimators are essentially least squares estimators. If we use method of moments for MA or ARMA models, we will not get optimal estimators because such processes are nonlinear in the parameters.

Example 3.28 Method of Moments Estimation for an MA(1)

Consider the time series

\[ x_t = w_t + \theta w_{t-1}, \]

where |θ| < 1. The model can then be written as

\[ x_t = \sum_{j=1}^{\infty}(-\theta)^j x_{t-j} + w_t, \]

which is nonlinear in θ. The first two population autocovariances are γ(0) = σ²_w(1 + θ²) and γ(1) = σ²_w θ, so the estimate of θ is found by solving:

\[ \hat{\rho}(1) = \frac{\hat{\gamma}(1)}{\hat{\gamma}(0)} = \frac{\hat{\theta}}{1 + \hat{\theta}^2}. \]

Two solutions exist, so we would pick the invertible one. If |ρ̂(1)| ≤ ½, the solutions are real, otherwise, a real solution does not exist. Even though |ρ(1)| < ½ for an invertible MA(1), it may happen that |ρ̂(1)| ≥ ½ because it is an estimator. For example, the following simulation in R produces a value of ρ̂(1) = .507 when the true value is ρ(1) = .9/(1 + .9²) = .497.


    set.seed(2)
    ma1 = arima.sim(list(order = c(0,0,1), ma = 0.9), n = 50)
    acf(ma1, plot=FALSE)[1]   # = .507 (lag 1 sample ACF)

When |ρ̂(1)| < ½, the invertible estimate is

\[ \hat{\theta} = \frac{1 - \sqrt{1 - 4\hat{\rho}(1)^2}}{2\hat{\rho}(1)}. \]

It can be shown that⁵

\[ \hat{\theta} \sim \mathrm{AN}\Bigl(\theta,\ \frac{1 + \theta^2 + 4\theta^4 + \theta^6 + \theta^8}{n(1 - \theta^2)^2}\Bigr); \]

AN is read asymptotically normal and is defined in Definition A.5 of Appendix A. The maximum likelihood estimator (which we discuss next) of θ, in this case, has an asymptotic variance of (1 − θ²)/n. When θ = .5, for example, the ratio of the asymptotic variance of the method of moments estimator to the maximum likelihood estimator of θ is about 3.5. That is, for large samples, the variance of the method of moments estimator is about 3.5 times larger than the variance of the MLE of θ when θ = .5.

⁵ The result follows from Theorem A.7 given in Appendix A and the delta method. See the proof of Theorem A.7 for details on the delta method.
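A small helper makes the point computationally; it simply applies the formula above (the name mom.ma1 is ours, not part of any package):

    # method of moments estimate of theta from a lag-one sample ACF value r
    mom.ma1 <- function(r) {
      if (abs(r) >= .5) return(NA)    # no real, invertible solution
      (1 - sqrt(1 - 4*r^2)) / (2*r)
    }
    mom.ma1(.497)   # about .90, recovering theta at the true value of rho(1)
    mom.ma1(.507)   # NA: the sample value from the simulation above exceeds 1/2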

Maximum Likelihood and Least Squares Estimation

To fix ideas, we first focus on the causal AR(1) case. Let

xt = µ+ φ(xt−1 − µ) + wt (3.105)

where |φ| < 1 and wt ∼ iid N(0, σ²_w). Given data x1, x2, . . . , xn, we seek the likelihood

\[ L(\mu, \phi, \sigma_w^2) = f(x_1, x_2, \ldots, x_n \mid \mu, \phi, \sigma_w^2). \]

In the case of an AR(1), we may write the likelihood as

\[ L(\mu, \phi, \sigma_w^2) = f(x_1)\,f(x_2 \mid x_1)\cdots f(x_n \mid x_{n-1}), \]

where we have dropped the parameters in the densities, f(·), to ease the notation. Because xt | x_{t−1} ∼ N(µ + φ(x_{t−1} − µ), σ²_w), we have

\[ f(x_t \mid x_{t-1}) = f_w[(x_t - \mu) - \phi(x_{t-1} - \mu)], \]

where fw(·) is the density of wt, that is, the normal density with mean zero and variance σ²_w. We may then write the likelihood as

\[ L(\mu, \phi, \sigma_w^2) = f(x_1)\prod_{t=2}^{n}f_w[(x_t - \mu) - \phi(x_{t-1} - \mu)]. \]


To find f(x1), we can use the causal representation

\[ x_1 = \mu + \sum_{j=0}^{\infty}\phi^j w_{1-j} \]

to see that x1 is normal, with mean µ and variance σ²_w/(1 − φ²). Finally, for an AR(1), the likelihood is

\[ L(\mu, \phi, \sigma_w^2) = (2\pi\sigma_w^2)^{-n/2}(1 - \phi^2)^{1/2}\exp\Bigl[-\frac{S(\mu, \phi)}{2\sigma_w^2}\Bigr], \tag{3.106} \]

where

\[ S(\mu, \phi) = (1 - \phi^2)(x_1 - \mu)^2 + \sum_{t=2}^{n}[(x_t - \mu) - \phi(x_{t-1} - \mu)]^2. \tag{3.107} \]

Typically, S(µ, φ) is called the unconditional sum of squares. We could have also considered the estimation of µ and φ using unconditional least squares, that is, estimation by minimizing S(µ, φ).

Taking the partial derivative of the log of (3.106) with respect to σ²_w and setting the result equal to zero, we see that for any given values of µ and φ in the parameter space, σ²_w = n⁻¹S(µ, φ) maximizes the likelihood. Thus, the maximum likelihood estimate of σ²_w is

\[ \hat{\sigma}_w^2 = n^{-1}S(\hat{\mu}, \hat{\phi}), \tag{3.108} \]

where µ̂ and φ̂ are the MLEs of µ and φ, respectively. If we replace n in (3.108) by n − 2, we would obtain the unconditional least squares estimate of σ²_w.

If, in (3.106), we take logs, replace σ²_w by σ̂²_w, and ignore constants, µ̂ and φ̂ are the values that minimize the criterion function

\[ l(\mu, \phi) = \log\bigl[n^{-1}S(\mu, \phi)\bigr] - n^{-1}\log(1 - \phi^2); \tag{3.109} \]

that is, l(µ, φ) ∝ −2 log L(µ, φ, σ̂²_w).⁶ Because (3.107) and (3.109) are complicated functions of the parameters, the minimization of l(µ, φ) or S(µ, φ) is accomplished numerically. In the case of AR models, we have the advantage that, conditional on initial values, they are linear models. That is, we can drop the term in the likelihood that causes the nonlinearity. Conditioning on x1, the conditional likelihood becomes

\[ L(\mu, \phi, \sigma_w^2 \mid x_1) = \prod_{t=2}^{n}f_w[(x_t - \mu) - \phi(x_{t-1} - \mu)] = (2\pi\sigma_w^2)^{-(n-1)/2}\exp\Bigl[-\frac{S_c(\mu, \phi)}{2\sigma_w^2}\Bigr], \tag{3.110} \]

6 The criterion function is sometimes called the profile or concentrated likelihood.


where the conditional sum of squares is

\[ S_c(\mu, \phi) = \sum_{t=2}^{n}[(x_t - \mu) - \phi(x_{t-1} - \mu)]^2. \tag{3.111} \]

The conditional MLE of σ²_w is

\[ \hat{\sigma}_w^2 = S_c(\hat{\mu}, \hat{\phi})/(n-1), \tag{3.112} \]

and µ̂ and φ̂ are the values that minimize the conditional sum of squares, Sc(µ, φ). Letting α = µ(1 − φ), the conditional sum of squares can be written as

\[ S_c(\mu, \phi) = \sum_{t=2}^{n}[x_t - (\alpha + \phi x_{t-1})]^2. \tag{3.113} \]

The problem is now the linear regression problem stated in §2.2. Following the results from least squares estimation, we have α̂ = x̄₍₂₎ − φ̂x̄₍₁₎, where x̄₍₁₎ = (n − 1)⁻¹Σ_{t=1}^{n−1} xt, and x̄₍₂₎ = (n − 1)⁻¹Σ_{t=2}^{n} xt, and the conditional estimates are then

\[ \hat{\mu} = \frac{\bar{x}_{(2)} - \hat{\phi}\bar{x}_{(1)}}{1 - \hat{\phi}} \tag{3.114} \]
\[ \hat{\phi} = \frac{\sum_{t=2}^{n}(x_t - \bar{x}_{(2)})(x_{t-1} - \bar{x}_{(1)})}{\sum_{t=2}^{n}(x_{t-1} - \bar{x}_{(1)})^2}. \tag{3.115} \]

From (3.114) and (3.115), we see that µ̂ ≈ x̄ and φ̂ ≈ ρ̂(1). That is, the Yule–Walker estimators and the conditional least squares estimators are approximately the same. The only difference is the inclusion or exclusion of terms involving the endpoints, x1 and xn. We can also adjust the estimate of σ²_w in (3.112) to be equivalent to the least squares estimator, that is, divide Sc(µ̂, φ̂) by (n − 3) instead of (n − 1) in (3.112).

For general AR(p) models, maximum likelihood estimation, unconditional least squares, and conditional least squares follow analogously to the AR(1) example. For general ARMA models, it is difficult to write the likelihood as an explicit function of the parameters. Instead, it is advantageous to write the likelihood in terms of the innovations, or one-step-ahead prediction errors, xt − x^{t−1}_t. This will also be useful in Chapter 6 when we study state-space models.

For a normal ARMA(p, q) model, let β = (µ, φ1, . . . , φp, θ1, . . . , θq)′ be the (p + q + 1)-dimensional vector of the model parameters. The likelihood can be written as

\[ L(\boldsymbol{\beta}, \sigma_w^2) = \prod_{t=1}^{n}f(x_t \mid x_{t-1}, \ldots, x_1). \]

The conditional distribution of xt given x_{t−1}, . . . , x1 is Gaussian with mean x^{t−1}_t and variance P^{t−1}_t. Recall from (3.71) that P^{t−1}_t = γ(0) Π_{j=1}^{t−1}(1 − φ²_{jj}). For ARMA models, γ(0) = σ²_w Σ_{j=0}^{∞} ψ²_j, in which case we may write


\[ P_t^{t-1} = \sigma_w^2\Bigl\{\Bigl[\sum_{j=0}^{\infty}\psi_j^2\Bigr]\Bigl[\prod_{j=1}^{t-1}(1 - \phi_{jj}^2)\Bigr]\Bigr\} \overset{\text{def}}{=} \sigma_w^2\, r_t, \]

where rt is the term in the braces. Note that the rt terms are functions only of the regression parameters and that they may be computed recursively as r_{t+1} = (1 − φ²_{tt})rt with initial condition r1 = Σ_{j=0}^{∞} ψ²_j. The likelihood of the data can now be written as

\[ L(\boldsymbol{\beta}, \sigma_w^2) = (2\pi\sigma_w^2)^{-n/2}\,[r_1(\boldsymbol{\beta})r_2(\boldsymbol{\beta})\cdots r_n(\boldsymbol{\beta})]^{-1/2}\exp\Bigl[-\frac{S(\boldsymbol{\beta})}{2\sigma_w^2}\Bigr], \tag{3.116} \]

where

\[ S(\boldsymbol{\beta}) = \sum_{t=1}^{n}\Bigl[\frac{\bigl(x_t - x_t^{t-1}(\boldsymbol{\beta})\bigr)^2}{r_t(\boldsymbol{\beta})}\Bigr]. \tag{3.117} \]

Both x^{t−1}_t and rt are functions of β alone, and we make that fact explicit in (3.116)–(3.117). Given values for β and σ²_w, the likelihood may be evaluated using the techniques of §3.5. Maximum likelihood estimation would now proceed by maximizing (3.116) with respect to β and σ²_w. As in the AR(1) example, we have

\[ \hat{\sigma}_w^2 = n^{-1}S(\hat{\boldsymbol{\beta}}), \tag{3.118} \]

where β̂ is the value of β that minimizes the concentrated likelihood

\[ l(\boldsymbol{\beta}) = \log\bigl[n^{-1}S(\boldsymbol{\beta})\bigr] + n^{-1}\sum_{t=1}^{n}\log r_t(\boldsymbol{\beta}). \tag{3.119} \]

For the AR(1) model (3.105) discussed previously, recall that x^0_1 = µ and x^{t−1}_t = µ + φ(x_{t−1} − µ), for t = 2, . . . , n. Also, using the fact that φ11 = φ and φhh = 0 for h > 1, we have r1 = Σ_{j=0}^{∞} φ^{2j} = (1 − φ²)⁻¹, r2 = (1 − φ²)⁻¹(1 − φ²) = 1, and in general, rt = 1 for t = 2, . . . , n. Hence, the likelihood presented in (3.106) is identical to the innovations form of the likelihood given by (3.116). Moreover, the generic S(β) in (3.117) is S(µ, φ) given in (3.107) and the generic l(β) in (3.119) is l(µ, φ) in (3.109).

Unconditional least squares would be performed by minimizing (3.117) with respect to β. Conditional least squares estimation would involve minimizing (3.117) with respect to β but where, to ease the computational burden, the predictions and their errors are obtained by conditioning on initial values of the data. In general, numerical optimization routines are used to obtain the actual estimates and their standard errors.

Example 3.29 The Newton–Raphson and Scoring Algorithms

Two common numerical optimization routines for accomplishing maximum likelihood estimation are Newton–Raphson and scoring. We will give a brief account of the mathematical ideas here. The actual implementation of these algorithms is much more complicated than our discussion might imply. For details, the reader is referred to any of the Numerical Recipes books, for example, Press et al. (1993).

Let l(β) be a criterion function of k parameters β = (β1, . . . , βk) that we wish to minimize with respect to β. For example, consider the likelihood function given by (3.109) or by (3.119). Suppose l(β̂) is the extremum that we are interested in finding, and β̂ is found by solving ∂l(β)/∂βj = 0, for j = 1, . . . , k. Let l⁽¹⁾(β) denote the k × 1 vector of partials

\[ l^{(1)}(\boldsymbol{\beta}) = \Bigl(\frac{\partial l(\boldsymbol{\beta})}{\partial\beta_1}, \ldots, \frac{\partial l(\boldsymbol{\beta})}{\partial\beta_k}\Bigr)'. \]

Note, l⁽¹⁾(β̂) = 0, the k × 1 zero vector. Let l⁽²⁾(β) denote the k × k matrix of second-order partials

\[ l^{(2)}(\boldsymbol{\beta}) = \Bigl\{-\frac{\partial^2 l(\boldsymbol{\beta})}{\partial\beta_i\,\partial\beta_j}\Bigr\}_{i,j=1}^{k}, \]

and assume l⁽²⁾(β) is nonsingular. Let β₍₀₎ be an initial estimator of β. Then, using a Taylor expansion, we have the following approximation:

\[ \boldsymbol{0} = l^{(1)}(\hat{\boldsymbol{\beta}}) \approx l^{(1)}(\boldsymbol{\beta}_{(0)}) - l^{(2)}(\boldsymbol{\beta}_{(0)})\bigl[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_{(0)}\bigr]. \]

Setting the right-hand side equal to zero and solving for β̂ [call the solution β₍₁₎], we get

\[ \boldsymbol{\beta}_{(1)} = \boldsymbol{\beta}_{(0)} + \bigl[l^{(2)}(\boldsymbol{\beta}_{(0)})\bigr]^{-1}l^{(1)}(\boldsymbol{\beta}_{(0)}). \]

The Newton–Raphson algorithm proceeds by iterating this result, replacing β₍₀₎ by β₍₁₎ to get β₍₂₎, and so on, until convergence. Under a set of appropriate conditions, the sequence of estimators, β₍₁₎, β₍₂₎, . . . , will converge to β̂, the MLE of β.

For maximum likelihood estimation, the criterion function used is l(β) given by (3.119); l⁽¹⁾(β) is called the score vector, and l⁽²⁾(β) is called the Hessian. In the method of scoring, we replace l⁽²⁾(β) by E[l⁽²⁾(β)], the information matrix. Under appropriate conditions, the inverse of the information matrix is the asymptotic variance–covariance matrix of the estimator β̂. This is sometimes approximated by the inverse of the Hessian at β̂. If the derivatives are difficult to obtain, it is possible to use quasi-maximum likelihood estimation where numerical techniques are used to approximate the derivatives.

Example 3.30 MLE for the Recruitment Series

So far, we have fit an AR(2) model to the Recruitment series using ordinary least squares (Example 3.17) and using Yule–Walker (Example 3.27). The following is an R session used to fit an AR(2) model via maximum likelihood estimation to the Recruitment series; these results can be compared to the results in Examples 3.17 and 3.27.


    rec.mle = ar.mle(rec, order=2)
    rec.mle$x.mean   # 62.26
    rec.mle$ar       # 1.35, -.46
    sqrt(diag(rec.mle$asy.var.coef))   # .04, .04
    rec.mle$var.pred # 89.34

We now discuss least squares for ARMA(p, q) models via Gauss–Newton. For general and complete details of the Gauss–Newton procedure, the reader is referred to Fuller (1996). As before, write β = (φ1, . . . , φp, θ1, . . . , θq)′, and for the ease of discussion, we will put µ = 0. We write the model in terms of the errors

\[ w_t(\boldsymbol{\beta}) = x_t - \sum_{j=1}^{p}\phi_j x_{t-j} - \sum_{k=1}^{q}\theta_k w_{t-k}(\boldsymbol{\beta}), \tag{3.120} \]

emphasizing the dependence of the errors on the parameters.

For conditional least squares, we approximate the residual sum of squares by conditioning on x1, . . . , xp (if p > 0) and wp = w_{p−1} = w_{p−2} = ⋯ = w_{1−q} = 0 (if q > 0), in which case, given β, we may evaluate (3.120) for t = p + 1, p + 2, . . . , n. Using this conditioning argument, the conditional error sum of squares is

\[ S_c(\boldsymbol{\beta}) = \sum_{t=p+1}^{n}w_t^2(\boldsymbol{\beta}). \tag{3.121} \]

Minimizing Sc(β) with respect to β yields the conditional least squares estimates. If q = 0, the problem is linear regression and no iterative technique is needed to minimize Sc(φ1, . . . , φp). If q > 0, the problem becomes nonlinear regression and we will have to rely on numerical optimization.

When n is large, conditioning on a few initial values will have little influence on the final parameter estimates. In the case of small to moderate sample sizes, one may wish to rely on unconditional least squares. The unconditional least squares problem is to choose β to minimize the unconditional sum of squares, which we have generically denoted by S(β) in this section. The unconditional sum of squares can be written in various ways, and one useful form in the case of ARMA(p, q) models is derived in Box et al. (1994, Appendix A7.3). They showed (see Problem 3.19) the unconditional sum of squares can be written as

\[ S(\boldsymbol{\beta}) = \sum_{t=-\infty}^{n}\hat{w}_t^2(\boldsymbol{\beta}), \tag{3.122} \]

where ŵt(β) = E(wt | x1, . . . , xn). When t ≤ 0, the ŵt(β) are obtained by backcasting. As a practical matter, we approximate S(β) by starting the sum at t = −M + 1, where M is chosen large enough to guarantee Σ_{t=−∞}^{−M} ŵ²_t(β) ≈ 0. In the case of unconditional least squares estimation, a numerical optimization technique is needed even when q = 0.


To employ Gauss–Newton, let β₍₀₎ = (φ₁⁽⁰⁾, . . . , φp⁽⁰⁾, θ₁⁽⁰⁾, . . . , θq⁽⁰⁾)′ be an initial estimate of β. For example, we could obtain β₍₀₎ by method of moments. The first-order Taylor expansion of wt(β) is

\[ w_t(\boldsymbol{\beta}) \approx w_t(\boldsymbol{\beta}_{(0)}) - \bigl(\boldsymbol{\beta} - \boldsymbol{\beta}_{(0)}\bigr)'\boldsymbol{z}_t(\boldsymbol{\beta}_{(0)}), \tag{3.123} \]

where

\[ \boldsymbol{z}_t(\boldsymbol{\beta}_{(0)}) = \Bigl(-\frac{\partial w_t(\boldsymbol{\beta}_{(0)})}{\partial\beta_1}, \ldots, -\frac{\partial w_t(\boldsymbol{\beta}_{(0)})}{\partial\beta_{p+q}}\Bigr)', \qquad t = 1, \ldots, n. \]

The linear approximation of Sc(β) is

\[ Q(\boldsymbol{\beta}) = \sum_{t=p+1}^{n}\Bigl[w_t(\boldsymbol{\beta}_{(0)}) - \bigl(\boldsymbol{\beta} - \boldsymbol{\beta}_{(0)}\bigr)'\boldsymbol{z}_t(\boldsymbol{\beta}_{(0)})\Bigr]^2 \tag{3.124} \]

and this is the quantity that we will minimize. For approximate unconditional least squares, we would start the sum in (3.124) at t = −M + 1, for a large value of M, and work with the backcasted values.

Using the results of ordinary least squares (§2.2), we know

\[ (\boldsymbol{\beta} - \boldsymbol{\beta}_{(0)}) = \Bigl(n^{-1}\sum_{t=p+1}^{n}\boldsymbol{z}_t(\boldsymbol{\beta}_{(0)})\boldsymbol{z}_t'(\boldsymbol{\beta}_{(0)})\Bigr)^{-1}\Bigl(n^{-1}\sum_{t=p+1}^{n}\boldsymbol{z}_t(\boldsymbol{\beta}_{(0)})w_t(\boldsymbol{\beta}_{(0)})\Bigr) \tag{3.125} \]

minimizes Q(β). From (3.125), we write the one-step Gauss–Newton estimate as

\[ \boldsymbol{\beta}_{(1)} = \boldsymbol{\beta}_{(0)} + \Delta(\boldsymbol{\beta}_{(0)}), \tag{3.126} \]

where Δ(β₍₀₎) denotes the right-hand side of (3.125). Gauss–Newton estimation is accomplished by replacing β₍₀₎ by β₍₁₎ in (3.126). This process is repeated by calculating, at iteration j = 2, 3, . . . ,

\[ \boldsymbol{\beta}_{(j)} = \boldsymbol{\beta}_{(j-1)} + \Delta(\boldsymbol{\beta}_{(j-1)}) \]

until convergence.

Example 3.31 Gauss–Newton for an MA(1)

Consider an invertible MA(1) process, xt = wt + θw_{t−1}. Write the truncated errors as

\[ w_t(\theta) = x_t - \theta w_{t-1}(\theta), \qquad t = 1, \ldots, n, \tag{3.127} \]

where we condition on w0(θ) = 0. Taking derivatives,

\[ -\frac{\partial w_t(\theta)}{\partial\theta} = w_{t-1}(\theta) + \theta\frac{\partial w_{t-1}(\theta)}{\partial\theta}, \qquad t = 1, \ldots, n, \tag{3.128} \]

where ∂w0(θ)/∂θ = 0. Using the notation of (3.123), we can also write (3.128) as



\[ z_t(\theta) = w_{t-1}(\theta) - \theta z_{t-1}(\theta), \qquad t = 1, \ldots, n, \tag{3.129} \]

where z0(θ) = 0.

Fig. 3.7. ACF and PACF of transformed glacial varves.

Let θ₍₀₎ be an initial estimate of θ, for example, the estimate given in Example 3.28. Then, the Gauss–Newton procedure for conditional least squares is given by

\[ \theta_{(j+1)} = \theta_{(j)} + \frac{\sum_{t=1}^{n}z_t(\theta_{(j)})\,w_t(\theta_{(j)})}{\sum_{t=1}^{n}z_t^2(\theta_{(j)})}, \qquad j = 0, 1, 2, \ldots, \tag{3.130} \]

where the values in (3.130) are calculated recursively using (3.127) and (3.129). The calculations are stopped when |θ₍ⱼ₊₁₎ − θ₍ⱼ₎|, or |Q(θ₍ⱼ₊₁₎) − Q(θ₍ⱼ₎)|, are smaller than some preset amount.
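The recursions (3.127)–(3.130) are simple enough to code directly; the function below is our own sketch (not a packaged routine), and the simulated series used to exercise it is illustrative:

    gauss.newton.ma1 <- function(x, theta0, maxit = 20, tol = 1e-6) {
      n = length(x); theta = theta0
      for (j in 1:maxit) {
        w = numeric(n); z = numeric(n)
        w[1] = x[1]                           # w_1(theta) with w_0 = 0
        z[1] = 0                              # z_1(theta) with z_0 = 0, w_0 = 0
        for (t in 2:n) {
          w[t] = x[t] - theta * w[t-1]        # (3.127)
          z[t] = w[t-1] - theta * z[t-1]      # (3.129)
        }
        delta = sum(z * w) / sum(z^2)         # Gauss-Newton step from (3.130)
        theta = theta + delta
        if (abs(delta) < tol) break
      }
      theta
    }
    set.seed(3)
    x = arima.sim(list(order=c(0,0,1), ma=-.7), n=500)
    gauss.newton.ma1(x, theta0 = -.1)         # should land near the true theta = -.7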

Example 3.32 Fitting the Glacial Varve Series

Consider the series of glacial varve thicknesses from Massachusetts for n = 634 years, as analyzed in Example 2.6 and in Problem 2.8, where it was argued that a first-order moving average model might fit the logarithmically transformed and differenced varve series, say,

\[ \nabla\log(x_t) = \log(x_t) - \log(x_{t-1}) = \log\Bigl(\frac{x_t}{x_{t-1}}\Bigr), \]

which can be interpreted as being approximately the percentage change in the thickness.


Fig. 3.8. Conditional sum of squares versus values of the moving average parameter for the glacial varve example, Example 3.32. Vertical lines indicate the values of the parameter obtained via Gauss–Newton; see Table 3.2 for the actual values.

The sample ACF and PACF, shown in Figure 3.7, confirm the tendency of ∇ log(x_t) to behave as a first-order moving average process, as the ACF has only a significant peak at lag one and the PACF decreases exponentially. Using Table 3.1, this sample behavior fits that of the MA(1) very well.

The results of eleven iterations of the Gauss–Newton procedure, (3.130), starting with θ_(0) = −.10 are given in Table 3.2. The final estimate is θ̂ = θ_(11) = −.773; interim values and the corresponding value of the conditional sum of squares, S_c(θ) given in (3.121), are also displayed in the table. The final estimate of the error variance is σ̂²_w = 148.98/632 = .236 with 632 degrees of freedom (one is lost in differencing). The value of the sum of the squared derivatives at convergence is Σ_{t=1}^{n} z_t²(θ_(11)) = 369.73, and consequently, the estimated standard error of θ̂ is √(.236/369.73) = .025 (to estimate the standard error, we are using the standard regression results from (2.9) as an approximation); this leads to a t-value of −.773/.025 = −30.92 with 632 degrees of freedom.

Figure 3.8 displays the conditional sum of squares, S_c(θ), as a function of θ, as well as indicating the values of each step of the Gauss–Newton algorithm. Note that the Gauss–Newton procedure takes large steps toward the minimum initially, and then takes very small steps as it gets close to the minimizing value. When there is only one parameter, as in this case, it would be easy to evaluate S_c(θ) on a grid of points, and then choose the appropriate value of θ from the grid search. It would be difficult, however, to perform grid searches when there are many parameters.


Table 3.2. Gauss–Newton Results for Example 3.32

  j    θ_(j)    S_c(θ_(j))   Σ_{t=1}^{n} z_t²(θ_(j))
  0   −0.100    195.0010     183.3464
  1   −0.250    177.7614     163.3038
  2   −0.400    165.0027     161.6279
  3   −0.550    155.6723     182.6432
  4   −0.684    150.2896     247.4942
  5   −0.736    149.2283     304.3125
  6   −0.757    149.0272     337.9200
  7   −0.766    148.9885     355.0465
  8   −0.770    148.9812     363.2813
  9   −0.771    148.9804     365.4045
 10   −0.772    148.9799     367.5544
 11   −0.773    148.9799     369.7314
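A sketch of how Table 3.2 could be reproduced in R, assuming the varve series from the astsa package is available; the loop simply iterates (3.127)–(3.130), and the variable names are ours.

library(astsa)
x = diff(log(varve))                    # transformed, differenced varve series
n = length(x)
theta = -0.1                            # starting value theta_(0)
for (j in 0:11) {
  w = z = numeric(n); w[1] = x[1]
  for (t in 2:n) {
    w[t] = x[t] - theta * w[t-1]        # (3.127)
    z[t] = w[t-1] - theta * z[t-1]      # (3.129)
  }
  cat(j, round(theta, 3), round(sum(w^2), 4), round(sum(z^2), 4), "\n")  # one row of Table 3.2
  theta = theta + sum(z * w) / sum(z^2) # Gauss-Newton step (3.130)
}

For comparison, essentially the same fit is obtained from sarima(log(varve), 0, 1, 1, no.constant=TRUE), as in Example 3.40.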

In the general case of causal and invertible ARMA(p, q) models, maximum likelihood estimation and conditional and unconditional least squares estimation (and Yule–Walker estimation in the case of AR models) all lead to optimal estimators. The proof of this general result can be found in a number of texts on theoretical time series analysis (for example, Brockwell and Davis, 1991, or Hannan, 1970, to mention a few). We will denote the ARMA coefficient parameters by β = (φ_1, . . . , φ_p, θ_1, . . . , θ_q)′.

Property 3.10 Large Sample Distribution of the Estimators
Under appropriate conditions, for causal and invertible ARMA processes, the maximum likelihood, the unconditional least squares, and the conditional least squares estimators, each initialized by the method of moments estimator, all provide optimal estimators of σ²_w and β, in the sense that σ̂²_w is consistent, and the asymptotic distribution of β̂ is the best asymptotic normal distribution. In particular, as n → ∞,

√n (β̂ − β) →d N(0, σ²_w Γ⁻¹_{p,q}).    (3.131)

The asymptotic variance–covariance matrix of the estimator β̂ is the inverse of the information matrix. In particular, the (p + q) × (p + q) matrix Γ_{p,q} has the form

Γ_{p,q} = ( Γ_φφ   Γ_φθ
            Γ_θφ   Γ_θθ ).    (3.132)

The p × p matrix Γ_φφ is given by (3.100), that is, the ij-th element of Γ_φφ, for i, j = 1, . . . , p, is γ_x(i − j) from an AR(p) process, φ(B)x_t = w_t. Similarly, Γ_θθ is a q × q matrix with the ij-th element, for i, j = 1, . . . , q, equal to γ_y(i − j) from an AR(q) process, θ(B)y_t = w_t. The p × q matrix Γ_φθ = {γ_xy(i − j)}, for i = 1, . . . , p; j = 1, . . . , q; that is, the ij-th element is the cross-covariance


between the two AR processes given by φ(B)x_t = w_t and θ(B)y_t = w_t. Finally, Γ_θφ = Γ′_φθ is q × p.

Further discussion of Property 3.10, including a proof for the case of least squares estimators for AR(p) processes, can be found in Appendix B, §B.3.

Example 3.33 Some Specific Asymptotic Distributions

The following are some specific cases of Property 3.10.

AR(1): γ_x(0) = σ²_w/(1 − φ²), so σ²_w Γ⁻¹_{1,0} = (1 − φ²). Thus,

φ̂ ∼ AN[ φ, n⁻¹(1 − φ²) ].    (3.133)

AR(2): The reader can verify that

γ_x(0) = ( (1 − φ_2)/(1 + φ_2) ) · σ²_w / ( (1 − φ_2)² − φ_1² )

and γ_x(1) = φ_1 γ_x(0) + φ_2 γ_x(1). From these facts, we can compute Γ⁻¹_{2,0}. In particular, we have

(φ̂_1, φ̂_2)′ ∼ AN[ (φ_1, φ_2)′,  n⁻¹ ( 1 − φ_2²    −φ_1(1 + φ_2)
                                       sym          1 − φ_2²     ) ].    (3.134)

MA(1): In this case, write θ(B)y_t = w_t, or y_t + θy_{t−1} = w_t. Then, analogous to the AR(1) case, γ_y(0) = σ²_w/(1 − θ²), so σ²_w Γ⁻¹_{0,1} = (1 − θ²). Thus,

θ̂ ∼ AN[ θ, n⁻¹(1 − θ²) ].    (3.135)

MA(2): Write y_t + θ_1 y_{t−1} + θ_2 y_{t−2} = w_t, so, analogous to the AR(2) case, we have

(θ̂_1, θ̂_2)′ ∼ AN[ (θ_1, θ_2)′,  n⁻¹ ( 1 − θ_2²    θ_1(1 + θ_2)
                                       sym          1 − θ_2²    ) ].    (3.136)

ARMA(1,1): To calculate Γ_φθ, we must find γ_xy(0), where x_t − φx_{t−1} = w_t and y_t + θy_{t−1} = w_t. We have

γ_xy(0) = cov(x_t, y_t) = cov(φx_{t−1} + w_t, −θy_{t−1} + w_t) = −φθ γ_xy(0) + σ²_w.

Solving, we find γ_xy(0) = σ²_w/(1 + φθ). Thus,

(φ̂, θ̂)′ ∼ AN[ (φ, θ)′,  n⁻¹ ( (1 − φ²)⁻¹    (1 + φθ)⁻¹
                               sym            (1 − θ²)⁻¹ )⁻¹ ].    (3.137)


Example 3.34 Overfitting Caveat

The asymptotic behavior of the parameter estimators gives us an additional insight into the problem of fitting ARMA models to data. For example, suppose a time series follows an AR(1) process and we decide to fit an AR(2) to the data. Do any problems occur in doing this? More generally, why not simply fit large-order AR models to make sure that we capture the dynamics of the process? After all, if the process is truly an AR(1), the other autoregressive parameters will not be significant. The answer is that if we overfit, we obtain less efficient, or less precise, parameter estimates. For example, if we fit an AR(1) to an AR(1) process, for large n, var(φ̂_1) ≈ n⁻¹(1 − φ_1²). But, if we fit an AR(2) to the AR(1) process, for large n, var(φ̂_1) ≈ n⁻¹(1 − φ_2²) = n⁻¹ because φ_2 = 0. Thus, the variance of φ̂_1 has been inflated, making the estimator less precise.

We do want to mention, however, that overfitting can be used as a diagnostic tool. For example, if we fit an AR(2) model to the data and are satisfied with that model, then adding one more parameter and fitting an AR(3) should lead to approximately the same model as in the AR(2) fit. We will discuss model diagnostics in more detail in §3.8.
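A small simulation (ours, not from the text) illustrates the variance inflation: generate AR(1) data and compare the spread of φ̂_1 from a correctly specified AR(1) fit and from an overfit AR(2).

set.seed(90210)
phi1.ar1 = phi1.ar2 = numeric(500)
for (i in 1:500) {
  x = arima.sim(n = 200, list(ar = .5))
  phi1.ar1[i] = ar.yw(x, order.max = 1, aic = FALSE)$ar[1]   # correct order
  phi1.ar2[i] = ar.yw(x, order.max = 2, aic = FALSE)$ar[1]   # overfit AR(2)
}
var(phi1.ar1); var(phi1.ar2)   # roughly (1 - .5^2)/200 versus 1/200

The exact numbers will vary with the seed, but the second variance is noticeably larger, as the asymptotic argument predicts.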

The reader might wonder, for example, why the asymptotic distributions of φ̂ from an AR(1) and θ̂ from an MA(1) are of the same form; compare (3.133) to (3.135). It is possible to explain this unexpected result heuristically using the intuition of linear regression. That is, for the normal regression model presented in §2.2 with no intercept term, x_t = βz_t + w_t, we know β̂ is normally distributed with mean β, and from (2.9),

var{ √n (β̂ − β) } = n σ²_w ( Σ_{t=1}^{n} z_t² )⁻¹ = σ²_w ( n⁻¹ Σ_{t=1}^{n} z_t² )⁻¹.

For the causal AR(1) model given by x_t = φx_{t−1} + w_t, the intuition of regression tells us to expect that, for n large,

√n (φ̂ − φ)

is approximately normal with mean zero and with variance given by

σ²_w ( n⁻¹ Σ_{t=2}^{n} x_{t−1}² )⁻¹.

Now, n⁻¹ Σ_{t=2}^{n} x_{t−1}² is the sample variance (recall that the mean of x_t is zero) of the x_t, so as n becomes large we would expect it to approach var(x_t) = γ(0) = σ²_w/(1 − φ²). Thus, the large sample variance of √n (φ̂ − φ) is

σ²_w γ_x(0)⁻¹ = σ²_w ( σ²_w / (1 − φ²) )⁻¹ = (1 − φ²);


that is, (3.133) holds.

In the case of an MA(1), we may use the discussion of Example 3.31 to write an approximate regression model for the MA(1). That is, consider the approximation (3.129) as the regression model

z_t(θ) = −θ z_{t−1}(θ) + w_{t−1},

where now, z_{t−1}(θ), as defined in Example 3.31, plays the role of the regressor. Continuing with the analogy, we would expect the asymptotic distribution of √n (θ̂ − θ) to be normal, with mean zero, and approximate variance

σ²_w ( n⁻¹ Σ_{t=2}^{n} z_{t−1}²(θ) )⁻¹.

As in the AR(1) case, n⁻¹ Σ_{t=2}^{n} z_{t−1}²(θ) is the sample variance of the z_t(θ) so, for large n, this should be var{z_t(θ)} = γ_z(0), say. But note, as seen from (3.129), z_t(θ) is approximately an AR(1) process with parameter −θ. Thus,

σ²_w γ_z(0)⁻¹ = σ²_w ( σ²_w / (1 − (−θ)²) )⁻¹ = (1 − θ²),

which agrees with (3.135). Finally, the asymptotic distributions of the AR parameter estimates and the MA parameter estimates are of the same form because in the MA case, the "regressors" are the differential processes z_t(θ) that have AR structure, and it is this structure that determines the asymptotic variance of the estimators. For a rigorous account of this approach for the general case, see Fuller (1996, Theorem 5.5.4).

In Example 3.32, the estimated standard error of θ̂ was .025. In that example, we used regression results to estimate the standard error as the square root of

n⁻¹ σ̂²_w ( n⁻¹ Σ_{t=1}^{n} z_t²(θ̂) )⁻¹ = σ̂²_w / Σ_{t=1}^{n} z_t²(θ̂),

where n = 632, σ̂²_w = .236, Σ_{t=1}^{n} z_t²(θ̂) = 369.73, and θ̂ = −.773. Using (3.135), we could have also calculated this value using the asymptotic approximation, the square root of (1 − (−.773)²)/632, which is also .025.
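Both calculations are easy to check numerically in R using the quantities quoted above:

sqrt(.236/369.73)              # regression-based standard error
sqrt((1 - (-.773)^2)/632)      # asymptotic approximation from (3.135)

Each returns approximately .025.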

If n is small, or if the parameters are close to the boundaries, the asymptotic approximations can be quite poor. The bootstrap can be helpful in this case; for a broad treatment of the bootstrap, see Efron and Tibshirani (1994). We discuss the case of an AR(1) here and leave the general discussion for Chapter 6. For now, we give a simple example of the bootstrap for an AR(1) process.


Fig. 3.9. One hundred observations generated from the model in Example 3.35.

Example 3.35 Bootstrapping an AR(1)

We consider an AR(1) model with a regression coefficient near the boundary of causality and an error process that is symmetric but not normal. Specifically, consider the causal model

x_t = µ + φ(x_{t−1} − µ) + w_t,    (3.138)

where µ = 50, φ = .95, and w_t are iid double exponential with location zero and scale parameter β = 2. The density of w_t is given by

f(w) = (1/2β) exp{ −|w|/β },  −∞ < w < ∞.

In this example, E(w_t) = 0 and var(w_t) = 2β² = 8. Figure 3.9 shows n = 100 simulated observations from this process. This particular realization is interesting; the data look like they were generated from a nonstationary process with three different mean levels. In fact, the data were generated from a well-behaved, albeit non-normal, stationary and causal model. To show the advantages of the bootstrap, we will act as if we do not know the actual error distribution and we will proceed as if it were normal; of course, this means, for example, that the normal based MLE of φ will not be the actual MLE because the data are not normal.

Using the data shown in Figure 3.9, we obtained the Yule–Walker estimates µ̂ = 40.05, φ̂ = .96, and s²_w = 15.30, where s²_w is the estimate of var(w_t). Based on Property 3.10, we would say that φ̂ is approximately normal with mean φ (which we supposedly do not know) and variance (1 − φ²)/100, which we would approximate by (1 − .96²)/100 = .03².


Fig. 3.10. Finite sample density of the Yule–Walker estimate of φ in Example 3.35.

To assess the finite sample distribution of φ̂ when n = 100, we simulated 1000 realizations of this AR(1) process and estimated the parameters via Yule–Walker. The finite sampling density of the Yule–Walker estimate of φ, based on the 1000 repeated simulations, is shown in Figure 3.10. Clearly the sampling distribution is not close to normality for this sample size. The mean of the distribution shown in Figure 3.10 is .89, and the variance of the distribution is .05²; these values are considerably different than the asymptotic values. Some of the quantiles of the finite sample distribution are .79 (5%), .86 (25%), .90 (50%), .93 (75%), and .95 (95%). The R code to perform the simulation and plot the histogram is as follows:

set.seed(111)
phi.yw = rep(NA, 1000)
for (i in 1:1000){
  e = rexp(150, rate=.5); u = runif(150,-1,1); de = e*sign(u)
  x = 50 + arima.sim(n=100, list(ar=.95), innov=de, n.start=50)
  phi.yw[i] = ar.yw(x, order=1)$ar }
hist(phi.yw, prob=TRUE, main="")
lines(density(phi.yw, bw=.015))

Before discussing the bootstrap, we first investigate the sample innovation process, x_t − x_t^{t−1}, with corresponding variances P_t^{t−1}. For the AR(1) model in this example,

x_t^{t−1} = µ + φ(x_{t−1} − µ),  t = 2, . . . , 100.

From this, it follows that


P_t^{t−1} = E(x_t − x_t^{t−1})² = σ²_w,  t = 2, . . . , 100.

When t = 1, we have

x_1^0 = µ  and  P_1^0 = σ²_w/(1 − φ²).

Thus, the innovations have zero mean but different variances; in order that all of the innovations have the same variance, σ²_w, we will write them as

ε_1 = (x_1 − µ) √(1 − φ²)
ε_t = (x_t − µ) − φ(x_{t−1} − µ),  for t = 2, . . . , 100.    (3.139)

From these equations, we can write the model in terms of the ε_t as

x_1 = µ + ε_1/√(1 − φ²)
x_t = µ + φ(x_{t−1} − µ) + ε_t,  for t = 2, . . . , 100.    (3.140)

Next, replace the parameters with their estimates in (3.139), that is, µ̂ = 40.048 and φ̂ = .957, and denote the resulting sample innovations as {ε̂_1, . . . , ε̂_100}. To obtain one bootstrap sample, first randomly sample, with replacement, n = 100 values from the set of sample innovations; call the sampled values {ε_1*, . . . , ε_100*}. Now, generate a bootstrapped data set sequentially by setting

x_1* = 40.048 + ε_1*/√(1 − .957²)
x_t* = 40.048 + .957(x_{t−1}* − 40.048) + ε_t*,  t = 2, . . . , n.    (3.141)

Next, estimate the parameters as if the data were x_t*. Call these estimates µ̂(1), φ̂(1), and s²_w(1). Repeat this process a large number, B, of times, generating a collection of bootstrapped parameter estimates, {µ̂(b), φ̂(b), s²_w(b); b = 1, . . . , B}. We can then approximate the finite sample distribution of an estimator from the bootstrapped parameter values. For example, we can approximate the distribution of φ̂ − φ by the empirical distribution of φ̂(b) − φ̂, for b = 1, . . . , B.

Figure 3.11 shows the bootstrap histogram of 200 bootstrapped estimates of φ using the data shown in Figure 3.9. In addition, Figure 3.11 shows a density estimate based on the bootstrap histogram, as well as the asymptotic normal density that would have been used based on Property 3.10. Clearly the bootstrap distribution of φ̂ is closer to the distribution of φ̂ shown in Figure 3.10 than to the asymptotic normal approximation. In particular, the mean of the distribution of φ̂(b) is .92 with a variance of .05². Some quantiles of this distribution are .83 (5%), .90 (25%), .93 (50%), .95 (75%), and .98 (95%).

To perform a similar bootstrap exercise in R, use the following commands. We note that the R estimation procedure is conditional on the first observation, so the first residual is not returned. To get around this problem,


Fig. 3.11. Bootstrap histogram of φ̂ based on 200 bootstraps; a density estimate based on the histogram (solid line) and the corresponding asymptotic normal density (dashed line).

we simply fix the first observation and bootstrap the remaining data. The simulated data are available in the file ar1boot, but you can simulate your own data as was done in the code that produced Figure 3.10.

x = ar1boot
m = mean(x)                     # estimate of mu
fit = ar.yw(x, order=1)
phi = fit$ar                    # estimate of phi
nboot = 200                     # number of bootstrap replicates
resids = fit$resid[-1]          # the first resid is NA
x.star = x                      # initialize x*
phi.star.yw = rep(NA, nboot)
for (i in 1:nboot) {
  resid.star = sample(resids, replace=TRUE)
  for (t in 1:99){ x.star[t+1] = m + phi*(x.star[t]-m) + resid.star[t] }
  phi.star.yw[i] = ar.yw(x.star, order=1)$ar }
hist(phi.star.yw, 10, main="", prob=TRUE, ylim=c(0,14), xlim=c(.75,1.05))
lines(density(phi.star.yw, bw=.02))
u = seq(.75, 1.05, by=.001)
lines(u, dnorm(u, mean=.96, sd=.03), lty="dashed", lwd=2)
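Continuing the sketch (our addition), the bootstrap approximation to the distribution of φ̂ − φ described above can be summarized directly from the replicates:

quantile(phi.star.yw - phi, probs = c(.05, .25, .5, .75, .95))  # empirical quantiles of phi*(b) - phi-hat
mean(phi.star.yw); var(phi.star.yw)                             # compare with the values quoted in the text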


3.7 Integrated Models for Nonstationary Data

In Chapters 1 and 2, we saw that if x_t is a random walk, x_t = x_{t−1} + w_t, then by differencing x_t, we find that ∇x_t = w_t is stationary. In many situations, time series can be thought of as being composed of two components, a nonstationary trend component and a zero-mean stationary component. For example, in §2.2 we considered the model

xt = µt + yt, (3.142)

where µ_t = β_0 + β_1 t and y_t is stationary. Differencing such a process will lead to a stationary process:

∇xt = xt − xt−1 = β1 + yt − yt−1 = β1 +∇yt.

Another model that leads to first differencing is the case in which µ_t in (3.142) is stochastic and slowly varying according to a random walk. That is,

µt = µt−1 + vt

where vt is stationary. In this case,

∇xt = vt +∇yt,

is stationary. If µ_t in (3.142) is a k-th order polynomial, µ_t = Σ_{j=0}^{k} β_j t^j, then (Problem 3.27) the differenced series ∇^k x_t is stationary. Stochastic trend models can also lead to higher order differencing. For example, suppose

µt = µt−1 + vt and vt = vt−1 + et,

where et is stationary. Then, ∇xt = vt +∇yt is not stationary, but

∇2xt = et +∇2yt

is stationary.

The integrated ARMA, or ARIMA, model is a broadening of the class of ARMA models to include differencing.

Definition 3.11 A process xt is said to be ARIMA(p, d, q) if

∇dxt = (1−B)dxt

is ARMA(p, q). In general, we will write the model as

φ(B)(1−B)dxt = θ(B)wt. (3.143)

If E(∇dxt) = µ, we write the model as

φ(B)(1−B)dxt = δ + θ(B)wt,

where δ = µ(1− φ1 − · · · − φp).


Because of the nonstationarity, care must be taken when deriving forecasts. For the sake of completeness, we discuss this issue briefly here, but we stress the fact that both the theoretical and computational aspects of the problem are best handled via state-space models. We discuss the theoretical details in Chapter 6. For information on the state-space based computational aspects in R, see the ARIMA help files (?arima and ?predict.Arima); our scripts sarima and sarima.for are basically front ends for these R scripts.

It should be clear that, since y_t = ∇^d x_t is ARMA, we can use §3.5 methods to obtain forecasts of y_t, which in turn lead to forecasts for x_t. For example, if d = 1, given forecasts y^n_{n+m} for m = 1, 2, . . ., we have y^n_{n+m} = x^n_{n+m} − x^n_{n+m−1}, so that

x^n_{n+m} = y^n_{n+m} + x^n_{n+m−1}

with initial condition x^n_{n+1} = y^n_{n+1} + x_n (noting x^n_n = x_n).

It is a little more difficult to obtain the prediction errors P^n_{n+m}, but for large n, the approximation used in §3.5, equation (3.86), works well. That is, the mean-squared prediction error can be approximated by

P^n_{n+m} = σ²_w Σ_{j=0}^{m−1} ψ_j*²,    (3.144)

where ψ_j* is the coefficient of z^j in ψ*(z) = θ(z)/[φ(z)(1 − z)^d].

To better understand integrated models, we examine the properties of some simple cases; Problem 3.29 covers the ARIMA(1, 1, 0) case.

Example 3.36 Random Walk with Drift

To fix ideas, we begin by considering the random walk with drift model first presented in Example 1.11, that is,

x_t = δ + x_{t−1} + w_t,

for t = 1, 2, . . ., and x_0 = 0. Technically, the model is not ARIMA, but we could include it trivially as an ARIMA(0, 1, 0) model. Given data x_1, . . . , x_n, the one-step-ahead forecast is given by

x^n_{n+1} = E(x_{n+1} | x_n, . . . , x_1) = E(δ + x_n + w_{n+1} | x_n, . . . , x_1) = δ + x_n.

The two-step-ahead forecast is given by x^n_{n+2} = δ + x^n_{n+1} = 2δ + x_n, and consequently, the m-step-ahead forecast, for m = 1, 2, . . ., is

x^n_{n+m} = mδ + x_n.    (3.145)

To obtain the forecast errors, it is convenient to recall equation (1.4), i.e., x_n = nδ + Σ_{j=1}^{n} w_j, in which case we may write

x_{n+m} = (n + m)δ + Σ_{j=1}^{n+m} w_j = mδ + x_n + Σ_{j=n+1}^{n+m} w_j.


From this it follows that the m-step-ahead prediction error is given by

P^n_{n+m} = E(x_{n+m} − x^n_{n+m})² = E( Σ_{j=n+1}^{n+m} w_j )² = m σ²_w.    (3.146)

Hence, unlike the stationary case (see Example 3.22), as the forecast horizon grows, the prediction errors, (3.146), increase without bound and the forecasts follow a straight line with slope δ emanating from x_n. We note that (3.144) is exact in this case because ψ*(z) = 1/(1 − z) = Σ_{j=0}^{∞} z^j for |z| < 1, so that ψ_j* = 1 for all j.

The w_t are Gaussian, so estimation is straightforward because the differenced data, say y_t = ∇x_t, are independent and identically distributed normal variates with mean δ and variance σ²_w. Consequently, optimal estimates of δ and σ²_w are the sample mean and variance of the y_t, respectively.
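A brief sketch (ours) of these calculations for a simulated random walk with drift; δ and σ²_w are estimated by the sample mean and variance of the differenced data, as noted above.

set.seed(314)
delta = .2; n = 100
x = cumsum(delta + rnorm(n))          # random walk with drift, x_0 = 0
y = diff(x)                           # iid normal with mean delta
delta.hat = mean(y); sig2.hat = var(y)
m = 1:12
fore = x[n] + m * delta.hat           # m-step-ahead forecasts, cf. (3.145)
se   = sqrt(m * sig2.hat)             # prediction standard errors, cf. (3.146)
cbind(fore, se)

As (3.146) indicates, the forecast standard errors grow like √m rather than leveling off.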

Example 3.37 IMA(1, 1) and EWMA

The ARIMA(0,1,1), or IMA(1,1), model is of interest because many economic time series can be successfully modeled this way. In addition, the model leads to a frequently used, and abused, forecasting method called exponentially weighted moving averages (EWMA). We will write the model as

xt = xt−1 + wt − λwt−1, (3.147)

with |λ| < 1, for t = 1, 2, . . ., and x_0 = 0, because this model formulation is easier to work with here, and it leads to the standard representation for EWMA. We could have included a drift term in (3.147), as was done in the previous example, but for the sake of simplicity, we leave it out of the discussion. If we write

y_t = w_t − λw_{t−1},

we may write (3.147) as x_t = x_{t−1} + y_t. Because |λ| < 1, y_t has an invertible representation, w_t = Σ_{j=0}^{∞} λ^j y_{t−j}, and substituting y_t = x_t − x_{t−1}, we may write

x_t = Σ_{j=1}^{∞} (1 − λ)λ^{j−1} x_{t−j} + w_t    (3.148)

as an approximation for large t (put x_t = 0 for t ≤ 0). Verification of (3.148) is left to the reader (Problem 3.28). Using the approximation (3.148), we have that the approximate one-step-ahead predictor, using the notation of §3.5, is

x̃_{n+1} = Σ_{j=1}^{∞} (1 − λ)λ^{j−1} x_{n+1−j}
        = (1 − λ)x_n + λ Σ_{j=1}^{∞} (1 − λ)λ^{j−1} x_{n−j}
        = (1 − λ)x_n + λ x̃_n.    (3.149)


From (3.149), we see that the new forecast is a linear combination of the old forecast and the new observation. Based on (3.149) and the fact that we only observe x_1, . . . , x_n, and consequently y_1, . . . , y_n (because y_t = x_t − x_{t−1}; x_0 = 0), the truncated forecasts are

x̃^n_{n+1} = (1 − λ)x_n + λ x̃^{n−1}_n,  n ≥ 1,    (3.150)

with x̃^0_1 = x_1 as an initial value. The mean-square prediction error can be approximated using (3.144) by noting that ψ*(z) = (1 − λz)/(1 − z) = 1 + (1 − λ) Σ_{j=1}^{∞} z^j for |z| < 1; consequently, for large n, (3.144) leads to

P^n_{n+m} ≈ σ²_w [ 1 + (m − 1)(1 − λ)² ].

In EWMA, the parameter 1 − λ is often called the smoothing parameter and is restricted to be between zero and one. Larger values of λ lead to smoother forecasts. This method of forecasting is popular because it is easy to use; we need only retain the previous forecast value and the current observation to forecast the next time period. Unfortunately, as previously suggested, the method is often abused because some forecasters do not verify that the observations follow an IMA(1, 1) process, and often arbitrarily pick values of λ. In the following, we show how to generate 100 observations from an IMA(1,1) model with λ = −θ = .8 and then calculate and display the fitted EWMA superimposed on the data. This is accomplished using the Holt-Winters command in R (see the help file ?HoltWinters for details):

set.seed(666)
x = arima.sim(list(order = c(0,1,1), ma = -0.8), n = 100)
(x.ima = HoltWinters(x, beta=FALSE, gamma=FALSE))   # alpha below is 1 - lambda
  # Smoothing parameter: alpha: 0.1663072
plot(x.ima)
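The truncated forecasts (3.150) can also be computed directly; the following lines (ours) use the α fitted above and, apart from start-up effects, reproduce the EWMA that HoltWinters plots.

alpha = x.ima$alpha                              # alpha = 1 - lambda from the HoltWinters fit
xs = as.numeric(x)
xhat = numeric(length(xs)); xhat[1] = xs[1]      # initial value, as in (3.150)
for (n in 1:(length(xs) - 1))
  xhat[n+1] = alpha * xs[n] + (1 - alpha) * xhat[n]   # one-step-ahead EWMA recursion (3.150)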

3.8 Building ARIMA Models

There are a few basic steps to fitting ARIMA models to time series data. These steps involve plotting the data, possibly transforming the data, identifying the dependence orders of the model, parameter estimation, diagnostics, and model choice. First, as with any data analysis, we should construct a time plot of the data, and inspect the graph for any anomalies. If, for example, the variability in the data grows with time, it will be necessary to transform the data to stabilize the variance. In such cases, the Box–Cox class of power transformations, equation (2.37), could be employed. Also, the particular application might suggest an appropriate transformation. For example, suppose a process evolves as a fairly small and stable percent-change, such as an investment. For example, we might have

xt = (1 + pt)xt−1,


where x_t is the value of the investment at time t and p_t is the percentage change from period t − 1 to t, which may be negative. Taking logs we have

log(xt) = log(1 + pt) + log(xt−1),

or

∇ log(x_t) = log(1 + p_t).

If the percent change p_t stays relatively small in magnitude, then log(1 + p_t) ≈ p_t (because log(1 + p) = p − p²/2 + p³/3 − · · · for −1 < p ≤ 1, so the higher-order terms in the expansion are negligible when p is a small percent-change) and, thus,

∇ log(x_t) ≈ p_t

will be a relatively stable process. Frequently, ∇ log(x_t) is called the return or growth rate. This general idea was used in Example 3.32, and we will use it again in Example 3.38.
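As a quick numerical illustration (ours, using the gnp series from astsa that is analyzed in Example 3.38), the log-difference and the exact percentage change are nearly identical when the changes are small:

g = as.numeric(gnp)
p = diff(g)/g[-length(g)]   # exact percentage change p_t
r = diff(log(g))            # growth rate, nabla log(x_t)
max(abs(r - p))             # small, since quarterly changes are small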

After suitably transforming the data, the next step is to identify preliminary values of the autoregressive order, p, the order of differencing, d, and the moving average order, q. We have already addressed, in part, the problem of selecting d. A time plot of the data will typically suggest whether any differencing is needed. If differencing is called for, then difference the data once, d = 1, and inspect the time plot of ∇x_t. If additional differencing is necessary, then try differencing again and inspect a time plot of ∇²x_t. Be careful not to overdifference because this may introduce dependence where none exists. For example, x_t = w_t is serially uncorrelated, but ∇x_t = w_t − w_{t−1} is MA(1). In addition to time plots, the sample ACF can help in indicating whether differencing is needed. Because the polynomial φ(z)(1 − z)^d has a unit root, the sample ACF, ρ̂(h), will not decay to zero fast as h increases. Thus, a slow decay in ρ̂(h) is an indication that differencing may be needed.

When preliminary values of d have been settled, the next step is to look at the sample ACF and PACF of ∇^d x_t for whatever values of d have been chosen. Using Table 3.1 as a guide, preliminary values of p and q are chosen. Recall that, if p = 0 and q > 0, the ACF cuts off after lag q, and the PACF tails off. If q = 0 and p > 0, the PACF cuts off after lag p, and the ACF tails off. If p > 0 and q > 0, both the ACF and PACF will tail off. Because we are dealing with estimates, it will not always be clear whether the sample ACF or PACF is tailing off or cutting off. Also, two models that are seemingly different can actually be very similar. With this in mind, we should not worry about being so precise at this stage of the model fitting. At this stage, a few preliminary values of p, d, and q should be at hand, and we can start estimating the parameters.

Example 3.38 Analysis of GNP Data

In this example, we consider the analysis of quarterly U.S. GNP from 1947(1) to 2002(3), n = 223 observations. The data are real U.S. gross national product in billions of chained 1996 dollars and have been seasonally adjusted. The data were obtained from the Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/). Figure 3.12 shows a plot of the data, say, y_t. Because strong trend hides any other effect, it is not clear from Figure 3.12 that the variance is increasing with time. For the purpose of demonstration, the sample ACF of the data is displayed in Figure 3.13. Figure 3.14 shows the first difference of the data, ∇y_t, and now that the trend has been removed we are able to notice that the variability in the second half of the data is larger than in the first half of the data. Also, it appears as though a trend is still present after differencing. The growth rate, say, x_t = ∇ log(y_t), is plotted in Figure 3.15 and appears to be a stable process. Moreover, we may interpret the values of x_t as the percentage quarterly growth of U.S. GNP.

Fig. 3.12. Quarterly U.S. GNP from 1947(1) to 2002(3).

Fig. 3.13. Sample ACF of the GNP data. Lag is in terms of years.

Fig. 3.14. First difference of the U.S. GNP data.

Fig. 3.15. U.S. GNP quarterly growth rate.

The sample ACF and PACF of the quarterly growth rate are plotted in Figure 3.16. Inspecting the sample ACF and PACF, we might feel that the ACF is cutting off at lag 2 and the PACF is tailing off. This would suggest the GNP growth rate follows an MA(2) process, or log GNP follows an ARIMA(0, 1, 2) model. Rather than focus on one model, we will also suggest that it appears that the ACF is tailing off and the PACF is cutting off at lag 1. This suggests an AR(1) model for the growth rate, or ARIMA(1, 1, 0) for log GNP. As a preliminary analysis, we will fit both models.

Fig. 3.16. Sample ACF and PACF of the GNP quarterly growth rate. Lag is in terms of years.

Using MLE to fit the MA(2) model for the growth rate, x_t, the estimated model is

xt = .008(.001) + .303(.065)wt−1 + .204(.064)wt−2 + wt, (3.151)

where σ̂_w = .0094 is based on 219 degrees of freedom. The values in parentheses are the corresponding estimated standard errors. All of the regression coefficients are significant, including the constant. We make a special note of this because, as a default, some computer packages do not fit a constant in a differenced model. That is, these packages assume, by default, that there is no drift. In this example, not including a constant leads to the wrong conclusions about the nature of the U.S. economy. Not including a constant assumes the average quarterly growth rate is zero, whereas the U.S. GNP average quarterly growth rate is about 1% (which can be seen easily in Figure 3.15). We leave it to the reader to investigate what happens when the constant is not included.

The estimated AR(1) model is

xt = .008(.001) (1− .347) + .347(.063)xt−1 + wt, (3.152)

where σ̂_w = .0095 on 220 degrees of freedom; note that the constant in (3.152) is .008 (1 − .347) = .005.

We will discuss diagnostics next, but assuming both of these models fit well, how are we to reconcile the apparent differences of the estimated models


(3.151) and (3.152)? In fact, the fitted models are nearly the same. To show this, consider an AR(1) model of the form in (3.152) without a constant term; that is,

xt = .35xt−1 + wt,

and write it in its causal form, x_t = Σ_{j=0}^{∞} ψ_j w_{t−j}, where we recall ψ_j = .35^j. Thus, ψ_0 = 1, ψ_1 = .350, ψ_2 = .123, ψ_3 = .043, ψ_4 = .015, ψ_5 = .005, ψ_6 = .002, ψ_7 = .001, ψ_8 = 0, ψ_9 = 0, ψ_10 = 0, and so forth. Thus,

x_t ≈ .35w_{t−1} + .12w_{t−2} + w_t,

which is similar to the fitted MA(2) model in (3.151).

The analysis can be performed in R as follows.

plot(gnp)
acf2(gnp, 50)
gnpgr = diff(log(gnp))        # growth rate
plot(gnpgr)
acf2(gnpgr, 24)
sarima(gnpgr, 1, 0, 0)        # AR(1)
sarima(gnpgr, 0, 0, 2)        # MA(2)
ARMAtoMA(ar=.35, ma=0, 10)    # prints psi-weights

The next step in model fitting is diagnostics. This investigation includes the analysis of the residuals as well as model comparisons. Again, the first step involves a time plot of the innovations (or residuals), x_t − x̂_t^{t−1}, or of the standardized innovations

e_t = ( x_t − x̂_t^{t−1} ) / √P̂_t^{t−1},    (3.153)

where x̂_t^{t−1} is the one-step-ahead prediction of x_t based on the fitted model and P̂_t^{t−1} is the estimated one-step-ahead error variance. If the model fits well, the standardized residuals should behave as an iid sequence with mean zero and variance one. The time plot should be inspected for any obvious departures from this assumption. Unless the time series is Gaussian, it is not enough that the residuals are uncorrelated. For example, it is possible in the non-Gaussian case to have an uncorrelated process for which values contiguous in time are highly dependent. As an example, we mention the family of GARCH models that are discussed in Chapter 5.

Investigation of marginal normality can be accomplished visually by looking at a histogram of the residuals. In addition to this, a normal probability plot or a Q-Q plot can help in identifying departures from normality. See Johnson and Wichern (1992, Chapter 4) for details of these methods as well as additional tests for multivariate normality.

There are several tests of randomness, for example the runs test, that could be applied to the residuals. We could also inspect the sample autocorrelations of the residuals, say, ρ̂_e(h), for any patterns or large values. Recall that, for a white noise sequence, the sample autocorrelations are approximately independently and normally distributed with zero means and variances 1/n. Hence, a


good check on the correlation structure of the residuals is to plot ρ̂_e(h) versus h along with the error bounds of ±2/√n. The residuals from a model fit, however, will not quite have the properties of a white noise sequence and the variance of ρ̂_e(h) can be much less than 1/n. Details can be found in Box and Pierce (1970) and McLeod (1978). This part of the diagnostics can be viewed as a visual inspection of ρ̂_e(h) with the main concern being the detection of obvious departures from the independence assumption.

In addition to plotting ρ̂_e(h), we can perform a general test that takes into consideration the magnitudes of ρ̂_e(h) as a group. For example, it may be the case that, individually, each ρ̂_e(h) is small in magnitude, say, each one is just slightly less than 2/√n in magnitude, but, collectively, the values are large. The Ljung–Box–Pierce Q-statistic given by

Q = n(n + 2) Σ_{h=1}^{H} ρ̂²_e(h) / (n − h)    (3.154)

can be used to perform such a test. The value H in (3.154) is chosen somewhat arbitrarily, typically, H = 20. Under the null hypothesis of model adequacy, asymptotically (n → ∞), Q ∼ χ²_{H−p−q}. Thus, we would reject the null hypothesis at level α if the value of Q exceeds the (1 − α)-quantile of the χ²_{H−p−q} distribution. Details can be found in Box and Pierce (1970), Ljung and Box (1978), and Davies et al. (1977). The basic idea is that if w_t is white noise, then by Property 1.1, n ρ̂²_w(h), for h = 1, . . . , H, are asymptotically independent χ²_1 random variables. This means that n Σ_{h=1}^{H} ρ̂²_w(h) is approximately a χ²_H random variable. Because the test involves the ACF of residuals from a model fit, there is a loss of p + q degrees of freedom; the other values in (3.154) are used to adjust the statistic to better match the asymptotic chi-squared distribution.
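In R, the statistic in (3.154) is what Box.test computes with type="Ljung-Box"; the fitdf argument supplies the p + q degrees-of-freedom adjustment. A sketch (ours), applied to the MA(2) fit of the GNP growth rate:

gnpgr = diff(log(gnp))
fit = arima(gnpgr, order = c(0, 0, 2))
Box.test(resid(fit), lag = 20, type = "Ljung-Box", fitdf = 2)   # H = 20, df = H - p - q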

Example 3.39 Diagnostics for GNP Growth Rate Example

We will focus on the MA(2) fit from Example 3.38; the analysis of the AR(1) residuals is similar. Figure 3.17 displays a plot of the standardized residuals, the ACF of the residuals, a normal Q-Q plot of the standardized residuals, and the p-values associated with the Q-statistic, (3.154), at lags H = 3 through H = 20 (with corresponding degrees of freedom H − 2).

Inspection of the time plot of the standardized residuals in Figure 3.17 shows no obvious patterns. Notice that there are outliers, however, with a few values exceeding 3 standard deviations in magnitude. The ACF of the standardized residuals shows no apparent departure from the model assumptions, and the Q-statistic is never significant at the lags shown. The normal Q-Q plot of the residuals shows departure from normality at the tails due to the outliers that occurred primarily in the 1950s and the early 1980s.

The model appears to fit well except for the fact that a distribution with heavier tails than the normal distribution should be employed. We discuss some possibilities in Chapters 5 and 6. The diagnostics shown in Figure 3.17 are a by-product of the sarima command from the previous example. (The script tsdiag is also available in R to run diagnostics for an ARIMA object; however, that script has errors and we do not recommend using it.)

Fig. 3.17. Diagnostics of the residuals from MA(2) fit on GNP growth rate.

Example 3.40 Diagnostics for the Glacial Varve Series

In Example 3.32, we fit an ARIMA(0, 1, 1) model to the logarithms of the glacial varve data and there appears to be a small amount of autocorrelation left in the residuals and the Q-tests are all significant; see Figure 3.18.

To adjust for this problem, we fit an ARIMA(1, 1, 1) to the logged varve data and obtained the estimates

φ̂ = .23 (.05),  θ̂ = −.89 (.03),  σ̂²_w = .23.

Hence the AR term is significant. The Q-statistic p-values for this model are also displayed in Figure 3.18, and it appears this model fits the data well.

Fig. 3.18. Q-statistic p-values for the ARIMA(0, 1, 1) fit [top] and the ARIMA(1, 1, 1) fit [bottom] to the logged varve data.

As previously stated, the diagnostics are byproducts of the individual sarima runs. We note that we did not fit a constant in either model because there is no apparent drift in the differenced, logged varve series. This fact can be verified by noting the constant is not significant when the command no.constant=TRUE is removed in the code:

sarima(log(varve), 0, 1, 1, no.constant=TRUE)   # ARIMA(0,1,1)
sarima(log(varve), 1, 1, 1, no.constant=TRUE)   # ARIMA(1,1,1)

In Example 3.38, we have two competing models, an AR(1) and an MA(2) on the GNP growth rate, that each appear to fit the data well. In addition, we might also consider that an AR(2) or an MA(3) might do better for forecasting. Perhaps combining both models, that is, fitting an ARMA(1, 2) to the GNP growth rate, would be the best. As previously mentioned, we have to be concerned with overfitting the model; it is not always the case that more is better. Overfitting leads to less-precise estimators, and adding more parameters may fit the data better but may also lead to bad forecasts. This result is illustrated in the following example.

Example 3.41 A Problem with Overfitting

Figure 3.19 shows the U.S. population by official census, every ten years from 1910 to 1990, as points. If we use these nine observations to predict the future population, we can use an eight-degree polynomial so the fit to the nine observations is perfect. The model in this case is

x_t = β_0 + β_1 t + β_2 t² + · · · + β_8 t⁸ + w_t.

The fitted line, which is plotted in the figure, passes through the nine observations. The model predicts that the population of the United States will be close to zero in the year 2000, and will cross zero sometime in the year 2002!


Fig. 3.19. A perfect fit and a terrible forecast.
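A sketch of this kind of fit (ours; the nine census totals below are approximate values in millions, supplied only so the code runs, and are not taken from the text):

year = seq(1910, 1990, by = 10)
pop  = c(92.2, 106.0, 123.2, 132.2, 151.3, 179.3, 203.3, 226.5, 248.7)  # approximate values
fit  = lm(pop ~ poly(year, 8))              # nine points, nine coefficients: a perfect fit
new  = data.frame(year = 1910:2010)
plot(year, pop, xlim = c(1910, 2010), ylim = c(0, 300))
lines(new$year, predict(fit, new))          # the extrapolation beyond 1990 behaves wildly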

The final step of model fitting is model choice or model selection. That is, we must decide which model we will retain for forecasting. The most popular techniques, AIC, AICc, and BIC, were described in §2.2 in the context of regression models.

Example 3.42 Model Choice for the U.S. GNP Series

Returning to the analysis of the U.S. GNP data presented in Examples 3.38 and 3.39, recall that two models, an AR(1) and an MA(2), fit the GNP growth rate well. To choose the final model, we compare the AIC, the AICc, and the BIC for both models. These values are a byproduct of the sarima runs displayed at the end of Example 3.38, but for convenience, we display them again here (recall the growth rate data are in gnpgr):

sarima(gnpgr, 1, 0, 0)   # AR(1)
  $AIC: -8.294403  $AICc: -8.284898  $BIC: -9.263748
sarima(gnpgr, 0, 0, 2)   # MA(2)
  $AIC: -8.297693  $AICc: -8.287854  $BIC: -9.251711

The AIC and AICc both prefer the MA(2) fit, whereas the BIC prefers the simpler AR(1) model. It is often the case that the BIC will select a model of smaller order than the AIC or AICc. It would not be unreasonable in this case to retain the AR(1) because pure autoregressive models are easier to work with.


3.9 Multiplicative Seasonal ARIMA Models

In this section, we introduce several modifications made to the ARIMA model to account for seasonal and nonstationary behavior. Often, the dependence on the past tends to occur most strongly at multiples of some underlying seasonal lag s. For example, with monthly economic data, there is a strong yearly component occurring at lags that are multiples of s = 12, because of the strong connections of all activity to the calendar year. Data taken quarterly will exhibit the yearly repetitive period at s = 4 quarters. Natural phenomena such as temperature also have strong components corresponding to seasons. Hence, the natural variability of many physical, biological, and economic processes tends to match with seasonal fluctuations. Because of this, it is appropriate to introduce autoregressive and moving average polynomials that identify with the seasonal lags. The resulting pure seasonal autoregressive moving average model, say, ARMA(P, Q)_s, then takes the form

ΦP (Bs)xt = ΘQ(Bs)wt, (3.155)

with the following definition.

Definition 3.12 The operators

Φ_P(B^s) = 1 − Φ_1 B^s − Φ_2 B^{2s} − · · · − Φ_P B^{Ps}    (3.156)

and

Θ_Q(B^s) = 1 + Θ_1 B^s + Θ_2 B^{2s} + · · · + Θ_Q B^{Qs}    (3.157)

are the seasonal autoregressive operator and the seasonal moving average operator of orders P and Q, respectively, with seasonal period s.

Analogous to the properties of nonseasonal ARMA models, the pure seasonal ARMA(P, Q)_s is causal only when the roots of Φ_P(z^s) lie outside the unit circle, and it is invertible only when the roots of Θ_Q(z^s) lie outside the unit circle.

Example 3.43 A Seasonal ARMA Series

A first-order seasonal autoregressive moving average series that might run over months could be written as

(1 − ΦB^12)x_t = (1 + ΘB^12)w_t

or

x_t = Φx_{t−12} + w_t + Θw_{t−12}.

This model exhibits the series x_t in terms of past lags at the multiple of the yearly seasonal period s = 12 months. It is clear from the above form that estimation and forecasting for such a process involves only straightforward modifications of the unit lag case already treated. In particular, the causal condition requires |Φ| < 1, and the invertible condition requires |Θ| < 1.


Table 3.3. Behavior of the ACF and PACF for Pure SARMA Models

            AR(P)_s                   MA(Q)_s                  ARMA(P, Q)_s
  ACF*      Tails off at lags ks,     Cuts off after lag Qs    Tails off at lags ks
            k = 1, 2, . . .
  PACF*     Cuts off after lag Ps     Tails off at lags ks,    Tails off at lags ks
                                      k = 1, 2, . . .

  *The values at nonseasonal lags h ≠ ks, for k = 1, 2, . . ., are zero.

For the first-order seasonal (s = 12) MA model, x_t = w_t + Θw_{t−12}, it is easy to verify that

γ(0) = (1 + Θ²)σ²
γ(±12) = Θσ²
γ(h) = 0, otherwise.

Thus, the only nonzero correlation, aside from lag zero, is

ρ(±12) = Θ/(1 + Θ²).

For the first-order seasonal (s = 12) AR model, using the techniques of the nonseasonal AR(1), we have

γ(0) = σ²/(1 − Φ²)
γ(±12k) = σ²Φ^k/(1 − Φ²),  k = 1, 2, . . .
γ(h) = 0, otherwise.

In this case, the only non-zero correlations are

ρ(±12k) = Φ^k,  k = 0, 1, 2, . . . .

These results can be verified using the general result that γ(h) = Φγ(h − 12), for h ≥ 1. For example, when h = 1, γ(1) = Φγ(11), but when h = 11, we have γ(11) = Φγ(1), which implies that γ(1) = γ(11) = 0. In addition to these results, the PACF have the analogous extensions from nonseasonal to seasonal models.
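These autocorrelation results are easy to check numerically with ARMAacf; for instance (a sketch, ours, with Φ = .9, a value chosen only for illustration):

Phi = .9
rho = ARMAacf(ar = c(rep(0, 11), Phi), lag.max = 36)
round(rho[c(13, 25, 37)], 3)   # lags 12, 24, 36: equal to Phi, Phi^2, Phi^3; other lags are zero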

As an initial diagnostic criterion, we can use the properties for the pure seasonal autoregressive and moving average series listed in Table 3.3. These properties may be considered as generalizations of the properties for nonseasonal models that were presented in Table 3.1.

In general, we can combine the seasonal and nonseasonal operators into a multiplicative seasonal autoregressive moving average model, denoted by ARMA(p, q) × (P, Q)_s, and write

ΦP (Bs)φ(B)xt = ΘQ(Bs)θ(B)wt (3.158)


as the overall model. Although the diagnostic properties in Table 3.3 are not strictly true for the overall mixed model, the behavior of the ACF and PACF tends to show rough patterns of the indicated form. In fact, for mixed models, we tend to see a mixture of the facts listed in Tables 3.1 and 3.3. In fitting such models, focusing on the seasonal autoregressive and moving average components first generally leads to more satisfactory results.

Example 3.44 A Mixed Seasonal Model

Consider an ARMA(0, 1)× (1, 0)12 model

xt = Φxt−12 + wt + θwt−1,

where |Φ| < 1 and |θ| < 1. Then, because x_{t−12}, w_t, and w_{t−1} are uncorrelated, and x_t is stationary, γ(0) = Φ²γ(0) + σ²_w + θ²σ²_w, or

γ(0) = [ (1 + θ²)/(1 − Φ²) ] σ²_w.

In addition, multiplying the model by x_{t−h}, h > 0, and taking expectations, we have γ(1) = Φγ(11) + θσ²_w, and γ(h) = Φγ(h − 12), for h ≥ 2. Thus, the ACF for this model is

ρ(12h) = Φ^h,  h = 1, 2, . . .
ρ(12h − 1) = ρ(12h + 1) = [ θ/(1 + θ²) ] Φ^h,  h = 0, 1, 2, . . . ,
ρ(h) = 0, otherwise.

The ACF and PACF for this model, with Φ = .8 and θ = −.5, are shown in Figure 3.20. These types of correlation relationships, although idealized here, are typically seen with seasonal data.

To reproduce Figure 3.20 in R, use the following commands:

phi = c(rep(0,11), .8)
ACF = ARMAacf(ar=phi, ma=-.5, 50)[-1]           # [-1] removes 0 lag
PACF = ARMAacf(ar=phi, ma=-.5, 50, pacf=TRUE)
par(mfrow=c(1,2))
plot(ACF, type="h", xlab="lag", ylim=c(-.4,.8)); abline(h=0)
plot(PACF, type="h", xlab="lag", ylim=c(-.4,.8)); abline(h=0)

Seasonal nonstationarity can occur, for example, when the process is nearly periodic in the season. For example, with average monthly temperatures over the years, each January would be approximately the same, each February would be approximately the same, and so on. In this case, we might think of average monthly temperature x_t as being modeled as

x_t = S_t + w_t,

where S_t is a seasonal component that varies slowly from one year to the next, according to a random walk,


Fig. 3.20. ACF and PACF of the mixed seasonal ARMA model x_t = .8x_{t−12} + w_t − .5w_{t−1}.

St = St−12 + vt.

In this model, w_t and v_t are uncorrelated white noise processes. The tendency of data to follow this type of model will be exhibited in a sample ACF that is large and decays very slowly at lags h = 12k, for k = 1, 2, . . . . If we subtract the effect of successive years from each other, we find that

(1−B12)xt = xt − xt−12 = vt + wt − wt−12.

This model is a stationary MA(1)_12, and its ACF will have a peak only at lag 12. In general, seasonal differencing can be indicated when the ACF decays slowly at multiples of some season s, but is negligible between the periods. Then, a seasonal difference of order D is defined as

∇_s^D x_t = (1 − B^s)^D x_t,    (3.159)

where D = 1, 2, . . ., takes positive integer values. Typically, D = 1 is sufficient to obtain seasonal stationarity. Incorporating these ideas into a general model leads to the following definition.

Definition 3.13 The multiplicative seasonal autoregressive integrated moving average model, or SARIMA model, is given by

Φ_P(B^s) φ(B) ∇_s^D ∇^d x_t = δ + Θ_Q(B^s) θ(B) w_t,    (3.160)

where w_t is the usual Gaussian white noise process. The general model is denoted as ARIMA(p, d, q) × (P, D, Q)_s. The ordinary autoregressive and moving average components are represented by polynomials φ(B) and θ(B) of orders p and q, respectively [see (3.5) and (3.18)], the seasonal autoregressive and moving average components by Φ_P(B^s) and Θ_Q(B^s) [see (3.156) and (3.157)] of orders P and Q, and the ordinary and seasonal difference components by ∇^d = (1 − B)^d and ∇_s^D = (1 − B^s)^D.


Fig. 3.21. Values of the Monthly Federal Reserve Board Production Index and Unemployment (1948–1978, n = 372 months).

Example 3.45 An SARIMA Model

Consider the following model, which often provides a reasonable representation for seasonal, nonstationary, economic time series. We exhibit the equations for the model, denoted by ARIMA(0, 1, 1) × (0, 1, 1)_12 in the notation given above, where the seasonal fluctuations occur every 12 months. Then, the model (3.160) becomes

(1−B12)(1−B)xt = (1 +ΘB12)(1 + θB)wt. (3.161)

Expanding both sides of (3.161) leads to the representation

(1−B −B12 +B13)xt = (1 + θB +ΘB12 +ΘθB13)wt,

or in difference equation form

xt = xt−1 + xt−12 − xt−13 + wt + θwt−1 +Θwt−12 +Θθwt−13.

Note that the multiplicative nature of the model implies that the coefficient of w_{t−13} is the product of the coefficients of w_{t−1} and w_{t−12} rather than a free parameter. The multiplicative model assumption seems to work well with many seasonal time series data sets while reducing the number of parameters that must be estimated.
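To get a feel for data generated by (3.161), one can simulate the multiplicative MA part and then undo the two differences; this sketch (ours, with illustrative values of θ and Θ that are not from the text) uses diffinv to invert (1 − B^12) and (1 − B):

set.seed(1)
theta = -0.4; Theta = -0.6                                         # illustrative values only
y = arima.sim(n = 500, list(ma = c(theta, rep(0, 10), Theta, theta*Theta)))  # MA part of (3.161)
x = diffinv(diffinv(y, lag = 12), lag = 1)                         # invert the seasonal and ordinary differences
plot.ts(x)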

Selecting the appropriate model for a given set of data from all of those represented by the general form (3.160) is a daunting task, and we usually think first in terms of finding difference operators that produce a roughly stationary series and then in terms of finding a set of simple autoregressive moving average or multiplicative seasonal ARMA to fit the resulting residual series. Differencing operations are applied first, and then the residuals are constructed from a series of reduced length. Next, the ACF and the PACF of these residuals are evaluated. Peaks that appear in these functions can often be eliminated by fitting an autoregressive or moving average component in accordance with the general properties of Tables 3.1 and 3.3. In considering whether the model is satisfactory, the diagnostic techniques discussed in §3.8 still apply.

Fig. 3.22. ACF and PACF of the production series.

Example 3.46 The Federal Reserve Board Production Index

A problem of great interest in economics involves first identifying a model within the Box–Jenkins class for a given time series and then producing forecasts based on the model. For example, we might consider applying this methodology to the Federal Reserve Board Production Index shown in Figure 3.21. For demonstration purposes only, the ACF and PACF for this series are shown in Figure 3.22. We note that the trend in the data, the slow decay in the ACF, and the fact that the PACF at the first lag is nearly 1, all indicate nonstationary behavior.

Following the recommended procedure, a first difference was taken, and the ACF and PACF of the first difference

∇x_t = x_t − x_{t−1}

are shown in Figure 3.23. Noting the peaks at seasonal lags, h = 1s, 2s, 3s, 4s, where s = 12 (i.e., h = 12, 24, 36, 48), with relatively slow decay suggests a seasonal difference. Figure 3.24 shows the ACF and PACF of the seasonal difference of the differenced production, say,

∇_12 ∇x_t = (1 − B^12)(1 − B)x_t.

Fig. 3.23. ACF and PACF of differenced production, (1 − B)x_t.

First, concentrating on the seasonal (s = 12) lags, the characteristics of the ACF and PACF of this series tend to show a strong peak at h = 1s in the autocorrelation function, with smaller peaks appearing at h = 2s, 3s, combined with peaks at h = 1s, 2s, 3s, 4s in the partial autocorrelation function. It appears that either
(i) the ACF is cutting off after lag 1s and the PACF is tailing off in the seasonal lags,
(ii) the ACF is cutting off after lag 3s and the PACF is tailing off in the seasonal lags, or
(iii) the ACF and PACF are both tailing off in the seasonal lags.
Using Table 3.3, this suggests either (i) an SMA of order Q = 1, (ii) an SMA of order Q = 3, or (iii) an SARMA of orders P = 2 (because of the two spikes in the PACF) and Q = 1.

Next, inspecting the ACF and the PACF at the within season lags, h = 1, . . . , 11, it appears that either (a) both the ACF and PACF are tailing off, or (b) that the PACF cuts off at lag 2. Based on Table 3.1, this result indicates that we should either consider fitting a model (a) with both p > 0 and q > 0 for the nonseasonal components, say p = 1, q = 1, or (b) p = 2, q = 0.


Fig. 3.24. ACF and PACF of first differenced and then seasonally differenced production, (1 − B)(1 − B^{12})x_t.

It turns out that there is little difference in the results for cases (a) and (b), but (b) is slightly better, so we will concentrate on case (b).

Fitting the three models suggested by these observations we obtain:
(i) ARIMA(2, 1, 0) × (0, 1, 1)_{12}: AIC = 1.372, AICc = 1.378, BIC = .404
(ii) ARIMA(2, 1, 0) × (0, 1, 3)_{12}: AIC = 1.299, AICc = 1.305, BIC = .351
(iii) ARIMA(2, 1, 0) × (2, 1, 1)_{12}: AIC = 1.326, AICc = 1.332, BIC = .379
The ARIMA(2, 1, 0) × (0, 1, 3)_{12} is the preferred model, and the fitted model in this case is

(1 − .30(.05) B − .11(.05) B^2) ∇_{12}∇x_t = (1 − .74(.05) B^{12} − .14(.06) B^{24} + .28(.05) B^{36}) w_t

with σ̂²_w = 1.312 (estimated standard errors are given in parentheses).

The diagnostics for the fit are displayed in Figure 3.25. We note the few outliers in the series as exhibited in the plot of the standardized residuals and their normal Q-Q plot, and a small amount of autocorrelation that still remains (although not at the seasonal lags), but otherwise, the model fits well. Finally, forecasts based on the fitted model for the next 12 months are shown in Figure 3.26.


Fig. 3.25. Diagnostics for the ARIMA(2, 1, 0) × (0, 1, 3)_{12} fit on the Production Index.

The following R code can be used to perform the analysis.
acf2(prodn, 48)
acf2(diff(prodn), 48)
acf2(diff(diff(prodn), 12), 48)
sarima(prodn, 2, 1, 0, 0, 1, 3, 12)         # fit model (ii)
sarima.for(prodn, 12, 2, 1, 0, 0, 1, 3, 12) # forecast the next 12 months
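One possible extension (a sketch, not part of the original analysis) is to collect the three candidate fits and compare their information criteria directly; this assumes the astsa package's sarima() is loaded as above and that it returns AIC, AICc, and BIC components, consistent with the values quoted earlier.
fit1 = sarima(prodn, 2,1,0, 0,1,1, 12)    # model (i)
fit2 = sarima(prodn, 2,1,0, 0,1,3, 12)    # model (ii)
fit3 = sarima(prodn, 2,1,0, 2,1,1, 12)    # model (iii)
c(fit1$AIC, fit2$AIC, fit3$AIC)           # smaller is better; similarly $AICc and $BIC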

Problems

Section 3.2

3.1 For an MA(1), x_t = w_t + θw_{t−1}, show that |ρ_x(1)| ≤ 1/2 for any number θ. For which values of θ does ρ_x(1) attain its maximum and minimum?

3.2 Let w_t be white noise with variance σ²_w and let |φ| < 1 be a constant. Consider the process x_1 = w_1, and

x_t = φx_{t−1} + w_t,   t = 2, 3, . . . .


Fig. 3.26. Forecasts and limits for production index. The vertical dotted line separates the data from the predictions.

(a) Find the mean and the variance of {x_t, t = 1, 2, . . .}. Is x_t stationary?
(b) Show

corr(x_t, x_{t−h}) = φ^h [var(x_{t−h}) / var(x_t)]^{1/2}   for h ≥ 0.

(c) Argue that for large t,

var(x_t) ≈ σ²_w / (1 − φ²)

and

corr(x_t, x_{t−h}) ≈ φ^h,   h ≥ 0,

so in a sense, x_t is “asymptotically stationary.”
(d) Comment on how you could use these results to simulate n observations of a stationary Gaussian AR(1) model from simulated iid N(0,1) values.
(e) Now suppose x_1 = w_1/√(1 − φ²). Is this process stationary?

3.3 Verify the calculations made in Example 3.3:

(a) Let x_t = φx_{t−1} + w_t where |φ| > 1 and w_t ∼ iid N(0, σ²_w). Show E(x_t) = 0 and γ_x(h) = σ²_w φ^{−2} φ^{−h}/(1 − φ^{−2}).
(b) Let y_t = φ^{−1}y_{t−1} + v_t where v_t ∼ iid N(0, σ²_w φ^{−2}) and φ and σ_w are as in part (a). Argue that y_t is causal with the same mean function and autocovariance function as x_t.


3.4 Identify the following models as ARMA(p, q) models (watch out for parameter redundancy), and determine whether they are causal and/or invertible:

(a) x_t = .80x_{t−1} − .15x_{t−2} + w_t − .30w_{t−1}.
(b) x_t = x_{t−1} − .50x_{t−2} + w_t − w_{t−1}.

3.5 Verify the causal conditions for an AR(2) model given in (3.28). That is, show that an AR(2) is causal if and only if (3.28) holds.

Section 3.3

3.6 For the AR(2) model given by x_t = −.9x_{t−2} + w_t, find the roots of the autoregressive polynomial, and then sketch the ACF, ρ(h).

3.7 For the AR(2) series shown below, use the results of Example 3.9 to determine a set of difference equations that can be used to find the ACF ρ(h), h = 0, 1, . . .; solve for the constants in the ACF using the initial conditions. Then plot the ACF values to lag 10 (use ARMAacf as a check on your answers).

(a) x_t + 1.6x_{t−1} + .64x_{t−2} = w_t.
(b) x_t − .40x_{t−1} − .45x_{t−2} = w_t.
(c) x_t − 1.2x_{t−1} + .85x_{t−2} = w_t.

Section 3.4

3.8 Verify the calculations for the autocorrelation function of an ARMA(1, 1) process given in Example 3.13. Compare the form with that of the ACF for the ARMA(1, 0) and the ARMA(0, 1) series. Plot (or sketch) the ACFs of the three series on the same graph for φ = .6, θ = .9, and comment on the diagnostic capabilities of the ACF in this case.

3.9 Generate n = 100 observations from each of the three models discussed in Problem 3.8. Compute the sample ACF for each model and compare it to the theoretical values. Compute the sample PACF for each of the generated series and compare the sample ACFs and PACFs with the general results given in Table 3.1.

Section 3.5

3.10 Let x_t represent the cardiovascular mortality series (cmort) discussed in Chapter 2, Example 2.2.

(a) Fit an AR(2) to x_t using linear regression as in Example 3.17.
(b) Assuming the fitted model in (a) is the true model, find the forecasts over a four-week horizon, x^n_{n+m}, for m = 1, 2, 3, 4, and the corresponding 95% prediction intervals.


3.11 Consider the MA(1) series

xt = wt + θwt−1,

where w_t is white noise with variance σ²_w.

(a) Derive the minimum mean-square error one-step forecast based on the infinite past, and determine the mean-square error of this forecast.
(b) Let x^n_{n+1} be the truncated one-step-ahead forecast as given in (3.92). Show that

E[(x_{n+1} − x^n_{n+1})²] = σ²(1 + θ^{2+2n}).

Compare the result with (a), and indicate how well the finite approximation works in this case.

3.12 In the context of equation (3.63), show that, if γ(0) > 0 and γ(h) → 0 as h → ∞, then Γ_n is positive definite.

3.13 Suppose x_t is stationary with zero mean and recall the definition of the PACF given by (3.55) and (3.56). That is, let

ε_t = x_t − Σ_{i=1}^{h−1} a_i x_{t−i}

and

δ_{t−h} = x_{t−h} − Σ_{j=1}^{h−1} b_j x_{t−j}

be the two residuals where {a_1, . . . , a_{h−1}} and {b_1, . . . , b_{h−1}} are chosen so that they minimize the mean-squared errors

E[ε_t²]   and   E[δ_{t−h}²].

The PACF at lag h was defined as the cross-correlation between ε_t and δ_{t−h}; that is,

φ_{hh} = E(ε_t δ_{t−h}) / √(E(ε_t²) E(δ_{t−h}²)).

Let R_h be the h × h matrix with elements ρ(i − j), i, j = 1, . . . , h, and let ρ_h = (ρ(1), ρ(2), . . . , ρ(h))′ be the vector of lagged autocorrelations, ρ(h) = corr(x_{t+h}, x_t). Let ρ̃_h = (ρ(h), ρ(h − 1), . . . , ρ(1))′ be the reversed vector. In addition, let x_t^h denote the BLP of x_t given {x_{t−1}, . . . , x_{t−h}}:

x_t^h = α_{h1} x_{t−1} + · · · + α_{hh} x_{t−h},

as described in Property 3.3. Prove


φ_{hh} = [ρ(h) − ρ̃′_{h−1} R_{h−1}^{−1} ρ_{h−1}] / [1 − ρ̃′_{h−1} R_{h−1}^{−1} ρ̃_{h−1}] = α_{hh}.

In particular, this result proves Property 3.4.
Hint: Divide the prediction equations [see (3.63)] by γ(0) and write the matrix equation in the partitioned form as

( R_{h−1}     ρ̃_{h−1} ) ( α_1    )     ( ρ_{h−1} )
( ρ̃′_{h−1}    ρ(0)    ) ( α_{hh} )  =  ( ρ(h)    ),

where the h × 1 vector of coefficients α = (α_{h1}, . . . , α_{hh})′ is partitioned as α = (α′_1, α_{hh})′.

3.14 Suppose we wish to find a prediction function g(x) that minimizes

MSE = E[(y − g(x))²],

where x and y are jointly distributed random variables with density function f(x, y).

(a) Show that MSE is minimized by the choice

g(x) = E(y | x).

Hint:

MSE = ∫ [ ∫ (y − g(x))² f(y|x) dy ] f(x) dx.

(b) Apply the above result to the model

y = x² + z,

where x and z are independent zero-mean normal variables with variance one. Show that MSE = 1.
(c) Suppose we restrict our choices for the function g(x) to linear functions of the form

g(x) = a + bx

and determine a and b to minimize MSE. Show that a = 1 and

b = E(xy)/E(x²) = 0

and MSE = 3. What do you interpret this to mean?

3.15 For an AR(1) model, determine the general form of the m-step-ahead forecast x^t_{t+m} and show

E[(x_{t+m} − x^t_{t+m})²] = σ²_w (1 − φ^{2m}) / (1 − φ²).


3.16 Consider the ARMA(1,1) model discussed in Example 3.7, equation (3.27); that is, x_t = .9x_{t−1} + .5w_{t−1} + w_t. Show that truncated prediction as defined in (3.91) is equivalent to truncated prediction using the recursive formula (3.92).

3.17 Verify statement (3.87), that for a fixed sample size, the ARMA prediction errors are correlated.

Section 3.6

3.18 Fit an AR(2) model to the cardiovascular mortality series (cmort) discussed in Chapter 2, Example 2.2, using linear regression and using Yule–Walker.

(a) Compare the parameter estimates obtained by the two methods.
(b) Compare the estimated standard errors of the coefficients obtained by linear regression with their corresponding asymptotic approximations, as given in Property 3.10.

3.19 Suppose x1, . . . , xn are observations from an AR(1) process with µ = 0.

(a) Show the backcasts can be written as x^n_t = φ^{1−t} x_1, for t ≤ 1.
(b) In turn, show, for t ≤ 1, the backcasted errors are

w_t(φ) = x^n_t − φ x^n_{t−1} = φ^{1−t}(1 − φ²) x_1.

(c) Use the result of (b) to show Σ_{t=−∞}^{1} w_t²(φ) = (1 − φ²) x_1².
(d) Use the result of (c) to verify the unconditional sum of squares, S(φ), can be written as Σ_{t=−∞}^{n} w_t²(φ).
(e) Find x_t^{t−1} and r_t for 1 ≤ t ≤ n, and show that

S(φ) = Σ_{t=1}^{n} (x_t − x_t^{t−1})² / r_t.

3.20 Repeat the following numerical exercise three times. Generate n = 500 observations from the ARMA model given by

x_t = .9x_{t−1} + w_t − .9w_{t−1},

with w_t ∼ iid N(0, 1). Plot the simulated data, compute the sample ACF and PACF of the simulated data, and fit an ARMA(1, 1) model to the data. What happened and how do you explain the results?

3.21 Generate 10 realizations of length n = 200 each of an ARMA(1,1) process with φ = .9, θ = .5 and σ² = 1. Find the MLEs of the three parameters in each case and compare the estimators to the true values.


3.22 Generate n = 50 observations from a Gaussian AR(1) model with φ = .99 and σ_w = 1. Using an estimation technique of your choice, compare the approximate asymptotic distribution of your estimate (the one you would use for inference) with the results of a bootstrap experiment (use B = 200).

3.23 Using Example 3.31 as your guide, find the Gauss–Newton procedure for estimating the autoregressive parameter, φ, from the AR(1) model, x_t = φx_{t−1} + w_t, given data x_1, . . . , x_n. Does this procedure produce the unconditional or the conditional estimator? Hint: Write the model as w_t(φ) = x_t − φx_{t−1}; your solution should work out to be a non-recursive procedure.

3.24 Consider the stationary series generated by

x_t = α + φx_{t−1} + w_t + θw_{t−1},

where E(x_t) = µ, |θ| < 1, |φ| < 1 and the w_t are iid random variables with zero mean and variance σ²_w.

(a) Determine the mean as a function of α for the above model. Find the autocovariance and ACF of the process x_t, and show that the process is weakly stationary. Is the process strictly stationary?
(b) Prove the limiting distribution as n → ∞ of the sample mean,

x̄ = n^{−1} Σ_{t=1}^{n} x_t,

is normal, and find its limiting mean and variance in terms of α, φ, θ, and σ²_w. (Note: This part uses results from Appendix A.)

3.25 A problem of interest in the analysis of geophysical time series involves a simple model for observed data containing a signal and a reflected version of the signal with unknown amplification factor a and unknown time delay δ. For example, the depth of an earthquake is proportional to the time delay δ for the P wave and its reflected form pP on a seismic record. Assume the signal, say s_t, is white and Gaussian with variance σ²_s, and consider the generating model

x_t = s_t + a s_{t−δ}.

(a) Prove the process x_t is stationary. If |a| < 1, show that

s_t = Σ_{j=0}^{∞} (−a)^j x_{t−δj}

is a mean square convergent representation for the signal s_t, for t = 0, ±1, ±2, . . ..
(b) If the time delay δ is assumed to be known, suggest an approximate computational method for estimating the parameters a and σ²_s using maximum likelihood and the Gauss–Newton method.


(c) If the time delay δ is an unknown integer, specify how we could estimate the parameters including δ. Generate an n = 500 point series with a = .9, σ²_w = 1 and δ = 5. Estimate the integer time delay δ by searching over δ = 3, 4, . . . , 7.

3.26 Forecasting with estimated parameters: Let x_1, x_2, . . . , x_n be a sample of size n from a causal AR(1) process, x_t = φx_{t−1} + w_t. Let φ̂ be the Yule–Walker estimator of φ.

(a) Show φ̂ − φ = O_p(n^{−1/2}). See Appendix A for the definition of O_p(·).
(b) Let x^n_{n+1} be the one-step-ahead forecast of x_{n+1} given the data x_1, . . . , x_n, based on the known parameter, φ, and let x̂^n_{n+1} be the one-step-ahead forecast when the parameter is replaced by φ̂. Show x^n_{n+1} − x̂^n_{n+1} = O_p(n^{−1/2}).

Section 3.7

3.27 Suppose

y_t = β_0 + β_1 t + · · · + β_q t^q + x_t,   β_q ≠ 0,

where x_t is stationary. First, show that ∇^k x_t is stationary for any k = 1, 2, . . . , and then show that ∇^k y_t is not stationary for k < q, but is stationary for k ≥ q.

3.28 Verify that the IMA(1,1) model given in (3.147) can be inverted and written as (3.148).

3.29 For the ARIMA(1, 1, 0) model with drift, (1 − φB)(1 − B)x_t = δ + w_t, let y_t = (1 − B)x_t = ∇x_t.

(a) Noting that y_t is AR(1), show that, for j ≥ 1,

y^n_{n+j} = δ[1 + φ + · · · + φ^{j−1}] + φ^j y_n.

(b) Use part (a) to show that, for m = 1, 2, . . . ,

x^n_{n+m} = x_n + (δ/(1 − φ)) [ m − φ(1 − φ^m)/(1 − φ) ] + (x_n − x_{n−1}) φ(1 − φ^m)/(1 − φ).

Hint: From (a), x^n_{n+j} − x^n_{n+j−1} = δ(1 − φ^j)/(1 − φ) + φ^j (x_n − x_{n−1}). Now sum both sides over j from 1 to m.
(c) Use (3.144) to find P^n_{n+m} by first showing that ψ*_0 = 1, ψ*_1 = (1 + φ), and ψ*_j − (1 + φ)ψ*_{j−1} + φψ*_{j−2} = 0 for j ≥ 2, in which case ψ*_j = (1 − φ^{j+1})/(1 − φ), for j ≥ 1. Note that, as in Example 3.36, equation (3.144) is exact here.

3.30 For the logarithm of the glacial varve data, say, x_t, presented in Example 3.32, use the first 100 observations and calculate the EWMA, x^t_{t+1}, given in (3.150) for t = 1, . . . , 100, using λ = .25, .50, and .75, and plot the EWMAs and the data superimposed on each other. Comment on the results.


Section 3.8

3.31 In Example 3.39, we presented the diagnostics for the MA(2) fit to the GNP growth rate series. Using that example as a guide, complete the diagnostics for the AR(1) fit.

3.32 Crude oil prices in dollars per barrel are in oil; see Appendix R for more details. Fit an ARIMA(p, d, q) model to the growth rate performing all necessary diagnostics. Comment.

3.33 Fit an ARIMA(p, d, q) model to the global temperature data gtemp performing all of the necessary diagnostics. After deciding on an appropriate model, forecast (with limits) the next 10 years. Comment.

3.34 One of the series collected along with particulates, temperature, and mortality described in Example 2.2 is the sulfur dioxide series, so2. Fit an ARIMA(p, d, q) model to the data, performing all of the necessary diagnostics. After deciding on an appropriate model, forecast the data into the future four time periods ahead (about one month) and calculate 95% prediction intervals for each of the four forecasts. Comment.

Section 3.9

3.35 Consider the ARIMA model

x_t = w_t + Θw_{t−2}.

(a) Identify the model using the notation ARIMA(p, d, q) × (P, D, Q)_s.
(b) Show that the series is invertible for |Θ| < 1, and find the coefficients in the representation

w_t = Σ_{k=0}^{∞} π_k x_{t−k}.

(c) Develop equations for the m-step ahead forecast, x̃_{n+m}, and its variance based on the infinite past, x_n, x_{n−1}, . . . .

3.36 Plot (or sketch) the ACF of the seasonal ARMA(0, 1) × (1, 0)_{12} model with Φ = .8 and θ = .5.

3.37 Fit a seasonal ARIMA model of your choice to the unemployment data (unemp) displayed in Figure 3.21. Use the estimated model to forecast the next 12 months.

3.38 Fit a seasonal ARIMA model of your choice to the U.S. Live Birth Series (birth). Use the estimated model to forecast the next 12 months.

3.39 Fit an appropriate seasonal ARIMA model to the log-transformed Johnson and Johnson earnings series (jj) of Example 1.1. Use the estimated model to forecast the next 4 quarters.


The following problems require supplemental material given in Appendix B.

3.40 Suppose x_t = Σ_{j=1}^{p} φ_j x_{t−j} + w_t, where φ_p ≠ 0 and w_t is white noise such that w_t is uncorrelated with {x_k; k < t}. Use the Projection Theorem to show that, for n > p, the BLP of x_{n+1} on sp{x_k, k ≤ n} is

x̂_{n+1} = Σ_{j=1}^{p} φ_j x_{n+1−j}.

3.41 Use the Projection Theorem to derive the Innovations Algorithm, Property 3.6, equations (3.77)–(3.79). Then, use Theorem B.2 to derive the m-step-ahead forecast results given in (3.80) and (3.81).

3.42 Consider the series x_t = w_t − w_{t−1}, where w_t is a white noise process with mean zero and variance σ²_w. Suppose we consider the problem of predicting x_{n+1}, based on only x_1, . . . , x_n. Use the Projection Theorem to answer the questions below.

(a) Show the best linear predictor is

x^n_{n+1} = −(1/(n + 1)) Σ_{k=1}^{n} k x_k.

(b) Prove the mean square error is

E(x_{n+1} − x^n_{n+1})² = ((n + 2)/(n + 1)) σ²_w.

3.43 Use Theorems B.2 and B.3 to verify (3.116).

3.44 Prove Theorem B.2.

3.45 Prove Property 3.2.


4

Spectral Analysis and Filtering

4.1 Introduction

The notion that a time series exhibits repetitive or regular behavior over time is of fundamental importance because it distinguishes time series analysis from classical statistics, which assumes complete independence over time. We have seen how dependence over time can be introduced through models that describe in detail the way certain empirical data behaves, even to the extent of producing forecasts based on the models. It is natural that models based on predicting the present as a regression on the past, such as are provided by the celebrated ARIMA or state-space forms, will be attractive to statisticians, who are trained to view nature in terms of linear models. In fact, the difference equations used to represent these kinds of models are simply the discrete versions of linear differential equations that may, in some instances, provide the ideal physical model for a certain phenomenon. An alternate version of the way nature behaves exists, however, and is based on a decomposition of an empirical series into its regular components.

In this chapter, we argue that the concept of regularity of a series can best be expressed in terms of periodic variations of the underlying phenomenon that produced the series, expressed as Fourier frequencies being driven by sines and cosines. Such a possibility was discussed in Chapters 1 and 2. From a regression point of view, we may imagine a system responding to various driving frequencies by producing linear combinations of sine and cosine functions. Expressed in these terms, the time domain approach may be thought of as regression of the present on the past, whereas the frequency domain approach may be considered as regression of the present on periodic sines and cosines. The frequency domain approaches are the focus of this chapter and Chapter 7. To illustrate the two methods for generating series with a single primary periodic component, consider Figure 1.9, which was generated from a simple second-order autoregressive model, and the middle and bottom panels of Figure 1.11, which were generated by adding a cosine wave with a period of 50 points to white noise. Both series exhibit strong periodic fluctuations,



illustrating that both models can generate time series with regular behavior. As discussed in Example 2.8, a fundamental objective of spectral analysis is to identify the dominant frequencies in a series and to find an explanation of the system from which the measurements were derived.

Of course, the primary justification for any alternate model must lie in its potential for explaining the behavior of some empirical phenomenon. In this sense, an explanation involving only a few kinds of primary oscillations becomes simpler and more physically meaningful than a collection of parameters estimated for some selected difference equation. It is the tendency of observed data to show periodic kinds of fluctuations that justifies the use of frequency domain methods. Many of the examples in §1.2 are time series representing real phenomena that are driven by periodic components. The speech recording of the syllable aa...hh in Figure 1.3 contains a complicated mixture of frequencies related to the opening and closing of the glottis. Figure 1.5 shows the monthly SOI, which we later explain as a combination of two kinds of periodicities, a seasonal periodic component of 12 months and an El Nino component of about three to five years. Of fundamental interest is the return period of the El Nino phenomenon, which can have profound effects on local climate. Also of interest is whether the different periodic components of the new fish population depend on corresponding seasonal and El Nino-type oscillations. We introduce the coherence as a tool for relating the common periodic behavior of two series. Seasonal periodic components are often pervasive in economic time series; this phenomenon can be seen in the quarterly earnings series shown in Figure 1.1. In Figure 1.6, we see the extent to which various parts of the brain will respond to a periodic stimulus generated by having the subject do alternate left and right finger tapping. Figure 1.7 shows series from an earthquake and a nuclear explosion. The relative amounts of energy at various frequencies for the two phases can produce statistics, useful for discriminating between earthquakes and explosions.

In this chapter, we summarize an approach to handling correlation generated in stationary time series that begins by transforming the series to the frequency domain. This simple linear transformation essentially matches sines and cosines of various frequencies against the underlying data and serves two purposes as discussed in Examples 2.8 and 2.9. The periodogram that was introduced in Example 2.9 has its population counterpart called the power spectrum, and its estimation is a main goal of spectral analysis. Another purpose of exploring this topic is statistical convenience resulting from the periodic components being nearly uncorrelated. This property facilitates writing likelihoods based on classical statistical methods.

An important part of analyzing data in the frequency domain, as well as the time domain, is the investigation and exploitation of the properties of the time-invariant linear filter. This special linear transformation is used similarly to linear regression in conventional statistics, and we use many of the same terms in the time series context. We have previously mentioned the coherence as a measure of the relation between two series at a given frequency, and


we show later that this coherence also measures the performance of the best linear filter relating the two series. Linear filtering can also be an important step in isolating a signal embedded in noise. For example, the lower panels of Figure 1.11 contain a signal contaminated with an additive noise, whereas the upper panel contains the pure signal. It might also be appropriate to ask whether a linear filter transformation exists that could be applied to the lower panel to produce a series closer to the signal in the upper panel. The use of filtering for reducing noise will also be a part of the presentation in this chapter. We emphasize, throughout, the analogy between filtering techniques and conventional linear regression.

Many frequency scales will often coexist, depending on the nature of the problem. For example, in the Johnson & Johnson data set in Figure 1.1, the predominant frequency of oscillation is one cycle per year (4 quarters), or .25 cycles per observation. The predominant frequency in the SOI and fish populations series in Figure 1.5 is also one cycle per year, but this corresponds to 1 cycle every 12 months, or .083 cycles per observation. For simplicity, we measure frequency, ω, in cycles per time point and discuss the implications of certain frequencies in terms of the problem context. Of descriptive interest is the period of a time series, defined as the number of points in a cycle, i.e., 1/ω. Hence, the predominant period of the Johnson & Johnson series is 1/.25 or 4 quarters per cycle, whereas the predominant period of the SOI series is 12 months per cycle.

4.2 Cyclical Behavior and Periodicity

As previously mentioned, we have already encountered the notion of periodicity in numerous examples in Chapters 1, 2 and 3. The general notion of periodicity can be made more precise by introducing some terminology. In order to define the rate at which a series oscillates, we first define a cycle as one complete period of a sine or cosine function defined over a unit time interval. As in (1.5), we consider the periodic process

x_t = A cos(2πωt + φ)    (4.1)

for t = 0, ±1, ±2, . . ., where ω is a frequency index, defined in cycles per unit time, with A determining the height or amplitude of the function and φ, called the phase, determining the start point of the cosine function. We can introduce random variation in this time series by allowing the amplitude and phase to vary randomly.

As discussed in Example 2.8, for purposes of data analysis, it is easier to use a trigonometric identity¹ and write (4.1) as

1 cos(α± β) = cos(α) cos(β)∓ sin(α) sin(β).


x_t = U_1 cos(2πωt) + U_2 sin(2πωt),    (4.2)

where U_1 = A cos φ and U_2 = −A sin φ are often taken to be normally distributed random variables. In this case, the amplitude is A = √(U_1² + U_2²) and the phase is φ = tan^{−1}(−U_2/U_1). From these facts we can show that if, and only if, in (4.1), A and φ are independent random variables, where A² is chi-squared with 2 degrees of freedom, and φ is uniformly distributed on (−π, π), then U_1 and U_2 are independent, standard normal random variables (see Problem 4.2).

The above random process is also a function of its frequency, defined by the parameter ω. The frequency is measured in cycles per unit time, or in cycles per point in the above illustration. For ω = 1, the series makes one cycle per time unit; for ω = .50, the series makes a cycle every two time units; for ω = .25, every four units, and so on. In general, for data that occur at discrete time points, we will need at least two points to determine a cycle, so the highest frequency of interest is .5 cycles per point. This frequency is called the folding frequency and defines the highest frequency that can be seen in discrete sampling. Higher frequencies sampled this way will appear at lower frequencies, called aliases; an example is the way a camera samples a rotating wheel on a moving automobile in a movie, in which the wheel appears to be rotating at a different rate. For example, movies are recorded at 24 frames per second. If the camera is filming a wheel that is rotating at the rate of 24 cycles per second (or 24 Hertz), the wheel will appear to stand still (that's about 110 miles per hour in case you were wondering).
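A small illustrative sketch (not from the text) of aliasing: sampled at integer time points, a cosine at .8 cycles per point is indistinguishable from one at the aliased frequency 1 − .8 = .2.
t = 1:96
plot(t, cos(2*pi*.8*t), type="o")           # appears to oscillate at .2 cycles per point
lines(t, cos(2*pi*.2*t), col=2, lty=2)      # identical at the sampled points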

Consider a generalization of (4.2) that allows mixtures of periodic series with multiple frequencies and amplitudes,

x_t = Σ_{k=1}^{q} [U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t)],    (4.3)

where U_{k1}, U_{k2}, for k = 1, 2, . . . , q, are independent zero-mean random variables with variances σ_k², and the ω_k are distinct frequencies. Notice that (4.3) exhibits the process as a sum of independent components, with variance σ_k² for frequency ω_k. Using the independence of the Us and the trig identity in footnote 1, it is easy to show² (Problem 4.3) that the autocovariance function of the process is

γ(h) = Σ_{k=1}^{q} σ_k² cos(2πω_k h),    (4.4)

and we note the autocovariance function is the sum of periodic components with weights proportional to the variances σ_k². Hence, x_t is a mean-zero stationary process with variance

2 For example, for x_t in (4.2) we have cov(x_{t+h}, x_t) = σ²{cos(2πω[t + h]) cos(2πωt) + sin(2πω[t + h]) sin(2πωt)} = σ² cos(2πωh), noting that cov(U_1, U_2) = 0.


Fig. 4.1. Periodic components and their sum as described in Example 4.1.

γ(0) = E(x_t²) = Σ_{k=1}^{q} σ_k²,    (4.5)

which exhibits the overall variance as a sum of variances of each of the component parts.

Example 4.1 A Periodic Series

Figure 4.1 shows an example of the mixture (4.3) with q = 3 constructed in the following way. First, for t = 1, . . . , 100, we generated three series

x_{t1} = 2 cos(2πt 6/100) + 3 sin(2πt 6/100)
x_{t2} = 4 cos(2πt 10/100) + 5 sin(2πt 10/100)
x_{t3} = 6 cos(2πt 40/100) + 7 sin(2πt 40/100)

These three series are displayed in Figure 4.1 along with the corresponding frequencies and squared amplitudes. For example, the squared amplitude of x_{t1} is A² = 2² + 3² = 13. Hence, the maximum and minimum values that x_{t1} will attain are ±√13 = ±3.61.

Finally, we constructed

x_t = x_{t1} + x_{t2} + x_{t3}

and this series is also displayed in Figure 4.1. We note that x_t appears to behave as some of the periodic series we saw in Chapters 1 and 2. The


systematic sorting out of the essential frequency components in a time series, including their relative contributions, constitutes one of the main objectives of spectral analysis.

The R code to reproduce Figure 4.1 is
x1 = 2*cos(2*pi*1:100*6/100) + 3*sin(2*pi*1:100*6/100)
x2 = 4*cos(2*pi*1:100*10/100) + 5*sin(2*pi*1:100*10/100)
x3 = 6*cos(2*pi*1:100*40/100) + 7*sin(2*pi*1:100*40/100)
x = x1 + x2 + x3
par(mfrow=c(2,2))
plot.ts(x1, ylim=c(-10,10), main=expression(omega==6/100~~~A^2==13))
plot.ts(x2, ylim=c(-10,10), main=expression(omega==10/100~~~A^2==41))
plot.ts(x3, ylim=c(-10,10), main=expression(omega==40/100~~~A^2==85))
plot.ts(x, ylim=c(-16,16), main="sum")

Example 4.2 The Scaled Periodogram for Example 4.1

In §2.3, Example 2.9, we introduced the periodogram as a way to discover the periodic components of a time series. Recall that the scaled periodogram is given by

P(j/n) = ( (2/n) Σ_{t=1}^{n} x_t cos(2πtj/n) )² + ( (2/n) Σ_{t=1}^{n} x_t sin(2πtj/n) )²,    (4.6)

and it may be regarded as a measure of the squared correlation of the data with sinusoids oscillating at a frequency of ω_j = j/n, or j cycles in n time points. Recall that we are basically computing the regression of the data on the sinusoids varying at the fundamental frequencies, j/n. As discussed in Example 2.9, the periodogram may be computed quickly using the fast Fourier transform (FFT), and there is no need to run repeated regressions.

The scaled periodogram of the data, x_t, simulated in Example 4.1 is shown in Figure 4.2, and it clearly identifies the three components x_{t1}, x_{t2}, and x_{t3} of x_t. Note that

P(j/n) = P(1 − j/n),   j = 0, 1, . . . , n − 1,

so there is a mirroring effect at the folding frequency of 1/2; consequently, the periodogram is typically not plotted for frequencies higher than the folding frequency. In addition, note that the heights of the scaled periodogram shown in the figure are

P(6/100) = 13,   P(10/100) = 41,   P(40/100) = 85,

and P(j/n) = P(1 − j/n), with P(j/n) = 0 otherwise. These are exactly the values of the squared amplitudes of the components generated in Example 4.1. This outcome suggests that the periodogram may provide some insight into the variance components, (4.5), of a real set of data.

Assuming the simulated data, x, were retained from the previous example, the R code to reproduce Figure 4.2 is


Fig. 4.2. Periodogram of the data generated in Example 4.1.

P = abs(2*fft(x)/100)^2; Fr = 0:99/100
plot(Fr, P, type="o", xlab="frequency", ylab="periodogram")

If we consider the data x_t in Example 4.1 as a color (waveform) made up of primary colors x_{t1}, x_{t2}, x_{t3} at various strengths (amplitudes), then we might consider the periodogram as a prism that decomposes the color x_t into its primary colors (spectrum). Hence the term spectral analysis.

Another fact that may be of use in understanding the periodogram is that for any time series sample x_1, . . . , x_n, where n is odd, we may write, exactly

x_t = a_0 + Σ_{j=1}^{(n−1)/2} [a_j cos(2πt j/n) + b_j sin(2πt j/n)],    (4.7)

for t = 1, . . . , n and suitably chosen coefficients. If n is even, the representation (4.7) can be modified by summing to (n/2 − 1) and adding an additional component given by a_{n/2} cos(2πt 1/2) = a_{n/2}(−1)^t. The crucial point here is that (4.7) is exact for any sample. Hence (4.3) may be thought of as an approximation to (4.7), the idea being that many of the coefficients in (4.7) may be close to zero. Recall from Example 2.9 that

P(j/n) = a_j² + b_j²,    (4.8)

so the scaled periodogram indicates which components in (4.7) are large in magnitude and which components are small. We also saw (4.8) in Example 4.2.

The periodogram, which was introduced in Schuster (1898) and used in Schuster (1906) for studying the periodicities in the sunspot series (shown in Figure 4.31 in the Problems section), is a sample-based statistic. In Example 4.2, we discussed the fact that the periodogram may be giving us an idea of the variance components associated with each frequency, as presented in (4.5), of a time series. These variance components, however, are population parameters. The concepts of population parameters and sample statistics, as they relate to spectral analysis of time series, can be generalized to cover stationary time series, and that is the topic of the next section.

4.3 The Spectral Density

The idea that a time series is composed of periodic components, appearing in proportion to their underlying variances, is fundamental in the spectral representation given in Theorem C.2 of Appendix C. The result is quite technical because it involves stochastic integration; that is, integration with respect to a stochastic process. The essence of Theorem C.2 is that (4.3) is approximately true for any stationary time series. In other words, we have the following.

Property 4.1 Spectral Representation of a Stationary Process
In nontechnical terms, Theorem C.2 states that any stationary time series may be thought of, approximately, as the random superposition of sines and cosines oscillating at various frequencies.

Given that (4.3) is approximately true for all stationary time series, the next question is whether a meaningful representation for its autocovariance function, like the one displayed in (4.4), also exists. The answer is yes, and this representation is given in Theorem C.1 of Appendix C. The following example will help explain the result.

Example 4.3 A Periodic Stationary Process

Consider a periodic stationary random process given by (4.2), with a fixed frequency ω_0, say,

x_t = U_1 cos(2πω_0 t) + U_2 sin(2πω_0 t),

where U_1 and U_2 are independent zero-mean random variables with equal variance σ². The number of time periods needed for the above series to complete one cycle is exactly 1/ω_0, and the process makes exactly ω_0 cycles per point for t = 0, ±1, ±2, . . .. It is easily shown that³

γ(h) = σ² cos(2πω_0 h) = (σ²/2) e^{−2πiω_0 h} + (σ²/2) e^{2πiω_0 h}
     = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

3 Some identities may be helpful here: e^{iα} = cos(α) + i sin(α) and consequently, cos(α) = (e^{iα} + e^{−iα})/2 and sin(α) = (e^{iα} − e^{−iα})/2i.


using a Riemann–Stieltjes integration, where F(ω) is the function defined by

F(ω) = 0 for ω < −ω_0,   F(ω) = σ²/2 for −ω_0 ≤ ω < ω_0,   F(ω) = σ² for ω ≥ ω_0.

The function F(ω) behaves like a cumulative distribution function for a discrete random variable, except that F(∞) = σ² = var(x_t) instead of one. In fact, F(ω) is a cumulative distribution function, not of probabilities, but rather of variances associated with the frequency ω_0 in an analysis of variance, with F(∞) being the total variance of the process x_t. Hence, we term F(ω) the spectral distribution function.

Theorem C.1 in Appendix C states that a representation such as the one given in Example 4.3 always exists for a stationary process. In particular, if x_t is stationary with autocovariance γ(h) = E[(x_{t+h} − µ)(x_t − µ)], then there exists a unique monotonically increasing function F(ω), called the spectral distribution function, that is bounded, with F(−∞) = F(−1/2) = 0, and F(∞) = F(1/2) = γ(0), such that

γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω).    (4.9)

A more important situation we use repeatedly is the one covered by Theorem C.3, where it is shown that, subject to absolute summability of the autocovariance, the spectral distribution function is absolutely continuous with dF(ω) = f(ω) dω, and the representation (4.9) becomes the motivation for the property given below.

Property 4.2 The Spectral Density
If the autocovariance function, γ(h), of a stationary process satisfies

Σ_{h=−∞}^{∞} |γ(h)| < ∞,    (4.10)

then it has the representation

γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,   h = 0, ±1, ±2, . . .    (4.11)

as the inverse transform of the spectral density, which has the representation

f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2.    (4.12)


This spectral density is the analogue of the probability density function; the fact that γ(h) is non-negative definite ensures

f(ω) ≥ 0

for all ω (see Appendix C, Theorem C.3 for details). It follows immediately from (4.12) that

f(ω) = f(−ω)   and   f(ω) = f(1 − ω),

verifying the spectral density is an even function of period one. Because of the evenness, we will typically only plot f(ω) for ω ≥ 0. In addition, putting h = 0 in (4.11) yields

γ(0) = var(x_t) = ∫_{−1/2}^{1/2} f(ω) dω,

which expresses the total variance as the integrated spectral density over all of the frequencies. We show later on that a linear filter can isolate the variance in certain frequency intervals or bands.

Analogous to probability theory, γ(h) in (4.11) is the characteristic function⁴ of the spectral density f(ω) in (4.12). These facts should make it clear that, when the conditions of Property 4.2 are satisfied, the autocovariance function, γ(h), and the spectral density function, f(ω), contain the same information. That information, however, is expressed in different ways. The autocovariance function expresses information in terms of lags, whereas the spectral density expresses the same information in terms of cycles. Some problems are easier to work with when considering lagged information and we would tend to handle those problems in the time domain. Nevertheless, other problems are easier to work with when considering periodic information and we would tend to handle those problems in the spectral domain.

We note that the autocovariance function, γ(h), in (4.11) and the spectral density, f(ω), in (4.12) are Fourier transform pairs. In particular, this means that if f(ω) and g(ω) are two spectral densities for which

γ_f(h) = ∫_{−1/2}^{1/2} f(ω) e^{2πiωh} dω = ∫_{−1/2}^{1/2} g(ω) e^{2πiωh} dω = γ_g(h)    (4.13)

for all h = 0,±1,±2, . . . , then

f(ω) = g(ω). (4.14)

We also mention, at this point, that we have been focusing on the frequency ω, expressed in cycles per point rather than the more common (in statistics)

4 If M_X(λ) = E(e^{λX}) for λ ∈ R is the moment generating function of random variable X, then ϕ_X(λ) = M_X(iλ) is the characteristic function.


alternative λ = 2πω that would give radians per point. Finally, the absolute summability condition, (4.10), is not satisfied by (4.4), the example that we have used to introduce the idea of a spectral representation. The condition, however, is satisfied for ARMA models.

It is illuminating to examine the spectral density for the series that we have looked at in earlier discussions.

Example 4.4 White Noise Series

As a simple example, consider the theoretical power spectrum of a sequence of uncorrelated random variables, w_t, with variance σ²_w. A simulated set of data is displayed in the top of Figure 1.8. Because the autocovariance function was computed in Example 1.16 as γ_w(h) = σ²_w for h = 0, and zero otherwise, it follows from (4.12) that

f_w(ω) = σ²_w

for −1/2 ≤ ω ≤ 1/2. Hence the process contains equal power at all frequencies. This property is seen in the realization, which seems to contain all different frequencies in a roughly equal mix. In fact, the name white noise comes from the analogy to white light, which contains all frequencies in the color spectrum at the same level of intensity. Figure 4.3 shows a plot of the white noise spectrum for σ²_w = 1.

If x_t is ARMA, its spectral density can be obtained explicitly using the fact that it is a linear process, i.e., x_t = Σ_{j=0}^{∞} ψ_j w_{t−j}, where Σ_{j=0}^{∞} |ψ_j| < ∞. In the following property, we exhibit the form of the spectral density of an ARMA model. The proof of the property follows directly from the proof of a more general result, Property 4.7 given on page 222, by using the additional fact that ψ(z) = θ(z)/φ(z); recall Property 3.1.

Property 4.3 The Spectral Density of ARMA
If x_t is ARMA(p, q), φ(B)x_t = θ(B)w_t, its spectral density is given by

f_x(ω) = σ²_w |θ(e^{−2πiω})|² / |φ(e^{−2πiω})|²    (4.15)

where φ(z) = 1 − Σ_{k=1}^{p} φ_k z^k and θ(z) = 1 + Σ_{k=1}^{q} θ_k z^k.
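As a quick numerical sketch of (4.15), the density can be evaluated directly in R; the values φ = .5, θ = .9, and σ²_w = 1 below are hypothetical and used only for illustration.
omega = seq(0, .5, by=.001)
z = exp(-2i*pi*omega)
f.arma = Mod(1 + .9*z)^2 / Mod(1 - .5*z)^2     # (4.15) with sigma_w^2 = 1
plot(omega, f.arma, type="l", xlab="frequency", ylab="spectrum")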

Example 4.5 Moving Average

As an example of a series that does not have an equal mix of frequencies, we consider a moving average model. Specifically, consider the MA(1) model given by

x_t = w_t + .5w_{t−1}.

A sample realization is shown in the top of Figure 3.2 and we note that the series has less of the higher or faster frequencies. The spectral density will verify this observation.


The autocovariance function is displayed in Example 3.4 on page 90, and for this particular example, we have

γ(0) = (1 + .5²)σ²_w = 1.25σ²_w;   γ(±1) = .5σ²_w;   γ(±h) = 0 for h > 1.

Substituting this directly into the definition given in (4.12), we have

f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πiωh} = σ²_w [1.25 + .5(e^{−2πiω} + e^{2πiω})]
     = σ²_w [1.25 + cos(2πω)].    (4.16)

We can also compute the spectral density using Property 4.3, which states that for an MA, f(ω) = σ²_w |θ(e^{−2πiω})|². Because θ(z) = 1 + .5z, we have

|θ(e^{−2πiω})|² = |1 + .5e^{−2πiω}|² = (1 + .5e^{−2πiω})(1 + .5e^{2πiω})
               = 1.25 + .5(e^{−2πiω} + e^{2πiω}),

which leads to agreement with (4.16).

Plotting the spectrum for σ²_w = 1, as in the middle of Figure 4.3, shows the lower or slower frequencies have greater power than the higher or faster frequencies.
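The agreement can also be checked numerically with a short sketch (σ²_w = 1 assumed):
omega = seq(0, .5, by=.01)
f1 = 1.25 + cos(2*pi*omega)                    # from (4.16)
f2 = Mod(1 + .5*exp(-2i*pi*omega))^2           # from Property 4.3
max(abs(f1 - f2))                              # essentially zero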

Example 4.6 A Second-Order Autoregressive Series

We now consider the spectrum of an AR(2) series of the form

x_t − φ_1 x_{t−1} − φ_2 x_{t−2} = w_t,

for the special case φ_1 = 1 and φ_2 = −.9. Figure 1.9 on page 14 shows a sample realization of such a process for σ_w = 1. We note the data exhibit a strong periodic component that makes a cycle about every six points.

To use Property 4.3, note that θ(z) = 1, φ(z) = 1 − z + .9z² and

|φ(e^{−2πiω})|² = (1 − e^{−2πiω} + .9e^{−4πiω})(1 − e^{2πiω} + .9e^{4πiω})
               = 2.81 − 1.9(e^{2πiω} + e^{−2πiω}) + .9(e^{4πiω} + e^{−4πiω})
               = 2.81 − 3.8 cos(2πω) + 1.8 cos(4πω).

Using this result in (4.15), we have that the spectral density of x_t is

f_x(ω) = σ²_w / [2.81 − 3.8 cos(2πω) + 1.8 cos(4πω)].

Setting σ_w = 1, the bottom of Figure 4.3 displays f_x(ω) and shows a strong power component at about ω = .16 cycles per point, or a period between six and seven points per cycle, and very little power at other frequencies. In this case, modifying the white noise series by applying the second-order AR


Fig. 4.3. Theoretical spectra of white noise (top), a first-order moving average (middle), and a second-order autoregressive process (bottom).

operator has concentrated the power or variance of the resulting series in a very narrow frequency band.

The spectral density can also be obtained from first principles, without having to use Property 4.3. Because w_t = x_t − x_{t−1} + .9x_{t−2} in this example, we have

γ_w(h) = cov(w_{t+h}, w_t)
       = cov(x_{t+h} − x_{t+h−1} + .9x_{t+h−2}, x_t − x_{t−1} + .9x_{t−2})
       = 2.81γ_x(h) − 1.9[γ_x(h + 1) + γ_x(h − 1)] + .9[γ_x(h + 2) + γ_x(h − 2)].

Now, substituting the spectral representation (4.11) for γ_x(h) in the above equation yields

γ_w(h) = ∫_{−1/2}^{1/2} [2.81 − 1.9(e^{2πiω} + e^{−2πiω}) + .9(e^{4πiω} + e^{−4πiω})] e^{2πiωh} f_x(ω) dω
       = ∫_{−1/2}^{1/2} [2.81 − 3.8 cos(2πω) + 1.8 cos(4πω)] e^{2πiωh} f_x(ω) dω.


If the spectrum of the white noise process, w_t, is g_w(ω), the uniqueness of the Fourier transform allows us to identify

g_w(ω) = [2.81 − 3.8 cos(2πω) + 1.8 cos(4πω)] f_x(ω).

But, as we have already seen, g_w(ω) = σ²_w, from which we deduce that

f_x(ω) = σ²_w / [2.81 − 3.8 cos(2πω) + 1.8 cos(4πω)]

is the spectrum of the autoregressive series.
To reproduce Figure 4.3, use the spec.arma script (see §R.1):

par(mfrow=c(3,1))
spec.arma(log="no", main="White Noise")
spec.arma(ma=.5, log="no", main="Moving Average")
spec.arma(ar=c(1,-.9), log="no", main="Autoregression")

The above examples motivate the use of the power spectrum for describing the theoretical variance fluctuations of a stationary time series. Indeed, the interpretation of the spectral density function as the variance of the time series over a given frequency band gives us the intuitive explanation for its physical meaning. The plot of the function f(ω) over the frequency argument ω can even be thought of as an analysis of variance, in which the columns or block effects are the frequencies, indexed by ω.

Example 4.7 Every Explosion has a Cause (cont)

In Example 3.3, we discussed the fact that explosive models have causal counterparts. In that example, we also indicated that it was easier to show this result in general in the spectral domain. In this example, we give the details for an AR(1) model, but the techniques used here will indicate how to generalize the result.

As in Example 3.3, we suppose that x_t = 2x_{t−1} + w_t, where w_t ∼ iid N(0, σ²_w). Then, the spectral density of x_t is

f_x(ω) = σ²_w |1 − 2e^{−2πiω}|^{−2}.    (4.17)

But,

|1 − 2e^{−2πiω}| = |1 − 2e^{2πiω}| = |(2e^{2πiω})((1/2)e^{−2πiω} − 1)| = 2 |1 − (1/2)e^{−2πiω}|.

Thus, (4.17) can be written as

f_x(ω) = (1/4)σ²_w |1 − (1/2)e^{−2πiω}|^{−2},

which implies that x_t = (1/2)x_{t−1} + v_t, with v_t ∼ iid N(0, (1/4)σ²_w), is an equivalent form of the model.
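The equality of the two spectra can be verified numerically with a brief sketch (σ²_w = 1 assumed):
omega = seq(0, .5, by=.01)
f.explosive = Mod(1 - 2*exp(-2i*pi*omega))^(-2)         # (4.17)
f.causal    = .25 * Mod(1 - .5*exp(-2i*pi*omega))^(-2)  # causal counterpart
max(abs(f.explosive - f.causal))                        # zero up to rounding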


4.4 Periodogram and Discrete Fourier Transform

We are now ready to tie together the periodogram, which is the sample-based concept presented in §4.2, with the spectral density, which is the population-based concept of §4.3.

Definition 4.1 Given data x_1, . . . , x_n, we define the discrete Fourier transform (DFT) to be

d(ω_j) = n^{−1/2} Σ_{t=1}^{n} x_t e^{−2πiω_j t}    (4.18)

for j = 0, 1, . . . , n − 1, where the frequencies ω_j = j/n are called the Fourier or fundamental frequencies.

If n is a highly composite integer (i.e., it has many factors), the DFT can be computed by the fast Fourier transform (FFT) introduced in Cooley and Tukey (1965). Also, different packages scale the FFT differently, so it is a good idea to consult the documentation. R computes the DFT defined in (4.18) without the factor n^{−1/2}, but with an additional factor of e^{2πiω_j} that can be ignored because we will be interested in the squared modulus of the DFT. Sometimes it is helpful to exploit the inversion result for DFTs which shows the linear transformation is one-to-one. For the inverse DFT we have,

x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}    (4.19)

for t = 1, . . . , n. The following example shows how to calculate the DFT and its inverse in R for the data set {1, 2, 3, 4}; note that R writes a complex number z = a + ib as a+bi.

(dft = fft(1:4)/sqrt(4))
[1] 5+0i -1+1i -1+0i -1-1i
(idft = fft(dft, inverse=TRUE)/sqrt(4))
[1] 1+0i 2+0i 3+0i 4+0i
(Re(idft)) # keep it real
[1] 1 2 3 4

We now define the periodogram as the squared modulus⁵ of the DFT.

Definition 4.2 Given data x_1, . . . , x_n, we define the periodogram to be

I(ω_j) = |d(ω_j)|²    (4.20)

for j = 0, 1, 2, . . . , n − 1.

5 Recall that if z = a + ib, then z̄ = a − ib, and |z|² = z z̄ = a² + b².


Note that I(0) = n x̄², where x̄ is the sample mean. In addition, because Σ_{t=1}^{n} exp(−2πit j/n) = 0 for j ≠ 0,⁶ we can write the DFT as

d(ω_j) = n^{−1/2} Σ_{t=1}^{n} (x_t − x̄) e^{−2πiω_j t}    (4.21)

for j ≠ 0. Thus, for j ≠ 0,

I(ω_j) = |d(ω_j)|² = n^{−1} Σ_{t=1}^{n} Σ_{s=1}^{n} (x_t − x̄)(x_s − x̄) e^{−2πiω_j(t−s)}
       = n^{−1} Σ_{h=−(n−1)}^{n−1} Σ_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(x_t − x̄) e^{−2πiω_j h}
       = Σ_{h=−(n−1)}^{n−1} γ̂(h) e^{−2πiω_j h}    (4.22)

where we have put h = t − s, with γ̂(h) as given in (1.34).⁷

Recall, P(ω_j) = (4/n)I(ω_j) where P(ω_j) is the scaled periodogram defined in (4.6). Henceforth we will work with I(ω_j) instead of P(ω_j). In view of (4.22), the periodogram, I(ω_j), is the sample version of f(ω_j) given in (4.12). That is, we may think of the periodogram as the “sample spectral density” of x_t.
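The relation (4.22) can be checked numerically; the following sketch (illustrative only) compares the periodogram computed from the FFT with the sum over the sample autocovariances.
set.seed(1)
x = rnorm(8); n = length(x)
I.fft = abs(fft(x - mean(x)))^2 / n                        # |d(omega_j)|^2, j = 0,...,n-1
g = acf(x, lag.max = n-1, type = "covariance", plot = FALSE)$acf[,,1]
gam = c(rev(g[-1]), g); h = -(n-1):(n-1)                   # sample autocovariances, |h| < n
I.sum = sapply((1:(n-1))/n, function(w) Re(sum(gam*exp(-2i*pi*w*h))))
cbind(I.fft[-1], I.sum)                                    # the two columns agree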

It is sometimes useful to work with the real and imaginary parts of the DFT individually. To this end, we define the following transforms.

Definition 4.3 Given data x_1, . . . , x_n, we define the cosine transform

d_c(ω_j) = n^{−1/2} Σ_{t=1}^{n} x_t cos(2πω_j t)    (4.23)

and the sine transform

d_s(ω_j) = n^{−1/2} Σ_{t=1}^{n} x_t sin(2πω_j t)    (4.24)

where ω_j = j/n for j = 0, 1, . . . , n − 1.

We note that d(ω_j) = d_c(ω_j) − i d_s(ω_j) and hence

I(ω_j) = d_c²(ω_j) + d_s²(ω_j).    (4.25)

We have also discussed the fact that spectral analysis can be thought of as an analysis of variance. The next example examines this notion.

6 Σ_{t=1}^{n} z^t = z(1 − z^n)/(1 − z) for z ≠ 1.
7 Note that (4.22) can be used to obtain γ̂(h) by taking the inverse DFT of I(ω_j). This approach was used in Example 1.27 to obtain a two-dimensional ACF.


Example 4.8 Spectral ANOVA

Let x_1, . . . , x_n be a sample of size n, where for ease, n is odd. Then, recalling Example 2.9 on page 67 and the discussion around (4.7) and (4.8),

x_t = a_0 + Σ_{j=1}^{m} [a_j cos(2πω_j t) + b_j sin(2πω_j t)],    (4.26)

where m = (n − 1)/2, is exact for t = 1, . . . , n. In particular, using multiple regression formulas, we have a_0 = x̄,

a_j = (2/n) Σ_{t=1}^{n} x_t cos(2πω_j t) = (2/√n) d_c(ω_j)

b_j = (2/n) Σ_{t=1}^{n} x_t sin(2πω_j t) = (2/√n) d_s(ω_j).

Hence, we may write

(x_t − x̄) = (2/√n) Σ_{j=1}^{m} [d_c(ω_j) cos(2πω_j t) + d_s(ω_j) sin(2πω_j t)]

for t = 1, . . . , n. Squaring both sides and summing we obtain

Σ_{t=1}^{n} (x_t − x̄)² = 2 Σ_{j=1}^{m} [d_c²(ω_j) + d_s²(ω_j)] = 2 Σ_{j=1}^{m} I(ω_j)

using the results of Problem 2.10(d) on page 81. Thus, we have partitioned the sum of squares into harmonic components represented by frequency ω_j with the periodogram, I(ω_j), being the mean square regression. This leads to the ANOVA table for n odd:

Source   df      SS          MS
ω_1      2       2I(ω_1)     I(ω_1)
ω_2      2       2I(ω_2)     I(ω_2)
...      ...     ...         ...
ω_m      2       2I(ω_m)     I(ω_m)
Total    n − 1   Σ_{t=1}^{n} (x_t − x̄)²

This decomposition means that if the data contain some strong periodic components, the periodogram values corresponding to those frequencies (or near those frequencies) will be large. On the other hand, the corresponding values of the periodogram will be small for periodic components not present in the data.

The following is an R example to help explain this concept. We consider n = 5 observations given by x_1 = 1, x_2 = 2, x_3 = 3, x_4 = 2, x_5 = 1. Note that


the data complete one cycle, but not in a sinusoidal way. Thus, we should expect the ω_1 = 1/5 component to be relatively large but not exhaustive, and the ω_2 = 2/5 component to be small.

x = c(1, 2, 3, 2, 1)
c1 = cos(2*pi*1:5*1/5); s1 = sin(2*pi*1:5*1/5)
c2 = cos(2*pi*1:5*2/5); s2 = sin(2*pi*1:5*2/5)
omega1 = cbind(c1, s1); omega2 = cbind(c2, s2)
anova(lm(x~omega1+omega2))   # ANOVA Table
             Df  Sum Sq Mean Sq
  omega1      2 2.74164 1.37082
  omega2      2  .05836  .02918
  Residuals   0  .00000
abs(fft(x))^2/5              # the periodogram (as a check)
  [1] 16.2 1.37082 .029179 .029179 1.37082
  #   I(0)  I(1/5)  I(2/5)  I(3/5)  I(4/5)

Note that x̄ = 1.8, and I(0) = 16.2 = 5 × 1.8² (= n x̄²). Also, note that

I(1/5) = 1.37082 = Mean Sq(ω_1) and I(2/5) = .02918 = Mean Sq(ω_2)

and I(j/5) = I(1 − j/5), for j = 3, 4. Finally, we note that the sum of squares associated with the residuals (SSE) is zero, indicating an exact fit.

We are now ready to present some large sample properties of the periodogram. First, let µ be the mean of a stationary process x_t with absolutely summable autocovariance function γ(h) and spectral density f(ω). We can use the same argument as in (4.22), replacing x̄ by µ in (4.21), to write

I(ω_j) = n^{−1} Σ_{h=−(n−1)}^{n−1} Σ_{t=1}^{n−|h|} (x_{t+|h|} − µ)(x_t − µ) e^{−2πi ω_j h}   (4.27)

where ω_j is a non-zero fundamental frequency. Taking expectation in (4.27) we obtain

E[I(ω_j)] = Σ_{h=−(n−1)}^{n−1} ((n − |h|)/n) γ(h) e^{−2πi ω_j h}.   (4.28)

For any given ω ≠ 0, choose a sequence of fundamental frequencies ω_{j:n} → ω,⁸ from which it follows by (4.28) that, as n → ∞,⁹

E[I(ω_{j:n})] → f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πihω}.   (4.29)

⁸ By this we mean ω_{j:n} = j_n/n, where {j_n} is a sequence of integers chosen so that j_n/n is the closest Fourier frequency to ω; consequently, |j_n/n − ω| ≤ 1/(2n).

⁹ From Definition 4.2 we have I(0) = n x̄², so the analogous result of (4.29) for the case ω = 0 is E[I(0)] − nµ² = n var(x̄) → f(0) as n → ∞.


In other words, under absolute summability of γ(h), the spectral density is the long-term average of the periodogram.

To examine the asymptotic distribution of the periodogram, we note that if x_t is a normal time series, the sine and cosine transforms will also be jointly normal, because they are linear combinations of the jointly normal random variables x_1, x_2, . . . , x_n. In that case, the assumption that the covariance function satisfies the condition

θ = Σ_{h=−∞}^{∞} |h| |γ(h)| < ∞   (4.30)

is enough to obtain simple large sample approximations for the variances and covariances. Using the same argument used to develop (4.28) we have

cov[d_c(ω_j), d_c(ω_k)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) cos(2π ω_j s) cos(2π ω_k t),   (4.31)

cov[d_c(ω_j), d_s(ω_k)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) cos(2π ω_j s) sin(2π ω_k t),   (4.32)

and

cov[d_s(ω_j), d_s(ω_k)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) sin(2π ω_j s) sin(2π ω_k t),   (4.33)

where the variance terms are obtained by setting ω_j = ω_k in (4.31) and (4.33). In Appendix C, §C.2, we show the terms in (4.31)-(4.33) have interesting properties under assumption (4.30), namely, for ω_j, ω_k ≠ 0 or 1/2,

cov[d_c(ω_j), d_c(ω_k)] = { f(ω_j)/2 + ε_n,  ω_j = ω_k;   ε_n,  ω_j ≠ ω_k },   (4.34)

cov[d_s(ω_j), d_s(ω_k)] = { f(ω_j)/2 + ε_n,  ω_j = ω_k;   ε_n,  ω_j ≠ ω_k },   (4.35)

and

cov[d_c(ω_j), d_s(ω_k)] = ε_n,   (4.36)

where the error term ε_n in the approximations can be bounded,

|ε_n| ≤ θ/n,   (4.37)

and θ is given by (4.30). If ω_j = ω_k = 0 or 1/2 in (4.34), the multiplier 1/2 disappears; note that d_s(0) = d_s(1/2) = 0, so (4.35) does not apply.


Example 4.9 Covariance of Sine and Cosine Transforms

For the three-point moving average series of Example 1.9 and n = 256 observations, the theoretical covariance matrix of the vector d = (d_c(ω_26), d_s(ω_26), d_c(ω_27), d_s(ω_27))′ is

cov(d) = [  .3752  −.0009  −.0022  −.0010
           −.0009   .3777  −.0009   .0003
           −.0022  −.0009   .3667  −.0010
           −.0010   .0003  −.0010   .3692 ].

The diagonal elements can be compared with half the theoretical spectral values of (1/2) f(ω_26) = .3774 for the spectrum at frequency ω_26 = 26/256, and of (1/2) f(ω_27) = .3689 for the spectrum at ω_27 = 27/256. Hence, the cosine and sine transforms produce nearly uncorrelated variables with variances approximately equal to one half of the theoretical spectrum. For this particular case, the uniform bound is determined from θ = 8/9, yielding |ε_256| ≤ .0035 for the bound on the approximation error.

If x_t ∼ iid(0, σ²), then it follows from (4.30)-(4.36), Problem 2.10(d), and a central limit theorem¹⁰ that

d_c(ω_{j:n}) ∼ AN(0, σ²/2)  and  d_s(ω_{j:n}) ∼ AN(0, σ²/2)   (4.38)

jointly and independently, and independent of d_c(ω_{k:n}) and d_s(ω_{k:n}) provided ω_{j:n} → ω_1 and ω_{k:n} → ω_2 where 0 < ω_1 ≠ ω_2 < 1/2. We note that in this case, f_x(ω) = σ². In view of (4.38), it follows immediately that as n → ∞,

2 I(ω_{j:n})/σ² →_d χ²_2  and  2 I(ω_{k:n})/σ² →_d χ²_2   (4.39)

with I(ω_{j:n}) and I(ω_{k:n}) being asymptotically independent, where χ²_ν denotes a chi-squared random variable with ν degrees of freedom.

Using the central limit theory of §C.2, it is fairly easy to extend the results of the iid case to the case of a linear process.

Property 4.4 Distribution of the Periodogram Ordinates
If

x_t = Σ_{j=−∞}^{∞} ψ_j w_{t−j},   Σ_{j=−∞}^{∞} |ψ_j| < ∞,   (4.40)

where w_t ∼ iid(0, σ²_w), and (4.30) holds, then for any collection of m distinct frequencies ω_j ∈ (0, 1/2) with ω_{j:n} → ω_j,

¹⁰ If Y_j ∼ iid(0, σ²) and {a_j} are constants for which Σ_{j=1}^{n} a_j² / max_{1≤j≤n} a_j² → ∞ as n → ∞, then Σ_{j=1}^{n} a_j Y_j ∼ AN(0, σ² Σ_{j=1}^{n} a_j²). AN is read asymptotically normal and is explained in Definition A.5; convergence in distribution (→_d) is explained in Definition A.4.


2 I(ω_{j:n}) / f(ω_j) →_d iid χ²_2   (4.41)

provided f(ω_j) > 0, for j = 1, . . . , m.

This result is stated more precisely in Theorem C.7 of §C.3. Other approaches to large sample normality of the periodogram ordinates are in terms of cumulants, as in Brillinger (1981), or in terms of mixing conditions, such as in Rosenblatt (1956a). Here, we adopt the approach used by Hannan (1970), Fuller (1996), and Brockwell and Davis (1991).

The distributional result (4.41) can be used to derive an approximate confidence interval for the spectrum in the usual way. Let χ²_ν(α) denote the lower α probability tail for the chi-squared distribution with ν degrees of freedom; that is,

Pr{χ²_ν ≤ χ²_ν(α)} = α.   (4.42)

Then, an approximate 100(1 − α)% confidence interval for the spectral density function would be of the form

2 I(ω_{j:n}) / χ²_2(1 − α/2) ≤ f(ω) ≤ 2 I(ω_{j:n}) / χ²_2(α/2).   (4.43)

Often, nonstationary trends are present that should be eliminated before computing the periodogram. Trends introduce extremely low frequency components in the periodogram that tend to obscure the appearance at higher frequencies. For this reason, it is usually conventional to center the data prior to a spectral analysis using either mean-adjusted data of the form x_t − x̄ to eliminate the zero or d-c component or to use detrended data of the form x_t − β̂_1 − β̂_2 t to eliminate the term that will be considered a half cycle by the spectral analysis. Note that higher order polynomial regressions in t or nonparametric smoothing (linear filtering) could be used in cases where the trend is nonlinear.
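To make the two forms of centering concrete, the following is a minimal R sketch that mean-adjusts and linearly detrends the SOI series before computing raw periodograms; any other series could be substituted. (In current versions of R, spec.pgram also has demean and detrend arguments that do this internally; they are switched off here because the preprocessing is done by hand.)

xc = soi - mean(soi)                       # mean-adjusted data: removes the zero (d-c) component
fit = lm(soi ~ time(soi))                  # fit a straight line beta1 + beta2*t
xd = ts(resid(fit), start=start(soi), frequency=frequency(soi))   # linearly detrended data
par(mfrow=c(2,1))
spec.pgram(xc, taper=0, detrend=FALSE, log="no")   # periodogram of the centered series
spec.pgram(xd, taper=0, detrend=FALSE, log="no")   # periodogram of the detrended series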

As previously indicated, it is often convenient to calculate the DFTs, and hence the periodogram, using the fast Fourier transform algorithm. The FFT utilizes a number of redundancies in the calculation of the DFT when n is highly composite; that is, an integer with many factors of 2, 3, or 5, the best case being when n = 2^p is a power of 2. Details may be found in Cooley and Tukey (1965). To accommodate this property, we can pad the centered (or detrended) data of length n to the next highly composite integer n′ by adding zeros, i.e., setting x^c_{n+1} = x^c_{n+2} = · · · = x^c_{n′} = 0, where x^c_t denotes the centered data. This means that the fundamental frequency ordinates will be ω_j = j/n′ instead of j/n. We illustrate by considering the periodogram of the SOI and Recruitment series, as has been given in Figure 1.5 of Chapter 1. Recall that they are monthly series and n = 453 months. To find n′ in R, use the command nextn(453) to see that n′ = 480 will be used in the spectral analyses by default [use help(spec.pgram) to see how to override this default].

Fig. 4.4. Periodogram of SOI and Recruitment, n = 453 (n′ = 480), where the frequency axis is labeled in multiples of ∆ = 1/12. Note the common peaks at ω = 1∆ = 1/12, or one cycle per year (12 months), and ω = (1/4)∆ = 1/48, or one cycle every four years (48 months).

Example 4.10 Periodogram of SOI and Recruitment Series

Figure 4.4 shows the periodograms of each series, where the frequency axis is labeled in multiples of ∆ = 1/12. As previously indicated, the centered data have been padded to a series of length 480. We notice a narrow-band peak at the obvious yearly (12 month) cycle, ω = 1∆ = 1/12. In addition, there is considerable power in a wide band at the lower frequencies that is centered around the four-year (48 month) cycle ω = (1/4)∆ = 1/48, representing a possible El Nino effect. This wide band activity suggests that the possible El Nino cycle is irregular, but tends to be around four years on average. We will continue to address this problem as we move to more sophisticated analyses.

Noting χ²_2(.025) = .05 and χ²_2(.975) = 7.38, we can obtain approximate 95% confidence intervals for the frequencies of interest. For example, the periodogram of the SOI series is I_S(1/12) = .97 at the yearly cycle. An approximate 95% confidence interval for the spectrum f_S(1/12) is then


[2(.97)/7.38, 2(.97)/.05] = [.26, 38.4],

which is too wide to be of much use. We do notice, however, that the lower value of .26 is higher than any other periodogram ordinate, so it is safe to say that this value is significant. On the other hand, an approximate 95% confidence interval for the spectrum at the four-year cycle, f_S(1/48), is

[2(.05)/7.38, 2(.05)/.05] = [.01, 2.12],

which again is extremely wide, and with which we are unable to establish significance of the peak.

We now give the R commands that can be used to reproduce Figure 4.4. To calculate and graph the periodogram, we used the spec.pgram command in R. We note that the value of ∆ is the reciprocal of the value of frequency used in ts() when making the data a time series object. If the data are not time series objects, frequency is set to 1. Also, we set log="no" because R will plot the periodogram on a log10 scale by default. Figure 4.4 displays a bandwidth and by default, R tapers the data (which we override in the commands below). We will discuss bandwidth and tapering in the next section, so ignore these concepts for the time being.

par(mfrow=c(2,1))
soi.per = spec.pgram(soi, taper=0, log="no")
abline(v=1/4, lty="dotted")
rec.per = spec.pgram(rec, taper=0, log="no")
abline(v=1/4, lty="dotted")

The confidence intervals for the SOI series at the yearly cycle, ω = 1/12 = 40/480, and the possible El Nino cycle of four years, ω = 1/48 = 10/480, can be computed in R as follows:

soi.per$spec[40]        # 0.97223; soi pgram at freq 1/12 = 40/480
soi.per$spec[10]        # 0.05372; soi pgram at freq 1/48 = 10/480
# conf intervals - returned value:
U = qchisq(.025, 2)     # 0.05063
L = qchisq(.975, 2)     # 7.37775
2*soi.per$spec[10]/L    # 0.01456
2*soi.per$spec[10]/U    # 2.12220
2*soi.per$spec[40]/L    # 0.26355
2*soi.per$spec[40]/U    # 38.40108

The example above makes it clear that the periodogram as an estimator is susceptible to large uncertainties, and we need to find a way to reduce the variance. Not surprisingly, this result follows if we think about the periodogram, I(ω_j), as an estimator of the spectral density f(ω) and realize that it is the sum of squares of only two random variables for any sample size. The solution to this dilemma is suggested by the analogy with classical statistics where we look for independent random variables with the same variance and average the squares of these common variance observations. Independence and equality of variance do not hold in the time series case, but the covariance


structure of the two adjacent estimators given in Example 4.9 suggests that for neighboring frequencies, these assumptions are approximately true.

4.5 Nonparametric Spectral Estimation

To continue the discussion that ended the previous section, we introduce a frequency band, B, of L ≪ n contiguous fundamental frequencies, centered around frequency ω_j = j/n, which is chosen close to a frequency of interest, ω. For frequencies of the form ω* = ω_j + k/n, let

B = { ω* : ω_j − m/n ≤ ω* ≤ ω_j + m/n },   (4.44)

where

L = 2m + 1   (4.45)

is an odd number, chosen such that the spectral values in the interval B,

f(ω_j + k/n),  k = −m, . . . , 0, . . . , m,

are approximately equal to f(ω). This structure can be realized for large sample sizes, as shown formally in §C.2.

We now define an averaged (or smoothed) periodogram as the average of the periodogram values, say,

f̄(ω) = (1/L) Σ_{k=−m}^{m} I(ω_j + k/n),   (4.46)

over the band B. Under the assumption that the spectral density is fairly constant in the band B, and in view of (4.41), we can show that under appropriate conditions,¹¹ for large n, the periodograms in (4.46) are approximately distributed as independent f(ω) χ²_2 / 2 random variables, for 0 < ω < 1/2, as long as we keep L fairly small relative to n. This result is discussed formally in §C.2. Thus, under these conditions, L f̄(ω) is the sum of L approximately independent f(ω) χ²_2 / 2 random variables. It follows that, for large n,

2L f̄(ω) / f(ω) ·∼ χ²_{2L}   (4.47)

where ·∼ means "is approximately distributed as."
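The band average in (4.46) can also be computed by hand, which may help fix ideas. The sketch below forms the raw periodogram of the SOI series with the FFT and averages L = 9 neighboring ordinates with a circular moving average; the frequency axis here is in cycles per observation, so it will not match the ∆-scaled axis of the figures, and the result will differ slightly from spec.pgram because no padding is done.

x = soi - mean(soi); n = length(x)
per = Mod(fft(x))^2/n                        # raw periodogram I(omega_j), j = 0, ..., n-1
L = 9
fbar = stats::filter(per, rep(1/L, L), circular=TRUE)   # simple band average, as in (4.46)
freq = (0:(n-1))/n
keep = 2:floor(n/2)                          # plot only 0 < omega < 1/2
plot(freq[keep], fbar[keep], type="l", xlab="frequency", ylab="averaged periodogram")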

In this scenario, where we smooth the periodogram by simple averaging, it seems reasonable to call the width of the frequency interval defined by (4.44),

¹¹ The conditions, which are sufficient, are that x_t is a linear process, as described in Property 4.4, with Σ_j √|j| |ψ_j| < ∞, and w_t has a finite fourth moment.


B_w = L/n,   (4.48)

the bandwidth.¹² The concept of the bandwidth, however, becomes more complicated with the introduction of spectral estimators that smooth with unequal weights. Note (4.48) implies the degrees of freedom can be expressed as

2L = 2 B_w n,   (4.49)

or twice the time-bandwidth product. The result (4.47) can be rearranged to obtain an approximate 100(1 − α)% confidence interval of the form

2L f̄(ω) / χ²_{2L}(1 − α/2) ≤ f(ω) ≤ 2L f̄(ω) / χ²_{2L}(α/2)   (4.50)

for the true spectrum, f(ω).

Many times, the visual impact of a spectral density plot will be improved by plotting the logarithm of the spectrum instead of the spectrum (the log transformation is the variance stabilizing transformation in this situation). This phenomenon can occur when regions of the spectrum exist with peaks of interest much smaller than some of the main power components. For the log spectrum, we obtain an interval of the form

[ log f̄(ω) + log 2L − log χ²_{2L}(1 − α/2),  log f̄(ω) + log 2L − log χ²_{2L}(α/2) ].   (4.51)

We can also test hypotheses relating to the equality of spectra using the fact that the distributional result (4.47) implies that the ratio of spectra based on roughly independent samples will have an approximate F_{2L,2L} distribution. The independent estimators can either be from different frequency bands or from different series.
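As a hedged illustration of this F comparison, the sketch below recomputes the L = 9 averaged periodogram of the SOI series (as in Example 4.11 below) and compares the estimates in two well-separated frequency bands, the yearly band and the El Nino band, using the adjusted degrees of freedom from (4.52); estimates at widely separated frequencies are approximately independent, which is what the F approximation requires. The choice of bands and the simplified degrees of freedom are only for illustration.

soi.ave = spec.pgram(soi, kernel("daniell", 4), taper=0, plot=FALSE)  # L = 2(4)+1 = 9
df = 2*9*453/480                            # adjusted degrees of freedom, (4.52)
Fstat = soi.ave$spec[40]/soi.ave$spec[10]   # yearly band (40/480) versus El Nino band (10/480)
Fstat                                       # observed ratio of the two estimates
qf(.975, df, df)                            # upper critical value of the approximate F distribution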

If zeros are appended before computing the spectral estimators, we need to adjust the degrees of freedom and an approximation is to replace 2L by 2Ln/n′. Hence, we define the adjusted degrees of freedom as

df = 2Ln / n′   (4.52)

¹² The bandwidth value used in R is based on Grenander (1951). The basic idea is that bandwidth can be related to the standard deviation of the weighting distribution. For the uniform distribution on the frequency range −m/n to m/n, the standard deviation is L/(n√12) (using a continuity correction). Consequently, in the case of (4.46), R will report a bandwidth of L/(n√12), which amounts to dividing our definition by √12. Note that in the extreme case L = n, we would have B_w = 1, indicating that everything was used in the estimation; in this case, R would report a bandwidth of 1/√12. There are many definitions of bandwidth and an excellent discussion may be found in Percival and Walden (1993, §6.7).


and use it instead of 2L in the confidence intervals (4.50) and (4.51). For example, (4.50) becomes

df f̄(ω) / χ²_{df}(1 − α/2) ≤ f(ω) ≤ df f̄(ω) / χ²_{df}(α/2).   (4.53)

A number of assumptions are made in computing the approximate confidence intervals given above, which may not hold in practice. In such cases, it may be reasonable to employ resampling techniques such as one of the parametric bootstraps proposed by Hurvich and Zeger (1987) or a nonparametric local bootstrap proposed by Paparoditis and Politis (1999). To develop the bootstrap distributions, we assume that the contiguous DFTs in a frequency band of the form (4.44) all came from a time series with identical spectrum f(ω). This, in fact, is exactly the same assumption made in deriving the large-sample theory. We may then simply resample the L DFTs in the band, with replacement, calculating a spectral estimate from each bootstrap sample. The sampling distribution of the bootstrap estimators approximates the distribution of the nonparametric spectral estimator. For further details, including the theoretical properties of such estimators, see Paparoditis and Politis (1999).
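A bare-bones version of this resampling idea can be sketched in R as follows; it resamples periodogram ordinates (rather than the DFTs themselves) within a band of L = 9 fundamental frequencies and uses the bootstrap band averages to approximate the sampling distribution of the estimator. The band location, number of replicates, and seed are arbitrary, and this is only a schematic illustration of the assumption described above, not the full procedure of Paparoditis and Politis (1999).

x = soi - mean(soi); n = length(x)
per = Mod(fft(x))^2/n                         # raw periodogram
j = 38; m = 4                                 # band of L = 2m+1 = 9 ordinates near j/n (close to 1/12 here)
band = per[(j - m + 1):(j + m + 1)]           # I(omega) for omega = (j-m)/n, ..., (j+m)/n
set.seed(1)
fstar = replicate(500, mean(sample(band, replace=TRUE)))  # bootstrap band averages
quantile(fstar, c(.025, .975))                # a rough bootstrap interval for f(omega)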

Before proceeding further, we pause to consider computing the averaged periodograms for the SOI and Recruitment series, as shown in Figure 4.5.

Example 4.11 Averaged Periodogram for SOI and Recruitment

Generally, it is a good idea to try several bandwidths that seem to be compatible with the general overall shape of the spectrum, as suggested by the periodogram. The SOI and Recruitment series periodograms, previously computed in Figure 4.4, suggest the power in the lower El Nino frequency needs smoothing to identify the predominant overall period. Trying values of L leads to the choice L = 9 as a reasonable value, and the result is displayed in Figure 4.5. In our notation, the bandwidth in this case is B_w = 9/480 = .01875 cycles per month for the spectral estimator. This bandwidth means we are assuming a relatively constant spectrum over about .01875/.5 = 3.75% of the entire frequency interval (0, 1/2). To obtain the bandwidth, B_w = .01875, from the one reported by R in Figure 4.5, we can multiply .065∆ (the frequency scale is in increments of ∆) by √12 as discussed in footnote 12 on page 197.

The smoothed spectra shown in Figure 4.5 provide a sensible compromise between the noisy version, shown in Figure 4.4, and a more heavily smoothed spectrum, which might lose some of the peaks. An undesirable effect of averaging can be noticed at the yearly cycle, ω = 1∆, where the narrow band peaks that appeared in the periodograms in Figure 4.4 have been flattened and spread out to nearby frequencies. We also notice, and have marked, the appearance of harmonics of the yearly cycle, that is, frequencies of the form ω = k∆ for k = 1, 2, . . . . Harmonics typically occur when a periodic component is present, but not in a sinusoidal fashion; see Example 4.12.

Fig. 4.5. The averaged periodogram of the SOI and Recruitment series, n = 453, n′ = 480, L = 9, df = 17, showing common peaks at the four year period, ω = (1/4)∆ = 1/48 cycles/month, the yearly period, ω = 1∆ = 1/12 cycles/month, and some of its harmonics ω = k∆ for k = 2, 3.

Figure 4.5 can be reproduced in R using the following commands. The basic call is to the function spec.pgram. To compute averaged periodograms, use the Daniell kernel, and specify m, where L = 2m + 1 (L = 9 and m = 4 in this example). We will explain the kernel concept later in this section, specifically just prior to Example 4.13.

par(mfrow=c(2,1))
k = kernel("daniell", 4)
soi.ave = spec.pgram(soi, k, taper=0, log="no")
abline(v=c(.25,1,2,3), lty=2)
# Repeat the spec.pgram and abline calls above with rec in place of soi
soi.ave$bandwidth                    # 0.0649519 = reported bandwidth
soi.ave$bandwidth*(1/12)*sqrt(12)    # 0.01875 = Bw

The adjusted degrees of freedom are df = 2(9)(453)/480 ≈ 17. We can use this value for the 95% confidence intervals, with χ²_{df}(.025) = 7.56 and χ²_{df}(.975) = 30.17. Substituting into (4.53) gives the intervals in Table 4.1 for the two frequency bands identified as having the maximum power.

Fig. 4.6. Figure 4.5 with the average periodogram ordinates plotted on a log10 scale. The display in the upper right-hand corner represents a generic 95% confidence interval.

To examine the two peak power possibilities, we may look at the 95% confidence intervals and see whether the lower limits are substantially larger than adjacent baseline spectral levels. For example, the El Nino frequency of 48 months has lower limits that exceed the values the spectrum would have if there were simply a smooth underlying spectral function without the peaks. The relative distribution of power over frequencies is different, with the SOI having less power at the lower frequency, relative to the seasonal periods, and the recruit series having relatively more power at the lower or El Nino frequency.

The entries in Table 4.1 for SOI can be obtained in R as follows:

df = soi.ave$df          # df = 16.9875 (returned values)
U = qchisq(.025, df)     # U = 7.555916
L = qchisq(.975, df)     # L = 30.17425
soi.ave$spec[10]         # 0.0495202
soi.ave$spec[40]         # 0.1190800
# intervals
df*soi.ave$spec[10]/L    # 0.0278789
df*soi.ave$spec[10]/U    # 0.1113333


Table 4.1. Confidence Intervals for the Spectra of the SOI and Recruitment Series

Series          ω      Period    Power   Lower   Upper
SOI             1/48   4 years    .05     .03     .11
                1/12   1 year     .12     .07     .27
Recruits ×10²   1/48   4 years   6.59    3.71   14.82
                1/12   1 year    2.19    1.24    4.93

df*soi.ave$spec[40]/L    # 0.0670396
df*soi.ave$spec[40]/U    # 0.2677201
# repeat the above commands with soi replaced by rec

Finally, Figure 4.6 shows the averaged periodograms in Figure 4.5 plotted on a log10 scale. This is the default plot in R, and these graphs can be obtained by removing the statement log="no" in the spec.pgram call. Notice that the default plot also shows a generic confidence interval of the form (4.51) (with log replaced by log10) in the upper right-hand corner. To use it, imagine placing the tick mark on the averaged periodogram ordinate of interest; the resulting bar then constitutes an approximate 95% confidence interval for the spectrum at that frequency. We note that displaying the estimates on a log scale tends to emphasize the harmonic components.

Example 4.12 Harmonics

In the previous example, we saw that the spectra of the annual signals displayed minor peaks at the harmonics; that is, the signal spectra had a large peak at ω = 1∆ = 1/12 cycles/month (the one-year cycle) and minor peaks at its harmonics ω = k∆ for k = 2, 3, . . . (two-, three-, and so on, cycles per year). This will often be the case because most signals are not perfect sinusoids (or perfectly cyclic). In this case, the harmonics are needed to capture the non-sinusoidal behavior of the signal. As an example, consider the signal formed in Figure 4.7 from a (fundamental) sinusoid oscillating at two cycles per unit time along with the second through sixth harmonics at decreasing amplitudes. In particular, the signal was formed as

x_t = sin(2π2t) + .5 sin(2π4t) + .4 sin(2π6t) + .3 sin(2π8t) + .2 sin(2π10t) + .1 sin(2π12t)   (4.54)

for 0 ≤ t ≤ 1. Notice that the signal is non-sinusoidal in appearance and rises quickly then falls slowly.

A figure similar to Figure 4.7 can be generated in R as follows.

t = seq(0, 1, by=1/200)
amps = c(1, .5, .4, .3, .2, .1)
x = matrix(0, 201, 6)
for (j in 1:6) x[,j] = amps[j]*sin(2*pi*t*2*j)
x = ts(cbind(x, rowSums(x)), start=0, deltat=1/200)
ts.plot(x, lty=c(1:6, 1), lwd=c(rep(1,6), 2), ylab="Sinusoids")
names = c("Fundamental", "2nd Harmonic", "3rd Harmonic", "4th Harmonic",
          "5th Harmonic", "6th Harmonic", "Formed Signal")
legend("topright", names, lty=c(1:6, 1), lwd=c(rep(1,6), 2))

Fig. 4.7. A signal (thick solid line) formed by a fundamental sinusoid (thin solid line) oscillating at two cycles per unit time and its harmonics as specified in (4.54).

Example 4.11 points out the necessity for having some relatively systematic procedure for deciding whether peaks are significant. The question of deciding whether a single peak is significant usually rests on establishing what we might think of as a baseline level for the spectrum, defined rather loosely as the shape that one would expect to see if no spectral peaks were present. This profile can usually be guessed by looking at the overall shape of the spectrum that includes the peaks; usually, a kind of baseline level will be apparent, with the peaks seeming to emerge from this baseline level. If the lower confidence limit for the spectral value is still greater than the baseline level at some predetermined level of significance, we may claim that frequency value as a statistically significant peak. To be consistent with our stated indifference to the upper limits, we might use a one-sided confidence interval.

An important aspect of interpreting the significance of confidence intervals and tests involving spectra is that typically, more than one frequency will be of interest, so that we will potentially be interested in simultaneous statements about a whole collection of frequencies. For example, it would be unfair to claim in Table 4.1 the two frequencies of interest as being statistically significant and all other potential candidates as nonsignificant at the overall level of α = .05. In this case, we follow the usual statistical approach, noting that if K statements S_1, S_2, . . . , S_K are made at significance level α, i.e.,


P{S_k} = 1 − α, then the overall probability that all statements are true satisfies the Bonferroni inequality

P{all S_k true} ≥ 1 − Kα.   (4.55)

For this reason, it is desirable to set the significance level for testing each frequency at α/K if there are K potential frequencies of interest. If, a priori, potentially K = 10 frequencies are of interest, setting α = .01 for each test would give an overall significance bound of .10.

The use of the confidence intervals and the necessity for smoothing requires that we make a decision about the bandwidth B_w over which the spectrum will be essentially constant. Taking too broad a band will tend to smooth out valid peaks in the data when the constant variance assumption is not met over the band. Taking too narrow a band will lead to confidence intervals so wide that peaks are no longer statistically significant. Thus, we note that there is a conflict here between variance properties, or bandwidth stability, which can be improved by increasing B_w, and resolution, which can be improved by decreasing B_w. A common approach is to try a number of different bandwidths and to look qualitatively at the spectral estimators for each case.

To address the problem of resolution, it should be evident that the flattening of the peaks in Figures 4.5 and 4.6 was due to the fact that simple averaging was used in computing f̄(ω) defined in (4.46). There is no particular reason to use simple averaging, and we might improve the estimator by employing a weighted average, say

f̂(ω) = Σ_{k=−m}^{m} h_k I(ω_j + k/n),   (4.56)

using the same definitions as in (4.46) but where the weights h_k > 0 satisfy

Σ_{k=−m}^{m} h_k = 1.

In particular, it seems reasonable that the resolution of the estimator will improve if we use weights that decrease as distance from the center weight h_0 increases; we will return to this idea shortly. To obtain the averaged periodogram, f̄(ω), in (4.56), set h_k = L^{−1}, for all k, where L = 2m + 1. The asymptotic theory established for f̄(ω) still holds for f̂(ω) provided that the weights satisfy the additional condition that if m → ∞ as n → ∞ but m/n → 0, then

Σ_{k=−m}^{m} h_k² → 0.

Under these conditions, as n → ∞,

(i) E( f̂(ω) ) → f(ω)


(ii) ( Σ_{k=−m}^{m} h_k² )^{−1} cov( f̂(ω), f̂(λ) ) → f²(ω) for ω = λ ≠ 0, 1/2.

In (ii), replace f²(ω) by 0 if ω ≠ λ and by 2f²(ω) if ω = λ = 0 or 1/2.

We have already seen these results in the case of f̄(ω), where the weights are constant, h_k = L^{−1}, in which case Σ_{k=−m}^{m} h_k² = L^{−1}. The distributional properties of (4.56) are more difficult now because f̂(ω) is a weighted linear combination of asymptotically independent χ² random variables. An approximation that seems to work well is to replace L by ( Σ_{k=−m}^{m} h_k² )^{−1}. That is, define

L_h = ( Σ_{k=−m}^{m} h_k² )^{−1}   (4.57)

and use the approximation¹³

2 L_h f̂(ω) / f(ω) ·∼ χ²_{2L_h}.   (4.58)

In analogy to (4.48), we will define the bandwidth in this case to be

B_w = L_h / n.   (4.59)

Using the approximation (4.58) we obtain an approximate 100(1 − α)% confidence interval of the form

2 L_h f̂(ω) / χ²_{2L_h}(1 − α/2) ≤ f(ω) ≤ 2 L_h f̂(ω) / χ²_{2L_h}(α/2)   (4.60)

for the true spectrum, f(ω). If the data are padded to n′, then replace 2L_h in (4.60) with df = 2L_h n/n′ as in (4.52).

An easy way to generate the weights in R is by repeated use of the Daniell kernel. For example, with m = 1 and L = 2m + 1 = 3, the Daniell kernel has weights {h_k} = {1/3, 1/3, 1/3}; applying this kernel to a sequence of numbers, {u_t}, produces

û_t = (1/3) u_{t−1} + (1/3) u_t + (1/3) u_{t+1}.

We can apply the same kernel again to the û_t,

ũ_t = (1/3) û_{t−1} + (1/3) û_t + (1/3) û_{t+1},

which simplifies to

ũ_t = (1/9) u_{t−2} + (2/9) u_{t−1} + (3/9) u_t + (2/9) u_{t+1} + (1/9) u_{t+2}.

¹³ The approximation proceeds as follows: If f̂ ·∼ c χ²_ν, where c is a constant, then E f̂ ≈ cν and var f̂ ≈ f² Σ_k h_k² ≈ c² 2ν. Solving, c ≈ f Σ_k h_k² / 2 = f / (2L_h) and ν ≈ 2 ( Σ_k h_k² )^{−1} = 2L_h.

Fig. 4.8. Smoothed spectral estimates of the SOI and Recruitment series; see Example 4.13 for details.

The modified Daniell kernel puts half weights at the end points, so with m = 1 the weights are {h_k} = {1/4, 2/4, 1/4} and

û_t = (1/4) u_{t−1} + (1/2) u_t + (1/4) u_{t+1}.

Applying the same kernel again to the û_t yields

ũ_t = (1/16) u_{t−2} + (4/16) u_{t−1} + (6/16) u_t + (4/16) u_{t+1} + (1/16) u_{t+2}.

These coefficients can be obtained in R by issuing the kernel command. For example, kernel("modified.daniell", c(1,1)) would produce the coefficients of the last example. It is also possible to use different values of m, e.g., try kernel("modified.daniell", c(1,2)) or kernel("daniell", c(5,3)). The other kernels that are currently available in R are the Dirichlet kernel and the Fejer kernel, which we will discuss shortly.

Example 4.13 Smoothed Periodogram for SOI and Recruitment

In this example, we estimate the spectra of the SOI and Recruitment series using the smoothed periodogram estimate in (4.56). We used a modified Daniell kernel twice, with m = 3 both times. This yields L_h = 1 / Σ_{k=−m}^{m} h_k² = 9.232, which is close to the value of L = 9 used in Example 4.11. In this case, the bandwidth is B_w = 9.232/480 = .019 and the modified degrees of freedom is df = 2 L_h 453/480 = 17.43. The weights, h_k, can be obtained and graphed in R as follows:

kernel("modified.daniell", c(3,3))
  coef[-6] = 0.006944 = coef[ 6]
  coef[-5] = 0.027778 = coef[ 5]
  coef[-4] = 0.055556 = coef[ 4]
  coef[-3] = 0.083333 = coef[ 3]
  coef[-2] = 0.111111 = coef[ 2]
  coef[-1] = 0.138889 = coef[ 1]
  coef[ 0] = 0.152778
plot(kernel("modified.daniell", c(3,3)))   # not shown

The resulting spectral estimates can be viewed in Figure 4.8, and we notice that the estimates are more appealing than those in Figure 4.5. Figure 4.8 was generated in R as follows; we also show how to obtain df and B_w.

par(mfrow=c(2,1))
k = kernel("modified.daniell", c(3,3))
soi.smo = spec.pgram(soi, k, taper=0, log="no")
abline(v=1, lty="dotted"); abline(v=1/4, lty="dotted")
# Repeat the spec.pgram and abline calls above with rec replacing soi
df = soi.smo$df             # df = 17.42618
Lh = 1/sum(k[-k$m:k$m]^2)   # Lh = 9.232413
Bw = Lh/480                 # Bw = 0.01923419

The bandwidth reported by R is .063, which is approximately B_w / (√12 ∆), where ∆ = 1/12 in this example. Reissuing the spec.pgram commands with log="no" removed will result in a figure similar to Figure 4.6. Finally, we mention that R uses the modified Daniell kernel by default. For example, an easier way to obtain soi.smo is to issue the command:

soi.smo = spectrum(soi, spans=c(7,7), taper=0)

Notice that spans is a vector of odd integers, given in terms of L = 2m + 1 instead of m. These values give the widths of the modified Daniell smoother to be used to smooth the periodogram.

We are now ready to briefly introduce the concept of tapering; a more detailed discussion may be found in Bloomfield (2000, §9.5). Suppose x_t is a mean-zero, stationary process with spectral density f_x(ω). If we replace the original series by the tapered series

y_t = h_t x_t,   (4.61)

for t = 1, 2, . . . , n, use the modified DFT

d_y(ω_j) = n^{−1/2} Σ_{t=1}^{n} h_t x_t e^{−2πi ω_j t},   (4.62)

and let I_y(ω_j) = |d_y(ω_j)|², we obtain (see Problem 4.15)


E[I_y(ω_j)] = ∫_{−1/2}^{1/2} W_n(ω_j − ω) f_x(ω) dω   (4.63)

where

W_n(ω) = |H_n(ω)|²   (4.64)

and

H_n(ω) = n^{−1/2} Σ_{t=1}^{n} h_t e^{−2πiωt}.   (4.65)

The value W_n(ω) is called a spectral window because, in view of (4.63), it is determining which part of the spectral density f_x(ω) is being "seen" by the estimator I_y(ω_j) on average. In the case that h_t = 1 for all t, I_y(ω_j) = I_x(ω_j) is simply the periodogram of the data and the window is

W_n(ω) = sin²(nπω) / [n sin²(πω)]   (4.66)

with W_n(0) = n, which is known as the Fejer or modified Bartlett kernel. If we consider the averaged periodogram in (4.46), namely

f̄_x(ω) = (1/L) Σ_{k=−m}^{m} I_x(ω_j + k/n),

the window, W_n(ω), in (4.63) will take the form

W_n(ω) = (1/nL) Σ_{k=−m}^{m} sin²[nπ(ω + k/n)] / sin²[π(ω + k/n)].   (4.67)

Tapers generally have a shape that enhances the center of the data relative to the extremities, such as a cosine bell of the form

h_t = .5 [ 1 + cos( 2π(t − t̄)/n ) ],   (4.68)

where t̄ = (n + 1)/2, favored by Blackman and Tukey (1959). In Figure 4.9, we have plotted the shapes of two windows, W_n(ω), for n = 480 and L = 9, when (i) h_t ≡ 1, in which case (4.67) applies, and (ii) h_t is the cosine taper in (4.68). In both cases the predicted bandwidth should be B_w = 9/480 = .01875 cycles per point, which corresponds to the "width" of the windows shown in Figure 4.9. Both windows produce an integrated average spectrum over this band but the untapered window in the top panels shows considerable ripples over the band and outside the band. The ripples outside the band are called sidelobes and tend to introduce frequencies from outside the interval that may contaminate the desired spectral estimate within the band. For example, a large dynamic range for the values in the spectrum introduces spectra in contiguous frequency intervals several orders of magnitude greater than the value in the interval of interest. This effect is sometimes called leakage. Figure 4.9 emphasizes the suppression of the sidelobes in the Fejer kernel when a cosine taper is used.
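The cosine bell is simple to construct directly. The following minimal sketch computes the taper weights in (4.68) for the centered SOI series, forms the tapered series (4.61), and computes the corresponding modified periodogram as in (4.62); it is meant only to show the mechanics, since the taper argument of spec.pgram used in Example 4.14 applies a similar split cosine bell internally.

x = soi - mean(soi); n = length(x); t = 1:n
h = .5*(1 + cos(2*pi*(t - (n+1)/2)/n))   # cosine bell taper (4.68), largest in the middle of the data
y = h*x                                  # tapered series, as in (4.61)
Iy = Mod(fft(y))^2/n                     # modified periodogram |d_y(omega_j)|^2, cf. (4.62)
par(mfrow=c(2,1))
plot.ts(h, ylab="taper weight")          # shape of the taper
plot.ts(y, ylab="tapered SOI")           # tapered data: the extremes are downweighted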

Fig. 4.9. Averaged Fejer window (top row) and the corresponding cosine taper window (bottom row) for L = 9, n = 480. The extra tic marks on the horizontal axis of the left-hand plots exhibit the predicted bandwidth, B_w = 9/480 = .01875.

Example 4.14 The Effect of Tapering the SOI Series

In this example, we examine the effect of tapering on the estimate of the spectrum of the SOI series. The results for the Recruitment series are similar. Figure 4.10 shows two spectral estimates plotted on a log scale. The degree of smoothing here is the same as in Example 4.13. The dashed line in Figure 4.10 shows the estimate without any tapering and hence it is the same as the estimated spectrum displayed in the top of Figure 4.8. The solid line shows the result with full tapering. Notice that the tapered spectrum does a better job in separating the yearly cycle (ω = 1) and the El Nino cycle (ω = 1/4).

The following R session was used to generate Figure 4.10. We note that, by default, R tapers 10% of each end of the data and leaves the middle 80% of the data alone. To instruct R not to taper, we must specify taper=0. For full tapering, we use the argument taper=.5 to instruct R to taper 50% of each end of the data.

Fig. 4.10. Smoothed spectral estimates of the SOI without tapering (dashed line) and with full tapering (solid line); see Example 4.14 for details.

s0 = spectrum(soi, spans=c(7,7), taper=0, plot=FALSE)
s50 = spectrum(soi, spans=c(7,7), taper=.5, plot=FALSE)
plot(s0$freq, s0$spec, log="y", type="l", lty=2, ylab="spectrum", xlab="frequency")  # dashed line
lines(s50$freq, s50$spec)   # solid line

We close this section with a brief discussion of lag window estimators. First, consider the periodogram, I(ω_j), which was shown in (4.22) to be

I(ω_j) = Σ_{|h|<n} γ̂(h) e^{−2πi ω_j h}.

Thus, (4.56) can be written as

f̂(ω) = Σ_{|k|≤m} h_k I(ω_j + k/n)
     = Σ_{|k|≤m} h_k Σ_{|h|<n} γ̂(h) e^{−2πi(ω_j + k/n)h}
     = Σ_{|h|<n} g(h/n) γ̂(h) e^{−2πi ω_j h},   (4.69)

where g(h/n) = Σ_{|k|≤m} h_k exp(−2πikh/n). Equation (4.69) suggests estimators of the form

f̃(ω) = Σ_{|h|≤r} w(h/r) γ̂(h) e^{−2πiωh}   (4.70)


where w(·) is a weight function, called the lag window, that satisfies

(i) w(0) = 1,
(ii) |w(x)| ≤ 1 and w(x) = 0 for |x| > 1,
(iii) w(x) = w(−x).

Note that if w(x) = 1 for |x| < 1 and r = n, then f̃(ω_j) = I(ω_j), the periodogram. This result indicates the problem with the periodogram as an estimator of the spectral density is that it gives too much weight to the values of γ̂(h) when h is large, and hence is unreliable [e.g., there is only one pair of observations used in the estimate γ̂(n − 1), and so on]. The smoothing window is defined to be

W(ω) = Σ_{h=−r}^{r} w(h/r) e^{−2πiωh},   (4.71)

and it determines which part of the periodogram will be used to form the estimate of f(ω). The asymptotic theory for f̂(ω) holds for f̃(ω) under the same conditions and provided r → ∞ as n → ∞ but with r/n → 0. We have

E{ f̃(ω) } → f(ω),   (4.72)

(n/r) cov( f̃(ω), f̃(λ) ) → f²(ω) ∫_{−1}^{1} w²(x) dx,   ω = λ ≠ 0, 1/2.   (4.73)

In (4.73), replace f²(ω) by 0 if ω ≠ λ and by 2f²(ω) if ω = λ = 0 or 1/2.

Many authors have developed various windows and Brillinger (2001, Ch 3) and Brockwell and Davis (1991, Ch 10) are good sources of detailed information on this topic. We mention a few.

The rectangular lag window, which gives uniform weight in (4.70),

w(x) = 1,  |x| ≤ 1,

corresponds to the Dirichlet smoothing window given by

W(ω) = sin[(2πr + π)ω] / sin(πω).   (4.74)

This smoothing window takes on negative values, which may lead to estimates of the spectral density that are negative at various frequencies. Using (4.73) in this case, for large n we have

var{ f̃(ω) } ≈ (2r/n) f²(ω).

The Parzen lag window is defined to be

w(x) = 1 − 6x² + 6|x|³,   |x| < 1/2,
     = 2(1 − |x|)³,        1/2 ≤ |x| ≤ 1,
     = 0,                  otherwise.


This leads to an approximate smoothing window of

W(ω) = (6/πr³) sin⁴(rω/4) / sin⁴(ω/2).

For large n, the variance of the estimator is approximately

var{ f̃(ω) } ≈ .539 r f²(ω)/n.

The Tukey-Hanning lag window has the form

w(x) = (1/2)(1 + cos(πx)),  |x| ≤ 1,

which leads to the smoothing window

W(ω) = (1/4) D_r(2πω − π/r) + (1/2) D_r(2πω) + (1/4) D_r(2πω + π/r),

where D_r(ω) is the Dirichlet kernel in (4.74). The approximate large sample variance of the estimator is

var{ f̃(ω) } ≈ (3r/4n) f²(ω).

The triangular lag window, also known as the Bartlett or Fejer window, given by

w(x) = 1 − |x|,  |x| ≤ 1,

leads to the Fejer smoothing window:

W(ω) = sin²(πrω) / [r sin²(πω)].

In this case, (4.73) yields

var{ f̃(ω) } ≈ (2r/3n) f²(ω).

The idealized rectangular smoothing window, also called the Daniell window, is given by

W(ω) = r,  |ω| ≤ 1/2r,
     = 0,  otherwise,

and leads to the sinc lag window, namely

w(x) = sin(πx) / (πx),  |x| ≤ 1.

From (4.73) we have

var{ f̃(ω) } ≈ (r/n) f²(ω).


For lag window estimators, the width of the idealized rectangular window that leads to the same asymptotic variance as a given lag window estimator is sometimes called the equivalent bandwidth. For example, the bandwidth of the idealized rectangular window is b_r = 1/r and the asymptotic variance is (1/(n b_r)) f². The asymptotic variance of the triangular window is (2r/3n) f², so setting (1/(n b_r)) f² = (2r/3n) f² and solving we get b_r = 3/(2r) as the equivalent bandwidth.
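To make the lag window idea concrete, here is a hedged R sketch of (4.70) using the Parzen window and the sample autocovariances of the SOI series; the truncation point r = 48 is an arbitrary illustrative choice, and the result can be compared informally with the smoothed periodograms obtained earlier in this section.

x = soi; n = length(x); r = 48                         # r chosen only for illustration
gh = acf(x, lag.max=r, type="covariance", plot=FALSE)$acf[,,1]   # gamma-hat(0), ..., gamma-hat(r)
parzen = function(u) ifelse(abs(u) < .5, 1 - 6*u^2 + 6*abs(u)^3,
                     ifelse(abs(u) <= 1, 2*(1 - abs(u))^3, 0))
w = parzen((0:r)/r)                                    # lag window weights w(h/r)
freq = seq(0, .5, length=200)
ftilde = sapply(freq, function(om)                     # (4.70), using gamma-hat(h) = gamma-hat(-h)
           gh[1] + 2*sum(w[-1]*gh[-1]*cos(2*pi*om*(1:r))))
plot(freq, ftilde, type="l", xlab="frequency", ylab="spectrum")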

4.6 Parametric Spectral Estimation

The methods of §4.5 lead to estimators generally referred to as nonparametric spectra because no assumption is made about the parametric form of the spectral density. In Property 4.3, we exhibited the spectrum of an ARMA process and we might consider basing a spectral estimator on this function, substituting the parameter estimates from an ARMA(p, q) fit on the data into the formula for the spectral density f_x(ω) given in (4.15). Such an estimator is called a parametric spectral estimator. For convenience, a parametric spectral estimator is obtained by fitting an AR(p) to the data, where the order p is determined by one of the model selection criteria, such as AIC, AICc, and BIC, defined in (2.19)-(2.21). Parametric autoregressive spectral estimators will often have superior resolution in problems when several closely spaced narrow spectral peaks are present and are preferred by engineers for a broad variety of problems (see Kay, 1988). The development of autoregressive spectral estimators has been summarized by Parzen (1983).

If φ̂_1, φ̂_2, . . . , φ̂_p and σ̂²_w are the estimates from an AR(p) fit to x_t, then, based on Property 4.3, a parametric spectral estimate of f_x(ω) is attained by substituting these estimates into (4.15), that is,

f̂_x(ω) = σ̂²_w / |φ̂(e^{−2πiω})|²,   (4.75)

where

φ̂(z) = 1 − φ̂_1 z − φ̂_2 z² − · · · − φ̂_p z^p.   (4.76)
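The estimator (4.75) can be evaluated directly from any AR fit. The sketch below fits an AR(15) to the SOI series by Yule-Walker (order 15 matches the BIC choice in Example 4.15 below) and computes the implied spectrum on a grid of frequencies; in practice the R function spec.ar, used in Example 4.15, automates this calculation.

fit = ar(soi, order.max=15, aic=FALSE)     # Yule-Walker AR(15) fit, for illustration
phi = fit$ar; sig2 = fit$var.pred          # phi-hat and sigma-hat^2_w
freq = seq(0, .5, length=500)
fhat = sapply(freq, function(om) {
         phiz = 1 - sum(phi*exp(-2i*pi*om*(1:length(phi))))   # phi-hat(e^{-2 pi i omega}), (4.76)
         sig2/Mod(phiz)^2 })               # parametric estimate (4.75)
plot(freq, fhat, type="l", xlab="frequency", ylab="spectrum")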

The asymptotic distribution of the autoregressive spectral estimator has been obtained by Berk (1974) under the conditions p → ∞, p³/n → 0 as p, n → ∞, which may be too severe for most applications. The limiting results imply a confidence interval of the form

f̂_x(ω) / (1 + C z_{α/2}) ≤ f_x(ω) ≤ f̂_x(ω) / (1 − C z_{α/2}),   (4.77)

where C = √(2p/n) and z_{α/2} is the ordinate corresponding to the upper α/2 probability of the standard normal distribution. If the sampling distribution is to be checked, we suggest applying the bootstrap estimator to get the sampling distribution of f̂_x(ω) using a procedure similar to the one used for p = 1 in


Example 3.35. An alternative for higher order autoregressive series is to put the AR(p) in state-space form and use the bootstrap procedure discussed in §6.7.

An interesting fact about rational spectra of the form (4.15) is that any spectral density can be approximated, arbitrarily closely, by the spectrum of an AR process.

Property 4.5 AR Spectral Approximation
Let g(ω) be the spectral density of a stationary process. Then, given ε > 0, there is a time series with the representation

x_t = Σ_{k=1}^{p} φ_k x_{t−k} + w_t,

where w_t is white noise with variance σ²_w, such that

|f_x(ω) − g(ω)| < ε  for all ω ∈ [−1/2, 1/2].

Moreover, p is finite and the roots of φ(z) = 1 − Σ_{k=1}^{p} φ_k z^k are outside the unit circle.

One drawback of the property is that it does not tell us how large p must be before the approximation is reasonable; in some situations p may be extremely large. Property 4.5 also holds for MA and for ARMA processes in general, and a proof of the result may be found in Fuller (1996, Ch 4). We demonstrate the technique in the following example.

Example 4.15 Autoregressive Spectral Estimator for SOI

Consider obtaining results comparable to the nonparametric estimators shown in Figure 4.5 for the SOI series. Fitting successively higher order AR(p) models for p = 1, 2, . . . , 30 yields a minimum BIC at p = 15 and a minimum AIC at p = 16, as shown in Figure 4.11. We can see from Figure 4.11 that BIC is very definite about which model it chooses; that is, the minimum BIC is very distinct. On the other hand, it is not clear what is going to happen with AIC; that is, the minimum is not so clear, and there is some concern that AIC will start decreasing after p = 30. Minimum AICc selects the p = 15 model, but suffers from the same uncertainty as AIC. The spectra of the two cases are almost identical, as shown in Figure 4.12, and we note the strong peaks at 52 months and 12 months corresponding to the nonparametric estimators obtained in §4.5. In addition, the harmonics of the yearly period are evident in the estimated spectrum.

To perform a similar analysis in R, the command spec.ar can be used to fit the best model via AIC and plot the resulting spectrum. A quick way to obtain the AIC values is to run the ar command as follows.

spaic = spec.ar(soi, log="no", ylim=c(0,.3))              # min AIC spec
text(frequency(soi)*1/52, .07, substitute(omega==1/52))   # El Nino cycle
text(frequency(soi)*1/12, .29, substitute(omega==1/12))   # yearly cycle
sp16 = spec.ar(soi, order=16, log="no", plot=F)
lines(sp16$freq, sp16$spec, lty="dashed")                 # ar16 spec
(soi.ar = ar(soi, order.max=30))                          # estimates and AICs
dev.new()
plot(1:30, soi.ar$aic[-1], type="o")                      # plot AICs

Fig. 4.11. Model selection criteria AIC and BIC as a function of order p for autoregressive models fitted to the SOI series.

R works only with the AIC in this case. To generate Figure 4.11 we used the following code to obtain AIC, AICc, and BIC. Because AIC and AICc are nearly identical in this example, we only graphed AIC and BIC+1; we added 1 to the BIC to reduce white space in the graphic.

n = length(soi)
AIC = rep(0, 30) -> AICc -> BIC
for (k in 1:30){
  fit = ar(soi, order=k, aic=FALSE)
  sigma2 = var(fit$resid, na.rm=TRUE)
  BIC[k] = log(sigma2) + (k*log(n)/n)
  AICc[k] = log(sigma2) + ((n+k)/(n-k-2))
  AIC[k] = log(sigma2) + ((n+2*k)/n) }
IC = cbind(AIC, BIC+1)
ts.plot(IC, type="o", xlab="p", ylab="AIC / BIC")
text(15, -1.5, "AIC"); text(15, -1.38, "BIC")

Finally, it should be mentioned that any parametric spectrum, say f(ω; θ), depending on the vector parameter θ can be estimated via the Whittle likelihood (Whittle, 1961), using the approximate properties of the discrete Fourier transform derived in Appendix C.

Fig. 4.12. Autoregressive spectral estimators for the SOI series using models selected by AIC (p = 16, solid line) and by BIC and AICc (p = 15, dashed line). The first peak corresponds to the El Nino period of 52 months.

We have that the DFTs, d(ω_j), are approximately complex normally distributed with mean zero and variance f(ω_j; θ) and are approximately independent for ω_j ≠ ω_k. This implies that an approximate log likelihood can be written in the form

ln L(x; θ) ≈ − Σ_{0<ω_j<1/2} ( ln f_x(ω_j; θ) + |d(ω_j)|² / f_x(ω_j; θ) ),   (4.78)

where the sum is sometimes expanded to include the frequencies ω_j = 0, 1/2. If the form with the two additional frequencies is used, the multiplier of the sum will be unity, except for the purely real points at ω_j = 0, 1/2 for which the multiplier is 1/2. For a discussion of applying the Whittle approximation to the problem of estimating parameters in an ARMA spectrum, see Anderson (1978). The Whittle likelihood is especially useful for fitting long memory models that will be discussed in Chapter 5.
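To show how (4.78) might be used in practice, the following sketch fits an AR(1) spectrum to the SOI periodogram by numerically maximizing the Whittle likelihood; the AR(1) form, starting values, and optimizer settings are arbitrary choices made only to keep the illustration short.

x = soi - mean(soi); n = length(x)
per = Mod(fft(x))^2/n                              # periodogram ordinates
j = 1:floor((n-1)/2); omega = j/n; I = per[j+1]    # frequencies with 0 < omega_j < 1/2
negloglik = function(par) {                        # par = (phi, log sigma_w^2), AR(1) spectrum
  phi = par[1]; sig2 = exp(par[2])
  f = sig2/Mod(1 - phi*exp(-2i*pi*omega))^2
  sum(log(f) + I/f) }                              # minus the Whittle log likelihood (4.78)
fit = optim(c(.5, log(var(x))), negloglik, method="BFGS")
c(phi=fit$par[1], sig2w=exp(fit$par[2]))           # approximate Whittle estimates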


4.7 Multiple Series and Cross-Spectra

The notion of analyzing frequency fluctuations using classical statistical ideas extends to the case in which there are several jointly stationary series, for example, x_t and y_t. In this case, we can introduce the idea of a correlation indexed by frequency, called the coherence. The results in Appendix C, §C.2, imply the covariance function

γ_{xy}(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)]

has the representation

γ_{xy}(h) = ∫_{−1/2}^{1/2} f_{xy}(ω) e^{2πiωh} dω,  h = 0, ±1, ±2, . . . ,   (4.79)

where the cross-spectrum is defined as the Fourier transform

f_{xy}(ω) = Σ_{h=−∞}^{∞} γ_{xy}(h) e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2,   (4.80)

assuming that the cross-covariance function is absolutely summable, as was the case for the autocovariance. The cross-spectrum is generally a complex-valued function, and it is often written as¹⁴

f_{xy}(ω) = c_{xy}(ω) − i q_{xy}(ω),   (4.81)

where

c_{xy}(ω) = Σ_{h=−∞}^{∞} γ_{xy}(h) cos(2πωh)   (4.82)

and

q_{xy}(ω) = Σ_{h=−∞}^{∞} γ_{xy}(h) sin(2πωh)   (4.83)

are defined as the cospectrum and quadspectrum, respectively. Because of the relationship γ_{yx}(h) = γ_{xy}(−h), it follows, by substituting into (4.80) and rearranging, that

f_{yx}(ω) = f̄_{xy}(ω).   (4.84)

This result, in turn, implies that the cospectrum and quadspectrum satisfy

c_{yx}(ω) = c_{xy}(ω)   (4.85)

and

q_{yx}(ω) = −q_{xy}(ω).   (4.86)

¹⁴ For this section, it will be useful to recall the facts e^{−iα} = cos(α) − i sin(α) and, if z = a + ib, then z̄ = a − ib.


An important example of the application of the cross-spectrum is to the problem of predicting an output series y_t from some input series x_t through a linear filter relation such as the three-point moving average considered below. A measure of the strength of such a relation is the squared coherence function, defined as

ρ²_{y·x}(ω) = |f_{yx}(ω)|² / [f_{xx}(ω) f_{yy}(ω)],   (4.87)

where f_{xx}(ω) and f_{yy}(ω) are the individual spectra of the x_t and y_t series, respectively. Although we consider a more general form of this that applies to multiple inputs later, it is instructive to display the single input case as (4.87) to emphasize the analogy with conventional squared correlation, which takes the form

ρ²_{yx} = σ²_{yx} / (σ²_x σ²_y),

for random variables with variances σ²_x and σ²_y and covariance σ_{yx} = σ_{xy}. This motivates the interpretation of squared coherence as the squared correlation between two time series at frequency ω.

Example 4.16 Three-Point Moving Average

As a simple example, we compute the cross-spectrum between xt and thethree-point moving average yt = (xt−1+xt+xt+1)/3, where xt is a stationaryinput process with spectral density fxx(ω). First,

γxy(h) = cov(xt+h, yt) = (1/3) cov(xt+h, xt−1 + xt + xt+1)
       = (1/3) [γxx(h+1) + γxx(h) + γxx(h−1)]
       = (1/3) ∫_{−1/2}^{1/2} (e^{2πiω} + 1 + e^{−2πiω}) e^{2πiωh} fxx(ω) dω
       = (1/3) ∫_{−1/2}^{1/2} [1 + 2 cos(2πω)] fxx(ω) e^{2πiωh} dω,

where we have used (4.11). Using the uniqueness of the Fourier transform, we argue from the spectral representation (4.79) that

fxy(ω) = (1/3)[1 + 2 cos(2πω)] fxx(ω)

so that the cross-spectrum is real in this case. From Example 4.5, the spectral density of yt is

fyy(ω) = (1/9)[3 + 4 cos(2πω) + 2 cos(4πω)] fxx(ω)
       = (1/9)[1 + 2 cos(2πω)]² fxx(ω),

using the identity cos(2α) = 2 cos²(α) − 1 in the last step. Substituting into (4.87) yields the squared coherence between xt and yt as unity over all


frequencies. This is a characteristic inherited by more general linear filters, as will be shown in Problem 4.23. However, if some noise is added to the three-point moving average, the coherence is not unity; these kinds of models will be considered in detail later.
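As a quick numerical illustration of this result (our own sketch, not part of the original example; the simulated series below are hypothetical), one can verify in R that the estimated squared coherence between a white noise series and its three-point moving average is essentially one:

set.seed(1)
x = rnorm(1024)                               # simulated white noise input
y = filter(x, rep(1/3, 3), sides=2)           # three-point moving average
ok = !is.na(y)                                # drop the endpoints lost to filtering
sr = spec.pgram(cbind(x[ok], y[ok]), kernel("daniell", 9), taper=0, plot=FALSE)
summary(sr$coh)   # squared coherence estimates; essentially one, except possibly for
                  # a dip near frequency 1/3, where the response (1+2*cos(2*pi*w))/3 vanishes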

Property 4.6 Spectral Representation of a Vector Stationary Process

If the p × p autocovariance function matrix

Γ(h) = E[(xt+h − µ)(xt − µ)′]

of a p-dimensional stationary time series, xt = (xt1, xt2, . . . , xtp)′, has elements satisfying

∑_{h=−∞}^{∞} |γjk(h)| < ∞   (4.88)

for all j, k = 1, . . . , p, then Γ (h) has the representation

Γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,   h = 0, ±1, ±2, ...,   (4.89)

as the inverse transform of the spectral density matrix, f(ω) = {fjk(ω)}, for j, k = 1, . . . , p, with elements equal to the cross-spectral components. The matrix f(ω) has the representation

f(ω) = ∑_{h=−∞}^{∞} Γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2.   (4.90)

Example 4.17 Spectral Matrix of a Bivariate Process

Consider a jointly stationary bivariate process (xt, yt). We arrange the autocovariances in the matrix

Γ(h) = [ γxx(h)  γxy(h) ]
       [ γyx(h)  γyy(h) ].

The spectral matrix would be given by

f(ω) = [ fxx(ω)  fxy(ω) ]
       [ fyx(ω)  fyy(ω) ],

where the Fourier transforms (4.89) and (4.90) relate the autocovariance and spectral matrices.

The extension of spectral estimation to vector series is fairly obvious. For the vector series xt = (xt1, xt2, . . . , xtp)′, we may use the vector of DFTs, say d(ωj) = (d1(ωj), d2(ωj), . . . , dp(ωj))′, and estimate the spectral matrix by


f̄(ω) = L⁻¹ ∑_{k=−m}^{m} I(ωj + k/n)   (4.91)

where now

I(ωj) = d(ωj) d*(ωj)   (4.92)

is a p × p complex matrix.15
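For readers who want the full complex cross-spectral matrix in R (spec.pgram reports the individual spectra together with the coherence and phase, but not the complex cross-spectra), the following is a bare-bones sketch of (4.91)-(4.92); the function name spec.matrix and the smoothing span L are our own hypothetical choices, and the simple wrap-around averaging is only a stand-in for the smoothed estimators of §4.6.

spec.matrix = function(xy, L=9){       # xy: an n x p matrix with the series as columns
  n = nrow(xy); p = ncol(xy); m = (L-1)/2
  d = mvfft(scale(xy, scale=FALSE))/sqrt(n)           # vector of DFTs d(omega_j)
  I = array(NA_complex_, dim=c(p,p,n))
  for (j in 1:n) I[,,j] = d[j,] %*% Conj(t(d[j,]))    # periodogram matrix (4.92)
  f = array(NA_complex_, dim=c(p,p,n))
  for (j in 1:n){                                     # average L ordinates as in (4.91)
    idx = ((j-m):(j+m) - 1) %% n + 1                  # indices wrap around the ends
    f[,,j] = apply(I[,,idx], c(1,2), mean)
  }
  f }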

Again, the series may be tapered before the DFT is taken in (4.91) and we can use weighted estimation,

f̂(ω) = ∑_{k=−m}^{m} hk I(ωj + k/n)   (4.93)

where {hk} are weights as defined in (4.56). The estimate of squared coherence between two series, yt and xt, is

ρ̂²y·x(ω) = |f̂yx(ω)|² / [f̂xx(ω) f̂yy(ω)].   (4.94)

If the spectral estimates in (4.94) are obtained using equal weights, we will write ρ̄²y·x(ω) for the estimate.

Under general conditions, if ρ²y·x(ω) > 0 then

|ρ̂y·x(ω)| ∼ AN( |ρy·x(ω)|, (1 − ρ²y·x(ω))² / 2Lh )   (4.95)

where Lh is defined in (4.57); the details of this result may be found in Brockwell and Davis (1991, Ch 11). We may use (4.95) to obtain approximate confidence intervals for the squared coherency ρ²y·x(ω).

We can test the hypothesis that ρ²y·x(ω) = 0 if we use ρ̄²y·x(ω) for the estimate with L > 1,16 that is,

ρ̄²y·x(ω) = |f̄yx(ω)|² / [f̄xx(ω) f̄yy(ω)].   (4.96)

In this case, under the null hypothesis, the statistic

F = [ρ̄²y·x(ω) / (1 − ρ̄²y·x(ω))] (L − 1)   (4.97)

has an approximate F-distribution with 2 and 2L − 2 degrees of freedom. When the series have been extended to length n′, we replace 2L − 2 by df − 2,

15 If Z is a complex matrix, then Z* = Z̄′ denotes the conjugate transpose operation. That is, Z* is the result of replacing each element of Z by its complex conjugate and transposing the resulting matrix.

16 If L = 1 then ρ̄²y·x(ω) ≡ 1.



Fig. 4.13. Squared coherency between the SOI and Recruitment series; L = 19, n = 453, n′ = 480, and α = .001. The horizontal line is C.001.

where df is defined in (4.52). Solving (4.97) for a particular significance level α leads to

Cα = F2,2L−2(α) / [L − 1 + F2,2L−2(α)]   (4.98)

as the approximate value that must be exceeded for the original squared coherence to be able to reject ρ²y·x(ω) = 0 at an a priori specified frequency.

Example 4.18 Coherence Between SOI and Recruitment

Figure 4.13 shows the squared coherence between the SOI and Recruitment series over a wider band than was used for the spectrum. In this case, we used L = 19, df = 2(19)(453/480) ≈ 36 and F2,df−2(.001) ≈ 8.53 at the significance level α = .001. Hence, we may reject the hypothesis of no coherence for values of ρ̄²y·x(ω) that exceed C.001 = .32. We emphasize that this method is crude because, in addition to the fact that the F-statistic is approximate, we are examining the squared coherence across all frequencies with the Bonferroni inequality, (4.55), in mind. Figure 4.13 also exhibits confidence bands as part of the R plotting routine. We emphasize that these bands are only valid for ω where ρ²y·x(ω) > 0.

In this case, the seasonal frequency and the El Niño frequencies ranging between about 3 and 7 year periods are strongly coherent. Other frequencies are also strongly coherent, although the strong coherence is less impressive because the underlying power spectrum at these higher frequencies is fairly


small. Finally, we note that the coherence is persistent at the seasonal harmonic frequencies.

This example may be reproduced using the following R commands.

sr = spec.pgram(cbind(soi,rec), kernel("daniell",9), taper=0, plot=FALSE)
sr$df                       # df = 35.8625
f = qf(.999, 2, sr$df-2)    # = 8.529792
C = f/(18+f)                # = 0.318878
plot(sr, plot.type = "coh", ci.lty = 2)
abline(h = C)

4.8 Linear Filters

Some of the examples of the previous sections have hinted at the possibility the distribution of power or variance in a time series can be modified by making a linear transformation. In this section, we explore that notion further by defining a linear filter and showing how it can be used to extract signals from a time series. The linear filter modifies the spectral characteristics of a time series in a predictable way, and the systematic development of methods for taking advantage of the special properties of linear filters is an important topic in time series analysis.

A linear filter uses a set of specified coefficients aj, for j = 0, ±1, ±2, . . ., to transform an input series, xt, producing an output series, yt, of the form

yt = ∑_{j=−∞}^{∞} aj xt−j,    ∑_{j=−∞}^{∞} |aj| < ∞.   (4.99)

The form (4.99) is also called a convolution in some statistical contexts. The coefficients, collectively called the impulse response function, are required to satisfy absolute summability so yt in (4.99) exists as a limit in mean square and the infinite Fourier transform

Ayx(ω) = ∑_{j=−∞}^{∞} aj e^{−2πiωj},   (4.100)

called the frequency response function, is well defined. We have already encountered several linear filters, for example, the simple three-point moving average in Example 4.16, which can be put into the form of (4.99) by letting a−1 = a0 = a1 = 1/3 and taking aj = 0 for |j| ≥ 2.
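As a small aside (not part of the text), the frequency response of any finite filter can be evaluated directly in R from (4.100); the helper below, with the hypothetical name freq.resp, simply sums the complex exponentials over the filter coefficients.

freq.resp = function(a, J, w = seq(0, .5, length=500)){
  j = -J:J                                    # filter lags; a holds a_{-J},...,a_J in order
  A = sapply(w, function(om) sum(a*exp(-2i*pi*om*j)))
  list(freq=w, A=A, power=Mod(A)^2) }         # complex response and its squared modulus
# e.g., the three-point moving average of Example 4.16:
# fr = freq.resp(rep(1/3,3), J=1); plot(fr$freq, fr$power, type="l")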

The importance of the linear filter stems from its ability to enhance certain parts of the spectrum of the input series. To see this, assuming that xt is stationary with spectral density fxx(ω), the autocovariance function of the filtered output yt in (4.99) can be derived as


γyy(h) = cov(yt+h, yt)
       = cov( ∑_r ar xt+h−r, ∑_s as xt−s )
       = ∑_r ∑_s ar γxx(h − r + s) as
       = ∑_r ∑_s ar [ ∫_{−1/2}^{1/2} e^{2πiω(h−r+s)} fxx(ω) dω ] as
       = ∫_{−1/2}^{1/2} ( ∑_r ar e^{−2πiωr} )( ∑_s as e^{2πiωs} ) e^{2πiωh} fxx(ω) dω
       = ∫_{−1/2}^{1/2} e^{2πiωh} |Ayx(ω)|² fxx(ω) dω,

where we have first replaced γxx(·) by its representation (4.11) and then substituted Ayx(ω) from (4.100). The computation is one we do repeatedly, exploiting the uniqueness of the Fourier transform. Now, because the left-hand side is the Fourier transform of the spectral density of the output, say, fyy(ω), we get the important filtering property as follows.

Property 4.7 Output Spectrum of a Filtered Stationary Series
The spectrum of the filtered output yt in (4.99) is related to the spectrum of the input xt by

fyy(ω) = |Ayx(ω)|² fxx(ω),   (4.101)

where the frequency response function Ayx(ω) is defined in (4.100).

The result (4.101) enables us to calculate the exact effect on the spectrum of any given filtering operation. This important property shows the spectrum of the input series is changed by filtering and the effect of the change can be characterized as a frequency-by-frequency multiplication by the squared magnitude of the frequency response function. Again, an obvious analogy to a property of the variance in classical statistics holds, namely, if x is a random variable with variance σ²x, then y = ax will have variance σ²y = a²σ²x, so the variance of the linearly transformed random variable is changed by multiplication by a² in much the same way as the linearly filtered spectrum is changed in (4.101).

Finally, we mention that Property 4.3, which was used to get the spectrum of an ARMA process, is just a special case of Property 4.7 where, in (4.99), xt = wt is white noise, in which case fxx(ω) = σ²w, and aj = ψj, in which case

Ayx(ω) = ψ(e^{−2πiω}) = θ(e^{−2πiω})/φ(e^{−2πiω}).
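As a small check of this connection (our own sketch, with arbitrary example parameters φ = .9, θ = .5, σ²w = 1), the ARMA(1,1) spectral density can be computed directly from this frequency response in R:

w = seq(0, .5, length=500)
z = exp(-2i*pi*w)
phi = .9; theta = .5; sigw2 = 1                   # example ARMA(1,1) parameters
spec = sigw2 * Mod((1 + theta*z)/(1 - phi*z))^2   # |psi(e^{-2 pi i w})|^2 * sigma_w^2
plot(w, spec, type="l", xlab="frequency", ylab="spectrum")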



Fig. 4.14. SOI series (top) compared with the differenced SOI (middle) and a centered 12-month moving average (bottom).

Example 4.19 First Difference and Moving Average Filters

We illustrate the effect of filtering with two common examples, the first difference filter

yt = ∇xt = xt − xt−1

and the symmetric moving average filter

yt = (1/24)(xt−6 + xt+6) + (1/12) ∑_{r=−5}^{5} xt−r,

which is a modified Daniell kernel with m = 6. The results of filtering the SOI series using the two filters are shown in the middle and bottom panels of Figure 4.14. Notice that the effect of differencing is to roughen the series because it tends to retain the higher or faster frequencies. The centered



Fig. 4.15. Spectral analysis of SOI after applying a 12-month moving average filter. The vertical line corresponds to the 52-month cycle.

moving average smoothes the series because it retains the lower frequencies and tends to attenuate the higher frequencies. In general, differencing is an example of a high-pass filter because it retains or passes the higher frequencies, whereas the moving average is a low-pass filter because it passes the lower or slower frequencies.

Notice that the slower periods are enhanced in the symmetric moving average and the seasonal or yearly frequencies are attenuated. The filtered series makes about 9 cycles in the length of the data (about one cycle every 52 months) and the moving average filter tends to enhance or extract the signal that is associated with El Niño. Moreover, by the low-pass filtering of the data, we get a better sense of the El Niño effect and its irregularity. Figure 4.15 shows the results of a spectral analysis on the low-pass filtered SOI series. It is clear that all high frequency behavior has been removed and the El Niño cycle is accentuated; the dotted vertical line in the figure corresponds to the 52-month cycle.

Now, having done the filtering, it is essential to determine the exact way in which the filters change the input spectrum. We shall use (4.100) and (4.101) for this purpose. The first difference filter can be written in the form (4.99) by letting a0 = 1, a1 = −1, and ar = 0 otherwise. This implies that

Ayx(ω) = 1− e−2πiω,

and the squared frequency response becomes

|Ayx(ω)|2 = (1− e−2πiω)(1− e2πiω) = 2[1− cos(2πω)]. (4.102)


The top panel of Figure 4.16 shows that the first difference filter will attenuate the lower frequencies and enhance the higher frequencies because the multiplier of the spectrum, |Ayx(ω)|², is large for the higher frequencies and small for the lower frequencies. Generally, the slow rise of this kind of filter does not particularly recommend it as a procedure for retaining only the high frequencies.

For the centered 12-month moving average, we can take a−6 = a6 = 1/24, ak = 1/12 for −5 ≤ k ≤ 5 and ak = 0 elsewhere. Substituting and recognizing the cosine terms gives

Ayx(ω) = (1/12)[1 + cos(12πω) + 2 ∑_{k=1}^{5} cos(2πωk)].   (4.103)

Plotting the squared frequency response of this function as in Figure 4.16 shows that we can expect this filter to cut most of the frequency content above .05 cycles per point. This corresponds to eliminating periods shorter than T = 1/.05 = 20 points. In particular, this drives down the yearly components with periods of T = 12 months and enhances the El Niño frequency, which is somewhat lower. The filter is not completely efficient at attenuating high frequencies; some power contributions are left at higher frequencies, as shown in the function |Ayx(ω)|² and in the spectrum of the moving average shown in Figure 4.3.

The following R session shows how to filter the data, perform the spectral analysis of this example, and plot the squared frequency response curve of the difference filter.

par(mfrow=c(3,1))
plot(soi)                           # plot data
plot(diff(soi))                     # plot first difference
k = kernel("modified.daniell", 6)   # filter weights
plot(soif <- kernapply(soi, k))     # plot 12 month filter
dev.new()
spectrum(soif, spans=9, log="no")   # spectral analysis
abline(v=12/52, lty="dashed")
dev.new()
w = seq(0, .5, length=500)          # frequency response
FR = abs(1-exp(2i*pi*w))^2
plot(w, FR, type="l")
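A companion calculation (not in the original session) plots the squared frequency response (4.103) of the centered 12-month moving average, reproducing the lower panel of Figure 4.16:

Ama = (1 + cos(12*pi*w) + 2*rowSums(outer(w, 1:5, function(w,k) cos(2*pi*w*k))))/12
plot(w, Ama^2, type="l", xlab="frequency", ylab="power")
abline(v=1/12, lty="dashed")   # the yearly frequency, which this filter attenuates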

The two filters discussed in the previous example were different in that the frequency response function of the first difference was complex-valued, whereas the frequency response of the moving average was purely real. A short derivation similar to that used to verify (4.101) shows, when xt and yt are related by the linear filter relation (4.99), the cross-spectrum satisfies

fyx(ω) = Ayx(ω)fxx(ω),

so the frequency response is of the form



Fig. 4.16. Squared frequency response functions of the first difference and 12-month moving average filters.

Ayx(ω) = fyx(ω)/fxx(ω)   (4.104)
       = cyx(ω)/fxx(ω) − i qyx(ω)/fxx(ω),   (4.105)

where we have used (4.81) to get the last form. Then, we may write (4.105) in polar coordinates as

Ayx(ω) = |Ayx(ω)| exp{−i φyx(ω)}, (4.106)

where the amplitude and phase of the filter are defined by

|Ayx(ω)| = √(c²yx(ω) + q²yx(ω)) / fxx(ω)   (4.107)

and

φyx(ω) = tan⁻¹( −qyx(ω)/cyx(ω) ).   (4.108)

A simple interpretation of the phase of a linear filter is that it exhibits time delays as a function of frequency in the same way as the spectrum represents the variance as a function of frequency. Additional insight can be gained by considering the simple delaying filter


yt = Axt−D,

where the series gets replaced by a version, amplified by multiplying by A and delayed by D points. For this case,

fyx(ω) = Ae−2πiωDfxx(ω),

and the amplitude is |A|, and the phase is

φyx(ω) = −2πωD,

or just a linear function of frequency ω. For this case, applying a simple time delay causes phase delays that depend on the frequency of the periodic component being delayed. Interpretation is further enhanced by setting

xt = cos(2πωt),

in which case

yt = A cos(2πωt − 2πωD).

Thus, the output series, yt, has the same period as the input series, xt, but the amplitude of the output has increased by a factor of |A| and the phase has been changed by a factor of −2πωD.

Example 4.20 Difference and Moving Average Filters

We consider calculating the amplitude and phase of the two filters discussed in Example 4.19. The case for the moving average is easy because Ayx(ω) given in (4.103) is purely real. So, the amplitude is just |Ayx(ω)| and the phase is φyx(ω) = 0. In general, symmetric (aj = a−j) filters have zero phase. The first difference, however, changes this, as we might expect from the example above involving the time delay filter. In this case, the squared amplitude is given in (4.102). To compute the phase, we write

Ayx(ω) = 1 − e^{−2πiω} = e^{−iπω}(e^{iπω} − e^{−iπω})
       = 2i e^{−iπω} sin(πω) = 2 sin²(πω) + 2i cos(πω) sin(πω)
       = cyx(ω)/fxx(ω) − i qyx(ω)/fxx(ω),

so

φyx(ω) = tan⁻¹( −qyx(ω)/cyx(ω) ) = tan⁻¹( cos(πω)/sin(πω) ).

Noting that

cos(πω) = sin(−πω + π/2)

and that

sin(πω) = cos(−πω + π/2),

we get

φyx(ω) = −πω + π/2,

and the phase is again a linear function of frequency.


The above tendency of the frequencies to arrive at different times in the filtered version of the series remains as one of two annoying features of the difference type filters. The other weakness is the gentle increase in the frequency response function. If low frequencies are really unimportant and high frequencies are to be preserved, we would like to have a somewhat sharper response than is obvious in Figure 4.16. Similarly, if low frequencies are important and high frequencies are not, the moving average filters are also not very efficient at passing the low frequencies and attenuating the high frequencies. Improvement is possible by using longer filters, obtained by approximations to the infinite inverse Fourier transform. The design of filters will be discussed in §4.10 and §4.11.

We will occasionally use results for multivariate series xt = (xt1, . . . , xtp)′ that are comparable to the simple property shown in (4.101). Consider the matrix filter

yt = ∑_{j=−∞}^{∞} Aj xt−j,   (4.109)

where {Aj} denotes a sequence of q × p matrices such that ∑_{j=−∞}^{∞} ‖Aj‖ < ∞ and ‖·‖ denotes any matrix norm, xt = (xt1, . . . , xtp)′ is a p × 1 stationary vector process with mean vector µx and p × p matrix covariance function Γxx(h) and spectral matrix fxx(ω), and yt is the q × 1 vector output process. Then, we can obtain the following property.

Property 4.8 Output Spectral Matrix of a Linearly Filtered Stationary Vector Series

The spectral matrix of the filtered output yt in (4.109) is related to the spectrum of the input xt by

fyy(ω) = A(ω)fxx(ω)A∗(ω), (4.110)

where the matrix frequency response function A(ω) is defined by

A(ω) = ∑_{j=−∞}^{∞} Aj exp(−2πiωj).   (4.111)

4.9 Dynamic Fourier Analysis and Wavelets

If a time series, xt, is stationary, its second-order behavior remains the same, regardless of the time t. It makes sense to match a stationary time series with sines and cosines because they, too, behave the same forever. Indeed, based on the Spectral Representation Theorem (Appendix C, §C.1), we may regard a stationary series as the superposition of sines and cosines that oscillate at various frequencies. As seen in this text, however, many time series are not stationary. Typically, the data are coerced into stationarity via transformations, or we restrict attention to parts of the data where stationarity appears


to adhere. In some cases, the nonstationarity of a time series is of interest. That is to say, it is the local behavior of the process, and not the global behavior of the process, that is of concern to the investigator. As a case in point, we mention the explosion and earthquake series first presented in Example 1.7 (see Figure 1.7). The following example emphasizes the importance of dynamic (or time-frequency) Fourier analysis.

Example 4.21 Dynamic Spectral Analysis of Seismic Traces

Consider the earthquake and explosion series displayed in Figure 1.7; it should be apparent that the dynamics of the series are changing with time. The goal of this analysis is to summarize the spectral behavior of the signal as it evolves over time.

First, a spectral analysis is performed on a short section of the data. Then, the section is shifted, and a spectral analysis is performed on the new section. This process is repeated until the end of the data, and the results are shown as an image in Figures 4.17 and 4.18; in the images, darker areas correspond to higher power. Specifically, in this example, let xt, for t = 1, . . . , 2048, represent the series of interest. Then, the sections of the data that were analyzed were {xtk+1, . . . , xtk+256}, for tk = 128k, and k = 0, 1, . . . , 14; e.g., the first section analyzed is {x1, . . . , x256}, the second section analyzed is {x129, . . . , x384}, and so on. Each section of 256 observations was tapered using a cosine bell, and spectral estimation was performed using a repeated Daniell kernel with weights (1/9){1, 2, 3, 2, 1}; see page 204. The sections overlap each other; however, this practice is not necessary and sometimes not desirable.17

The results of the dynamic analysis are shown as the estimated spectra for frequencies up to 10 Hz (the folding frequency is 20 Hz) for each starting location (time), tk = 128k, with k = 0, 1, . . . , 14. The S component for the earthquake shows power at the low frequencies only, and the power remains strong for a long time. In contrast, the explosion shows power at higher frequencies than the earthquake, and the power of the signals (P and S waves) does not last as long as in the case of the earthquake.

The following is an R session that corresponds to the analysis of the explosion series. The images are generated using filled.contour() on the log of the power; this, as well as using a gray scale and limiting the number of levels, was done to produce a decent black-and-white graphic. The images look better in color, so we advise removing the nlevels=... and the col=gray(...) parts of the code. We also include the code for obtaining a

17 A number of technical problems exist in this setting because the process of interest is nonstationary and we have not specified the nature of the nonstationarity. In addition, overlapping intervals complicate matters by introducing another layer of dependencies among the spectra. Consequently, the spectral estimates of contiguous sections are dependent in a non-trivial way that we have not specified. Nevertheless, as seen from this example, dynamic spectral analysis can be a helpful tool in summarizing the local behavior of a time series.



Fig. 4.17. Time-frequency image for the dynamic Fourier analysis of the earthquake series shown in Figure 1.7.

three-dimensional graphic to display the information; however, the graphic is not exhibited in the text.

nobs = length(EXP6)               # number of observations
wsize = 256                       # window size
overlap = 128                     # overlap
ovr = wsize-overlap
nseg = floor(nobs/ovr)-1          # number of segments
krnl = kernel("daniell", c(1,1))  # kernel
ex.spec = matrix(0, wsize/2, nseg)
for (k in 1:nseg) {
  a = ovr*(k-1)+1
  b = wsize+ovr*(k-1)
  ex.spec[,k] = spectrum(EXP6[a:b], krnl, taper=.5, plot=F)$spec }
x = seq(0, 10, len = nrow(ex.spec)/2)
y = seq(0, ovr*nseg, len = ncol(ex.spec))
z = ex.spec[1:(nrow(ex.spec)/2),]
filled.contour(x, y, log(z), ylab="time", xlab="frequency (Hz)",
  nlevels=12, col=gray(11:0/11), main="Explosion")
persp(x, y, z, zlab="Power", xlab="frequency (Hz)", ylab="time",
  ticktype="detailed", theta=25, d=2, main="Explosion")  # not shown

One way to view the time-frequency analysis of Example 4.21 is to consider it as being based on local transforms of the data xt of the form



Fig. 4.18. Time-frequency image for the dynamic Fourier analysis of the explosion series shown in Figure 1.7.

dj,k = n^{−1/2} ∑_{t=1}^{n} xt ψj,k(t),   (4.112)

where

ψj,k(t) = { (n/m)^{1/2} ht e^{−2πitj/m},   t ∈ [tk + 1, tk + m],
          { 0,                              otherwise,           (4.113)

where ht is a taper and m is some fraction of n. In Example 4.21, n = 2048, m = 256, tk = 128k, for k = 0, 1, . . . , 14, and ht was a cosine bell taper over 256 points. In (4.112) and (4.113), j indexes frequency, ωj = j/m, for j = 1, 2, . . . , [m/2], and k indexes the location, or time shift, of the transform. In this case, the transforms are based on tapered cosines and sines that have been zeroed out over various regions in time. The key point here is that the transforms are based on local sinusoids. Figure 4.19 shows an example of four local, tapered cosine functions at various frequencies. In that figure, the length of the data is considered to be one, and the cosines are localized to a fourth of the data length.

In addition to dynamic Fourier analysis as a method to overcome the restriction of stationarity, researchers have sought various alternative methods. A recent, and successful, alternative is wavelet analysis. The website



Fig. 4.19. Local, tapered cosines at various frequencies.

http://www.wavelet.org is devoted to wavelets, which includes information about books, technical papers, software, and links to other sites. In addition, we mention the monograph on wavelets by Daubechies (1992), the text by Percival and Walden (2000), and we note that many statistical software manufacturers have wavelet modules that sit on top of their base package. In this section, we rely primarily on the S-PLUS wavelets module (with a manual written by Bruce and Gao, 1996); however, we will present some R code where possible. The basic idea of wavelet analysis is to imitate dynamic Fourier analysis, but with functions (wavelets) that may be better suited to capture the local behavior of nonstationary time series.

Wavelets come in families generated by a father wavelet, φ, and a mother wavelet, ψ. The father wavelets are used to capture the smooth, low-frequency nature of the data, whereas the mother wavelets are used to capture the detailed, high-frequency nature of the data. The father wavelet integrates to one, and the mother wavelet integrates to zero:

∫ φ(t) dt = 1   and   ∫ ψ(t) dt = 0.   (4.114)

For a simple example, consider the Haar function,

ψ(t) = {  1,   0 ≤ t < 1/2,
       { −1,   1/2 ≤ t < 1,
       {  0,   otherwise.        (4.115)
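Because the Haar function has a simple closed form, it is easy to plot directly in R; the following short sketch (ours, not taken from any wavelet package) evaluates (4.115) on a grid:

haar = function(t) ifelse(t >= 0 & t < 1/2, 1, ifelse(t >= 1/2 & t < 1, -1, 0))
tt = seq(-.5, 1.5, by=.001)
plot(tt, haar(tt), type="l", xlab="t", ylab="psi(t)")   # the Haar mother wavelet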



Fig. 4.20. Father and mother daublet4 wavelets (top row); father and mother symmlet8 wavelets (bottom row).

The father in this case is φ(t) = 1 for t ∈ [0, 1) and zero otherwise. The Haar functions are useful for demonstrating properties of wavelets, but they do not have good time-frequency localization properties. Figure 4.20 displays two of the more commonly used wavelets that are available with the S-PLUS wavelets module, the daublet4 and symmlet8 wavelets, which are described in detail in Daubechies (1992). The number after the name refers to the width and smoothness of the wavelet; for example, the symmlet10 wavelet is wider and smoother than the symmlet8 wavelet. Daublets are one of the first types of continuous orthogonal wavelets with compact support, and symmlets were constructed to be closer to symmetry than daublets. In general, wavelets do not have an analytical form, but instead they are generated using numerical methods.

Figure 4.20 was generated in S-PLUS using the wavelet module as follows:18

d4f <- wavelet("d4", mother=F)

18 At this time, the R packages available for wavelet analysis are not extensive enough for our purposes, hence we will rely on S-PLUS for some of the demonstrations. We will provide R code when possible, and that will be based on the wavethresh package (version 4.2-1) that accompanies Nason (2008).



Fig. 4.21. Scaled and translated daublet4 wavelets, ψ1,0(t) and ψ2,1(t) (top row); scaled and translated symmlet8 wavelets, ψ1,0(t) and ψ2,1(t) (bottom row).

d4m <- wavelet("d4")
s8f <- wavelet("s8", mother=F)
s8m <- wavelet("s8")
par(mfrow=c(2,2))
plot(d4f); plot(d4m)
plot(s8f); plot(s8m)

It is possible to draw some wavelets in R using the wavethresh package. In that package, daublets are called DaubExPhase and symmlets are called DaubLeAsymm. The following R session displays some of the available wavelets (this will produce a figure similar to Figure 4.20) and it assumes the wavethresh package has been downloaded and installed (see Appendix R, §R.2, for details on installing packages). The filter.number determines the width and smoothness of the wavelet.

library(wavethresh)
par(mfrow=c(2,2))
draw(filter.number=4, family="DaubExPhase", enhance=FALSE, main="")
draw(filter.number=8, family="DaubExPhase", enhance=FALSE, main="")
draw(filter.number=4, family="DaubLeAsymm", enhance=FALSE, main="")
draw(filter.number=8, family="DaubLeAsymm", enhance=FALSE, main="")


When we depart from periodic functions, such as sines and cosines, the precise meaning of frequency, or cycles per unit time, is lost. When using wavelets, we typically refer to scale rather than frequency. The orthogonal wavelet decomposition of a time series, xt, for t = 1, . . . , n is

xt = ∑_k sJ,k φJ,k(t) + ∑_k dJ,k ψJ,k(t) + ∑_k dJ−1,k ψJ−1,k(t) + · · · + ∑_k d1,k ψ1,k(t),   (4.116)

where J is the number of scales, and k ranges from one to the number of coefficients associated with the specified component (see Example 4.22). In (4.116), the wavelet functions φJ,k(t), ψJ,k(t), ψJ−1,k(t), . . . , ψ1,k(t) are generated from the father wavelet, φ(t), and the mother wavelet, ψ(t), by translation (shift) and scaling:

φJ,k(t) = 2^{−J/2} φ( (t − 2^J k) / 2^J ),   (4.117)

ψj,k(t) = 2^{−j/2} ψ( (t − 2^j k) / 2^j ),   j = 1, . . . , J.   (4.118)

The choice of dyadic shifts and scales is arbitrary but convenient. The shift or translation parameter is 2^j k, and the scale parameter is 2^j. The wavelet functions are spread out and shorter for larger values of j (or scale parameter 2^j) and tall and narrow for small values of the scale. Figure 4.21 shows ψ1,0(t) and ψ2,1(t) generated from the daublet4 (top row), and the symmlet8 (bottom row) mother wavelets. We may think of 1/2^j (or 1/scale) in wavelet analysis as being the analogue of frequency (ωj = j/n) in Fourier analysis. For example, when j = 1, the scale parameter of 2 is akin to the Nyquist frequency of 1/2, and when j = 6, the scale parameter of 2^6 is akin to a low frequency (1/2^6 ≈ 0.016). In other words, larger values of the scale refer to slower, smoother (or coarser) movements of the signal, and smaller values of the scale refer to faster, choppier (or finer) movements of the signal. Figure 4.21 was generated in S-PLUS using the wavelet module as follows:

d4.1 <- wavelet("d4", level=1, shift=0)
d4.2 <- wavelet("d4", level=2, shift=1)
s8.1 <- wavelet("s8", level=1, shift=0)
s8.2 <- wavelet("s8", level=2, shift=1)
par(mfrow=c(2,2))
plot(d4.1, ylim=c(-.8,.8), xlim=c(-6,20))
plot(d4.2, ylim=c(-.8,.8), xlim=c(-6,20))
plot(s8.1, ylim=c(-.8,.8), xlim=c(-6,20))
plot(s8.2, ylim=c(-.8,.8), xlim=c(-6,20))

The discrete wavelet transform (DWT) of the data xt are the coefficients sJ,k and dj,k for j = J, J − 1, . . . , 1, in (4.116). To some degree of approximation,



Fig. 4.22. Discrete wavelet transform of the earthquake series using the symmlet8 wavelets, and J = 6 levels of scale.

they are given by19

sJ,k = n^{−1/2} ∑_{t=1}^{n} xt φJ,k(t),   (4.119)

dj,k = n^{−1/2} ∑_{t=1}^{n} xt ψj,k(t),   j = J, J − 1, . . . , 1.   (4.120)

It is the magnitudes of the coefficients that measure the importance of the corresponding wavelet term in describing the behavior of xt. As in Fourier analysis, the DWT is not computed as shown but is calculated using a fast algorithm. The sJ,k are called the smooth coefficients because they represent the smooth behavior of the data. The dj,k are called the detail coefficients because they tend to represent the finer, more high-frequency nature of the data.

Example 4.22 Wavelet Analysis of Earthquake and Explosion

Figures 4.22 and 4.23 show the DWTs, based on the symmlet8 wavelet basis, for the earthquake and explosion series, respectively. Each series is of

19 The actual DWT coefficients are defined via a set of filters whose coefficients are close to what you would get by sampling the father and mother wavelets, but not exactly so; see the discussion surrounding Figures 471 and 478 in Percival and Walden (2000).



Fig. 4.23. Discrete wavelet transform of the explosion series using the symmlet8 wavelets and J = 6 levels of scale.

length n = 2^{11} = 2048, and in this example, the DWTs are calculated using J = 6 levels. In this case, n/2 = 2^{10} = 1024 values are in d1 = {d1,k; k = 1, . . . , 2^{10}}, n/2² = 2⁹ = 512 values are in d2 = {d2,k; k = 1, . . . , 2⁹}, and so on, until finally, n/2⁶ = 2⁵ = 32 values are in d6 and in s6. The detail values d1,k, . . . , d6,k are plotted at the same scale, and hence, the relative importance of each value can be seen from the graph. The smooth values s6,k are typically larger than the detail values and plotted on a different scale. The top of Figures 4.22 and 4.23 show the inverse DWT (IDWT) computed from all of the coefficients. The displayed IDWT is a reconstruction of the data, and it reproduces the data except for round-off error.

Comparing the DWTs, the earthquake is best represented by wavelets with larger scale than the explosion. One way to measure the importance of each level, d1, d2, . . . , d6, s6, is to evaluate the proportion of the total power (or energy) explained by each. The total power of a time series xt, for t = 1, . . . , n, is TP = ∑_{t=1}^{n} xt². The total power associated with each level of scale is (recall n = 2^{11}),

TP_{s6} = ∑_{k=1}^{n/2⁶} s²6,k   and   TP_{dj} = ∑_{k=1}^{n/2^j} d²j,k,   j = 1, . . . , 6.

Because we are working with an orthogonal basis, we have


Table 4.2. Fraction of Total Power

Component   Earthquake   Explosion
   s6          0.009        0.002
   d6          0.043        0.002
   d5          0.377        0.007
   d4          0.367        0.015
   d3          0.160        0.559
   d2          0.040        0.349
   d1          0.003        0.066

TP = TP_{s6} + ∑_{j=1}^{6} TP_{dj},

and the proportion of the total power explained by each level of detail would be the ratios TP_{dj}/TP for j = 1, . . . , 6, and for the smooth level, it would be TP_{s6}/TP. These values are listed in Table 4.2. From that table, nearly 80% of the total power of the earthquake series is explained by the higher scale details d4 and d5, whereas 90% of the total power is explained by the smaller scale details d2 and d3 for the explosion.

Figures 4.24 and 4.25 show the time-scale plots (or scalograms) based on the DWT of the earthquake series and the explosion series, respectively. These figures are the wavelet analog of the time-frequency plots shown in Figures 4.17 and 4.18. The power axis represents the magnitude of each value dj,k or s6,k. The time axis matches the time axis in the DWTs shown in Figures 4.22 and 4.23, and the scale axis is plotted as 1/scale, listed from the coarsest scale to the finest scale. On the 1/scale axis, the coarsest scale values, represented by the smooth coefficients s6, are plotted over the range [0, 2⁻⁶), the coarsest detail values, d6, are plotted over [2⁻⁶, 2⁻⁵), and so on. In these figures, we did not plot the finest scale values, d1, so the finest scale values exhibited in Figures 4.24 and 4.25 are in d2, which are plotted over the range [2⁻², 2⁻¹).

The conclusions drawn from these plots are the same as those drawn from Figures 4.17 and 4.18. That is, the S wave for the earthquake shows power at the high scales (or low 1/scale) only, and the power remains strong for a long time. In contrast, the explosion shows power at smaller scales (or higher 1/scale) than the earthquake, and the power of the signals (P and S waves) does not last as long as in the case of the earthquake.

Assuming the data files EQ5 and EXP6 have been read into S-PLUS, the analyses of this example can be performed using the S-PLUS wavelets module (which must be loaded prior to the analyses) as follows:

eq <- scale(EQ5)
ex <- scale(EXP6)
eq.dwt <- dwt(eq)



Fig. 4.24. Time-scale image (scalogram) of the earthquake series.

ex.dwt <- dwt(ex)
plot(eq.dwt)
plot(ex.dwt)
# energy distributions (Table 4.2)
dotchart(eq.dwt)   # a graphic
summary(eq.dwt)    # numerical details
dotchart(ex.dwt)
summary(ex.dwt)
# time scale plots
time.scale.plot(eq.dwt)
time.scale.plot(ex.dwt)

Similar analyses may be performed in R using the wavelets, wavethresh, or waveslim packages. We exhibit the analysis for the earthquake series using wavethresh, assuming it has been downloaded and installed.20

library(wavethresh)
eq = scale(EQ5)   # standardize the series
ex = scale(EXP6)
eq.dwt = wd(eq, filter.number=8)
ex.dwt = wd(ex, filter.number=8)

20 In wavethresh, the transforms are denoted by the resolution rather than the scale. If the series is of length n = 2^p, then resolution p − i corresponds to level i for i = 1, . . . , p.



Fig. 4.25. Time-scale image (scalogram) of the explosion series.

# plot the wavelet transforms
par(mfrow = c(1,2))
plot(eq.dwt, main="Earthquake")
plot(ex.dwt, main="Explosion")
# total power
TPe = rep(NA,11)   # for the earthquake series
for (i in 0:10){TPe[i+1] = sum(accessD(eq.dwt, level=i)^2)}
TotEq = sum(TPe)   # check with sum(eq^2)
TPx = rep(NA,11)   # for the explosion series
for (i in 0:10){TPx[i+1] = sum(accessD(ex.dwt, level=i)^2)}
TotEx = sum(TPx)   # check with sum(ex^2)
# make a nice table
Power = round(cbind(11:1, 100*TPe/TotEq, 100*TPx/TotEx), digits=3)
colnames(Power) = c("Level", "EQ(%)", "EXP(%)")
Power

Wavelets can be used to perform nonparametric smoothing along the lines first discussed in §2.4, but with an emphasis on localized behavior. Although a considerable amount of literature exists on this topic, we will present the basic ideas. For further information, we refer the reader to Donoho and Johnstone (1994, 1995). As in §2.4, we suppose the data xt can be written in terms of a signal plus noise model as

xt = st + εt. (4.121)



Fig. 4.26. Waveshrink estimates of the earthquake and explosion signals.

The goal here is to remove the noise from the data, and obtain an estimate of the signal, st, without having to specify a parametric form of the signal. The technique based on wavelets is referred to as waveshrink.

The basic idea behind waveshrink is to shrink the wavelet coefficients in the DWT of xt toward zero in an attempt to denoise the data and then to estimate the signal via (4.116) with the new coefficients. One obvious way to shrink the coefficients toward zero is to simply zero out any coefficient smaller in magnitude than some predetermined value, λ. Such a shrinkage rule is discontinuous and sometimes it is preferable to use a continuous shrinkage function. One such method, termed soft shrinkage, proceeds as follows. If the value of a coefficient is a, we set that coefficient to zero if |a| ≤ λ, and to sign(a)(|a| − λ) if |a| > λ. The choice of a shrinkage method is based on


the goal of the signal extraction. This process entails choosing a value for the shrinkage threshold, λ, and we may wish to use a different threshold value, say, λj, for each level of scale j = 1, . . . , J. One particular method that works well if we are interested in a relatively high degree of smoothness in the estimate is to choose λ = σ̂ε √(2 log n) for all scale levels, where σ̂ε is an estimate of the scale of the noise, σε. Typically a robust estimate of σε is used, e.g., the median of the absolute deviations of the data from the median (MAD). For other thresholding techniques or for a better understanding of waveshrink, see Donoho and Johnstone (1994, 1995), or the S-PLUS wavelets module manual (Bruce and Gao, 1996, Ch 6).
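For concreteness, soft shrinkage with the universal threshold is only a few lines of R; this is a generic sketch (the function name soft and the use of mad() as the robust scale estimate are our own choices), not the waveshrink or threshold routines used below.

soft = function(a, lambda) sign(a)*pmax(abs(a) - lambda, 0)   # soft shrinkage rule
# hypothetical usage on a vector d of detail coefficients from a series of length n:
#   lam = mad(d)*sqrt(2*log(n))    # universal threshold with a robust scale estimate
#   d.shrunk = soft(d, lam)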

Example 4.23 Waveshrink Analysis of Earthquake and Explosion

Figure 4.26 shows the results of a waveshrink analysis on the earthquake and explosion series. In this example, soft shrinkage was used with a universal threshold of λ = σ̂ε √(2 log n) where σ̂ε is the MAD. Figure 4.26 displays the data xt, the estimated signal ŝt, as well as the residuals xt − ŝt. According to this analysis, the earthquake is mostly signal and characterized by prolonged energy, whereas the explosion is comprised of short bursts of energy.

Figure 4.26 was generated in S-PLUS using the wavelets module. For example, the analysis of the earthquake series was performed as follows.

eq.dwt <- dwt(eq)
eq.shrink <- waveshrink(eq.dwt, shrink.rule="universal", shrink.fun="soft")

In R, using the wavethresh package, use the following commands for the earthquake series.

library(wavethresh)
eq = scale(EQ5)
par(mfrow=c(3,1))
eq.dwt = wd(eq, filter.number=8)
eq.smo = wr(threshold(eq.dwt, levels=5:10))
ts.plot(eq, main="Earthquake", ylab="Data")
ts.plot(eq.smo, ylab="Signal")
ts.plot(eq-eq.smo, ylab="Resid")

4.10 Lagged Regression Models

One of the intriguing possibilities offered by the coherence analysis of the relation between the SOI and Recruitment series discussed in Example 4.18 would be extending classical regression to the analysis of lagged regression models of the form

yt = ∑_{r=−∞}^{∞} βr xt−r + vt,   (4.122)

where vt is a stationary noise process, xt is the observed input series, and yt is the observed output series. We are interested in estimating the filter


coefficients βr relating the adjacent lagged values of xt to the output series yt.

In the case of SOI and Recruitment series, we might identify the El Niño driving series, SOI, as the input, xt, and yt, the Recruitment series, as the output. In general, there will be more than a single possible input series and we may envision a q × 1 vector of driving series. This multivariate input situation is covered in Chapter 7. The model given by (4.122) is useful under several different scenarios, corresponding to different assumptions that can be made about the components.

We assume that the inputs and outputs have zero means and are jointly stationary with the 2 × 1 vector process (xt, yt)′ having a spectral matrix of the form

f(ω) = [ fxx(ω)  fxy(ω) ]
       [ fyx(ω)  fyy(ω) ].   (4.123)

Here, fxy(ω) is the cross-spectrum relating the input xt to the output yt, and fxx(ω) and fyy(ω) are the spectra of the input and output series, respectively. Generally, we observe two series, regarded as input and output, and search for regression functions {βt} relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form (4.30).

Then, minimizing the mean squared error

MSE = E( yt − ∑_{r=−∞}^{∞} βr xt−r )²   (4.124)

leads to the usual orthogonality conditions

E[ ( yt − ∑_{r=−∞}^{∞} βr xt−r ) xt−s ] = 0   (4.125)

for all s = 0, ±1, ±2, . . .. Taking the expectations inside leads to the normal equations

∑_{r=−∞}^{∞} βr γxx(s − r) = γyx(s)   (4.126)

for s = 0, ±1, ±2, . . .. These equations might be solved, with some effort, if the covariance functions were known exactly. If data (xt, yt) for t = 1, . . . , n are available, we might use a finite approximation to the above equations with the sample autocovariances γ̂xx(h) and γ̂yx(h) substituted into (4.126). If the regression vectors are essentially zero for |s| ≥ M/2, and M < n, the system (4.126) would be of full rank and the solution would involve inverting an (M − 1) × (M − 1) matrix.

A frequency domain approximate solution is easier in this case for two reasons. First, the computations depend on spectra and cross-spectra that can be estimated from sample data using the techniques of §4.6. In addition, no matrices will have to be inverted, although the frequency domain ratio will


have to be computed for each frequency. In order to develop the frequency domain solution, substitute the representation (4.89) into the normal equations, using the convention defined in (4.123). The left side of (4.126) can then be written in the form

∫_{−1/2}^{1/2} ∑_{r=−∞}^{∞} βr e^{2πiω(s−r)} fxx(ω) dω = ∫_{−1/2}^{1/2} e^{2πiωs} B(ω) fxx(ω) dω,

where

B(ω) = ∑_{r=−∞}^{∞} βr e^{−2πiωr}   (4.127)

is the Fourier transform of the regression coefficients βt. Now, because γyx(s) is the inverse transform of the cross-spectrum fyx(ω), we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as

B(ω)fxx(ω) = fyx(ω), (4.128)

which then become the analogs of the usual normal equations. Then, we may take

B̂(ωk) = f̂yx(ωk) / f̂xx(ωk)   (4.129)

as the estimator for the Fourier transform of the regression coefficients, evaluated at some subset of fundamental frequencies ωk = k/M with M << n. Generally, we assume smoothness of B(·) over intervals of the form {ωk + ℓ/n; ℓ = −(L−1)/2, . . . , (L−1)/2}. The inverse transform of the function B̂(ω) would give β̂t, and we note that the discrete time approximation can be taken as

β̂t = M⁻¹ ∑_{k=0}^{M−1} B̂(ωk) e^{2πiωk t}   (4.130)

for t = 0, ±1, ±2, . . . , ±(M/2 − 1). If we were to use (4.130) to define β̂t for |t| ≥ M/2, we would end up with a sequence of coefficients that is periodic with a period of M. In practice we define β̂t = 0 for |t| ≥ M/2 instead. Problem 4.32 explores the error resulting from this approximation.
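To make the procedure concrete, the following rough sketch implements (4.129) and (4.130) directly from the DFTs; it is not the LagReg script used in Example 4.24, the series names x (input) and y (output) are placeholders, and the crude band averaging of periodogram ordinates merely stands in for the smoothed spectral estimates of §4.6.

n = length(x); M = 32                      # M frequencies omega_k = k/M, k = 0,...,M-1
X = fft(x - mean(x)); Y = fft(y - mean(y))
Ixx = Mod(X)^2/n                           # periodogram of the input
Iyx = Y*Conj(X)/n                          # cross-periodogram of output and input
h = floor(n/(2*M))                         # half-width of the averaging band
B = complex(M)
for (k in 1:M){
  j0 = round((k-1)*n/M)                    # Fourier index nearest omega = (k-1)/M
  band = (j0 + (-h:h)) %% n + 1            # ordinates around omega, wrapping the ends
  B[k] = sum(Iyx[band])/sum(Ixx[band]) }   # crude B-hat(omega) as in (4.129)
beta = Re(fft(B, inverse=TRUE))/M          # (4.130); entries 1,2,... are t = 0,1,...,
                                           # and negative t wrap to the end of the vector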

Example 4.24 Lagged Regression for SOI and Recruitment

The high coherence between the SOI and Recruitment series noted in Example 4.18 suggests a lagged regression relation between the two series. A natural direction for the implication in this situation is implied because we feel that the sea surface temperature or SOI should be the input and the Recruitment series should be the output. With this in mind, let xt be the SOI series and yt the Recruitment series.

Although we think naturally of the SOI as the input and the Recruitment as the output, two input-output configurations are of interest. With SOI as the input, the model is



Fig. 4.27. Estimated impulse response functions relating SOI to Recruitment (top) and Recruitment to SOI (bottom); L = 15, M = 32.

yt = ∑_{r=−∞}^{∞} ar xt−r + wt

whereas a model that reverses the two roles would be

xt = ∑_{r=−∞}^{∞} br yt−r + vt,

where wt and vt are white noise processes. Even though there is no plausible environmental explanation for the second of these two models, displaying both possibilities helps to settle on a parsimonious transfer function model.

Based on the script LagReg (see Appendix R, §R.1), the estimated regression or impulse response function for SOI, with M = 32 and L = 15, is

LagReg(soi, rec, L=15, M=32, threshold=6)

lag s beta(s)

[1,] 5 -18.479306

[2,] 6 -12.263296

[3,] 7 -8.539368

[4,] 8 -6.984553

The prediction equation is

rec(t) = alpha + sum_s[ beta(s)*soi(t-s) ], where alpha = 65.97

MSE = 414.08

Note the negative peak at a lag of five points in the top of Figure 4.27; in this case, SOI is the input series. The fall-off after lag five seems to be


approximately exponential and a possible model is

yt = 66− 18.5xt−5 − 12.3xt−6 − 8.5xt−7 − 7xt−8 + wt.

If we examine the inverse relation, namely, a regression model with the Recruitment series yt as the input, the bottom of Figure 4.27 implies a much simpler model,

LagReg(rec, soi, L=15, M=32, inverse=TRUE, threshold=.01)

lag s beta(s)

[1,] 4 0.01593167

[2,] 5 -0.02120013

The prediction equation is

soi(t) = alpha + sum_s[ beta(s)*rec(t+s) ], where alpha = 0.41

MSE = 0.07

depending on only two coefficients, namely,

xt = .41 + .016yt+4 − .02yt+5 + vt.

Multiplying both sides by 50B⁵ and rearranging, we have

(1 − .8B)yt = 20.5 − 50B⁵xt + εt,

where εt is white noise, as our final, parsimonious model.

The example shows we can get a clean estimator for the transfer functions relating the two series if the coherence ρ²xy(ω) is large. The reason is that we can write the minimized mean squared error (4.124) as

MSE = E[ ( yt − ∑_{r=−∞}^{∞} βr xt−r ) yt ] = γyy(0) − ∑_{r=−∞}^{∞} βr γxy(−r),

using the result about the orthogonality of the data and error term in the Projection theorem. Then, substituting the spectral representations of the autocovariance and cross-covariance functions and identifying the Fourier transform (4.127) in the result leads to

MSE = ∫_{−1/2}^{1/2} [fyy(ω) − B(ω)fxy(ω)] dω
    = ∫_{−1/2}^{1/2} fyy(ω)[1 − ρ²yx(ω)] dω,   (4.131)

where ρ²yx(ω) is just the squared coherence given by (4.87). The similarity of (4.131) to the usual mean square error that results from predicting y from x is obvious. In that case, we would have

E(y − βx)² = σ²y(1 − ρ²xy)


for jointly distributed random variables x and y with zero means, variances σ²x and σ²y, and covariance σxy = ρxy σx σy. Because the mean squared error in (4.131) satisfies MSE ≥ 0 with fyy(ω) a non-negative function, it follows that the coherence satisfies

0 ≤ ρ²xy(ω) ≤ 1

for all ω. Furthermore, Problem 4.33 shows the squared coherence is one when the output is linearly related to the input by the filter relation (4.122), and there is no noise, i.e., vt = 0. Hence, the multiple coherence gives a measure of the association or correlation between the input and output series as a function of frequency.

The matter of verifying that the F-distribution claimed for (4.97) will hold when the sample coherence values are substituted for theoretical values still remains. Again, the form of the F-statistic is exactly analogous to the usual t-test for no correlation in a regression context. We give an argument leading to this conclusion later using the results in Appendix C, §C.3. Another question that has not been resolved in this section is the extension to the case of multiple inputs xt1, xt2, . . . , xtq. Often, more than just a single input series is present that can possibly form a lagged predictor of the output series yt. An example is the cardiovascular mortality series that depended on possibly a number of pollution series and temperature. We discuss this particular extension as a part of the multivariate time series techniques considered in Chapter 7.

4.11 Signal Extraction and Optimum Filtering

A model closely related to regression can be developed by assuming again that

y_t = Σ_{r=−∞}^{∞} β_r x_{t−r} + v_t,    (4.132)

but where the βs are known and x_t is some unknown random signal that is uncorrelated with the noise process v_t. In this case, we observe only y_t and are interested in an estimator for the signal x_t of the form

x̂_t = Σ_{r=−∞}^{∞} a_r y_{t−r}.    (4.133)

In the frequency domain, it is convenient to make the additional assumptions that the series x_t and v_t are both mean-zero stationary series with spectra f_{xx}(ω) and f_{vv}(ω), often referred to as the signal spectrum and noise spectrum, respectively. Often, the special case β_t = δ_t, in which δ_t is the Kronecker delta, is of interest because (4.132) reduces to the simple signal plus noise model

y_t = x_t + v_t    (4.134)

in that case. In general, we seek the set of filter coefficients a_t that minimize the mean squared error of estimation, say,

MSE = E[(x_t − Σ_{r=−∞}^{∞} a_r y_{t−r})²].    (4.135)

This problem was originally solved by Kolmogorov (1941) and by Wiener (1949), who derived the result in 1941 and published it in classified reports during World War II.

We can apply the orthogonality principle to write

E[(x_t − Σ_{r=−∞}^{∞} a_r y_{t−r}) y_{t−s}] = 0

for s = 0, ±1, ±2, . . ., which leads to

Σ_{r=−∞}^{∞} a_r γ_{yy}(s − r) = γ_{xy}(s),

to be solved for the filter coefficients. Substituting the spectral representations for the autocovariance functions into the above and identifying the spectral densities through the uniqueness of the Fourier transform produces

A(ω) f_{yy}(ω) = f_{xy}(ω),    (4.136)

where A(ω) and the optimal filter a_t are Fourier transform pairs, as are B(ω) and β_t. Now, a special consequence of the model is that (see Problem 4.23)

f_{xy}(ω) = B(ω) f_{xx}(ω)    (4.137)

and

f_{yy}(ω) = |B(ω)|² f_{xx}(ω) + f_{vv}(ω),    (4.138)

implying the optimal filter would be the Fourier transform of

A(ω) = B(ω) / (|B(ω)|² + f_{vv}(ω)/f_{xx}(ω)),    (4.139)

where the second term in the denominator is just the inverse of the signal-to-noise ratio, say,

SNR(ω) = f_{xx}(ω)/f_{vv}(ω).    (4.140)

The result shows the optimum filters can be computed for this model if the signal and noise spectra are both known or if we can assume knowledge of the signal-to-noise ratio SNR(ω) as a function of frequency. In Chapter 7, we show some methods for estimating these two parameters in conjunction with random effects analysis of variance models, but we assume here that it is possible to specify the signal-to-noise ratio a priori. If the signal-to-noise ratio is known, the optimal filter can be computed by the inverse transform of the function A(ω). It is more likely that the inverse transform will be intractable and a finite filter approximation like that used in the previous section can be applied to the data. In this case, we will have

a^M_t = M^{−1} Σ_{k=0}^{M−1} A(ω_k) e^{2πiω_k t}    (4.141)

as the estimated filter function. It will often be the case that the form of the specified frequency response will have some rather sharp transitions between regions where the signal-to-noise ratio is high and regions where there is little signal. In these cases, the shape of the frequency response function will have ripples that can introduce frequencies at different amplitudes. An aesthetic solution to this problem is to introduce tapering as was done with spectral estimation in (4.61)–(4.68). We use below the tapered filter ã_t = h_t a_t where h_t is the cosine taper given in (4.68). The squared frequency response of the resulting filter will be |Ã(ω)|², where

Ã(ω) = Σ_{t=−∞}^{∞} a_t h_t e^{−2πiωt}.    (4.142)
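
As a small illustration, the following is a minimal sketch (it is not the SigExtract script used in the example below) of how the finite filter (4.141) and its tapered response (4.142) might be computed in R. The ideal low-pass response with cutoff .05 and the centered form of the cosine taper are our simplifying assumptions.

# A minimal sketch, not the SigExtract script: coefficients (4.141) and the
# tapered frequency response (4.142) for an assumed ideal low-pass response
# with cutoff .05 cycles per point; the taper form is an assumption.
M = 64
k = 0:(M-1); omega.k = k/M
A = ifelse(pmin(omega.k, 1 - omega.k) <= .05, 1, 0)              # specified response A(omega_k)
tt = -(M/2):(M/2 - 1)                                            # filter lags t
a = sapply(tt, function(t) Re(sum(A*exp(2i*pi*omega.k*t)))/M)    # coefficients (4.141)
h = .5*(1 + cos(2*pi*tt/M))                                      # cosine taper centered at lag 0 (assumed)
at = a*h                                                         # tapered coefficients a_t h_t
freq = seq(0, .5, by = .001)
A.tap = sapply(freq, function(w) sum(at*exp(-2i*pi*w*tt)))       # attained response (4.142)
par(mfrow = c(2,1))
plot(tt, at, type = "h", xlab = "s", ylab = "a(s)")
plot(freq, abs(A.tap)^2, type = "l", xlab = "frequency", ylab = "squared response")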

The results are illustrated in the following example that extracts the El Nino component of the sea surface temperature series.

Example 4.25 Estimating the El Nino Signal via Optimal Filters

Figure 4.5 shows the spectrum of the SOI series, and we note that essentially two components have power, the El Nino frequency of about .02 cycles per month (the four-year cycle) and a yearly frequency of about .08 cycles per month (the annual cycle). We assume, for this example, that we wish to preserve the lower frequency as signal and to eliminate the higher order frequencies, and in particular, the annual cycle. In this case, we assume the simple signal plus noise model

y_t = x_t + v_t,

so that there is no convolving function β_t. Furthermore, the signal-to-noise ratio is assumed to be high to about .06 cycles per month and zero thereafter. The optimal frequency response was assumed to be unity to .05 cycles per point and then to decay linearly to zero in several steps. Figure 4.28 shows the coefficients as specified by (4.141) with M = 64, as well as the frequency response function, given by (4.142), of the cosine tapered coefficients; recall Figure 4.9, where we demonstrated the need for tapering to avoid severe ripples in the window. The constructed response function is compared to the ideal window in Figure 4.28.

[Fig. 4.28. Filter coefficients (top) and frequency response functions (bottom) for designed SOI filters.]

Figure 4.29 shows the original and filtered SOI index, and we see a smooth extracted signal that conveys the essence of the underlying El Nino signal. The frequency response of the designed filter can be compared with that of the symmetric 12-month moving average applied to the same series in Example 4.19. The filtered series, shown in Figure 4.14, shows a good deal of higher frequency chatter riding on the smoothed version, which has been introduced by the higher frequencies that leak through in the squared frequency response, as in Figure 4.16.

The analysis can be replicated using the script SigExtract; see Appendix R, §R.1, for details.

1 SigExtract(soi, L=9, M=64, max.freq=.05)

[Fig. 4.29. Original SOI series (top) compared to filtered version showing the estimated El Nino temperature signal (bottom).]

The design of finite filters with a specified frequency response requires some experimentation with various target frequency response functions and we have only touched on the methodology here. The filter designed here, sometimes called a low-pass filter, reduces the high frequencies and keeps or passes the low frequencies. Alternately, we could design a high-pass filter to keep high frequencies if that is where the signal is located. An example of a simple high-pass filter is the first difference with a frequency response that is shown in Figure 4.16. We can also design band-pass filters that keep frequencies in specified bands. For example, seasonal adjustment filters are often used in economics to reject seasonal frequencies while keeping both high frequencies, lower frequencies, and trend (see, for example, Grether and Nerlove, 1970).

The filters we have discussed here are all symmetric two-sided filters, because the designed frequency response functions were purely real. Alternatively, we may design recursive filters to produce a desired response. An example of a recursive filter is one that replaces the input x_t by the filtered output

y_t = Σ_{k=1}^{p} φ_k y_{t−k} + x_t − Σ_{k=1}^{q} θ_k x_{t−k}.    (4.143)

Note the similarity between (4.143) and the ARMA(p, q) model, in which the white noise component is replaced by the input. Transposing the terms involving y_t and using the basic linear filter result in Property 4.7 leads to

f_y(ω) = (|θ(e^{−2πiω})|² / |φ(e^{−2πiω})|²) f_x(ω),    (4.144)

where

φ(e^{−2πiω}) = 1 − Σ_{k=1}^{p} φ_k e^{−2πikω}

and

θ(e^{−2πiω}) = 1 − Σ_{k=1}^{q} θ_k e^{−2πikω}.

Recursive filters such as those given by (4.144) distort the phases of arriving frequencies, and we do not consider the problem of designing such filters in any detail.
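
As a small illustration (not from the text), the squared frequency response in (4.144) can be evaluated directly on a grid of frequencies; the coefficients below are arbitrary illustrative values with p = 1 and q = 1.

# A minimal sketch: squared frequency response (4.144) of a recursive filter;
# phi = .9 and theta = .5 are arbitrary illustrative values.
phi = .9; theta = .5
w = seq(0, .5, by = .001)
z = exp(-2i*pi*w)
resp = Mod(1 - theta*z)^2 / Mod(1 - phi*z)^2   # |theta(e^{-2pi i w})|^2 / |phi(e^{-2pi i w})|^2
plot(w, resp, type = "l", xlab = "frequency", ylab = "squared frequency response")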

4.12 Spectral Analysis of Multidimensional Series

Multidimensional series of the form x_s, where s = (s₁, s₂, . . . , s_r)′ is an r-dimensional vector of spatial coordinates or a combination of space and time coordinates, were introduced in §1.7. The example given there, shown in Figure 1.15, was a collection of temperature measurements taken on a rectangular field. These data would form a two-dimensional process, indexed by row and column in space. In that section, the multidimensional autocovariance function of an r-dimensional stationary series was given as γ_x(h) = E[x_{s+h} x_s], where the multidimensional lag vector is h = (h₁, h₂, . . . , h_r)′. The multidimensional wavenumber spectrum is given as the Fourier transform of the autocovariance, namely,

f_x(ω) = Σ_h γ_x(h) e^{−2πiω′h}.    (4.145)

Again, the inverse result

γ_x(h) = ∫_{−1/2}^{1/2} f_x(ω) e^{2πiω′h} dω    (4.146)

holds, where the integral is over the multidimensional range of the vector ω. The wavenumber argument is exactly analogous to the frequency argument, and we have the corresponding intuitive interpretation as the cycling rate ω_i per distance traveled s_i in the i-th direction.

Two-dimensional processes occur often in practical applications, and the representations above reduce to

f_x(ω₁, ω₂) = Σ_{h₁=−∞}^{∞} Σ_{h₂=−∞}^{∞} γ_x(h₁, h₂) e^{−2πi(ω₁h₁+ω₂h₂)}    (4.147)

and

γ_x(h₁, h₂) = ∫_{−1/2}^{1/2} ∫_{−1/2}^{1/2} f_x(ω₁, ω₂) e^{2πi(ω₁h₁+ω₂h₂)} dω₁ dω₂    (4.148)

in the case r = 2. The notion of linear filtering generalizes easily to the two-dimensional case by defining the impulse response function a_{s₁,s₂} and the spatial filter output as

y_{s₁,s₂} = Σ_{u₁} Σ_{u₂} a_{u₁,u₂} x_{s₁−u₁, s₂−u₂}.    (4.149)

The spectrum of the output of this filter can be derived as

f_y(ω₁, ω₂) = |A(ω₁, ω₂)|² f_x(ω₁, ω₂),    (4.150)

where

A(ω₁, ω₂) = Σ_{u₁} Σ_{u₂} a_{u₁,u₂} e^{−2πi(ω₁u₁+ω₂u₂)}.    (4.151)

These results are analogous to those in the one-dimensional case, described by Property 4.7.

The multidimensional DFT is also a straightforward generalization of the univariate expression. In the two-dimensional case with data on a rectangular grid, {x_{s₁,s₂}; s₁ = 1, ..., n₁, s₂ = 1, ..., n₂}, we will write, for −1/2 ≤ ω₁, ω₂ ≤ 1/2,

d(ω₁, ω₂) = (n₁n₂)^{−1/2} Σ_{s₁=1}^{n₁} Σ_{s₂=1}^{n₂} x_{s₁,s₂} e^{−2πi(ω₁s₁+ω₂s₂)}    (4.152)

as the two-dimensional DFT, where the frequencies ω₁, ω₂ are evaluated at multiples of (1/n₁, 1/n₂) on the spatial frequency scale. The two-dimensional wavenumber spectrum can be estimated by the smoothed sample wavenumber spectrum

f̂_x(ω₁, ω₂) = (L₁L₂)^{−1} Σ_{ℓ₁,ℓ₂} |d(ω₁ + ℓ₁/n₁, ω₂ + ℓ₂/n₂)|²,    (4.153)

where the sum is taken over the grid {−m_j ≤ ℓ_j ≤ m_j; j = 1, 2}, where L₁ = 2m₁ + 1 and L₂ = 2m₂ + 1. The statistic

2L₁L₂ f̂_x(ω₁, ω₂) / f_x(ω₁, ω₂) ·∼ χ²_{2L₁L₂}    (4.154)

can be used to set confidence intervals or make approximate tests against a fixed assumed spectrum f₀(ω₁, ω₂). We may also extend this analysis to weighted estimation and window estimation as discussed in §4.5.
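
For example, a rough sketch (not from the text) of an approximate 95% interval based on (4.154) is shown below; the smoothing spans and the ordinate value are hypothetical.

# A minimal sketch based on (4.154): approximate 95% confidence interval for the
# wavenumber spectrum at one (omega1, omega2); fhat = 10 is a hypothetical value
# of the smoothed estimate obtained with L1 = L2 = 3 in (4.153).
L1 = 3; L2 = 3; df = 2*L1*L2
fhat = 10
c(lower = df*fhat/qchisq(.975, df), upper = df*fhat/qchisq(.025, df))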

Example 4.26 Soil Surface Temperatures

As an example, consider the periodogram of the two-dimensional temperature series shown in Figure 1.15 and analyzed by Bazza et al. (1988). We recall the spatial coordinates in this case will be (s₁, s₂), which define the spatial coordinates rows and columns so that the frequencies in the two directions will be expressed as cycles per row and cycles per column. Figure 4.30 shows the periodogram of the two-dimensional temperature series, and we note the ridge of strong spectral peaks running over rows at a column frequency of zero. An obvious periodic component appears at frequencies of .0625 and −.0625 cycles per row, which corresponds to 16 rows or about 272 ft. On further investigation of previous irrigation patterns over this field, treatment levels of salt varied periodically over columns. This analysis is extended in Problem 4.17, where we recover the salt treatment profile over rows and compare it to a signal, computed by averaging over columns.

[Fig. 4.30. Two-dimensional periodogram of soil temperature profile showing a peak at .0625 cycles/row. The period is 16 rows, and this corresponds to 16 × 17 ft = 272 ft.]

Figure 4.30 may be reproduced in R as follows. In the code for this example, the periodogram is computed in one step as per; the rest of the code is simply manipulation to obtain a nice graphic.

1 per = abs(fft(soiltemp-mean(soiltemp))/sqrt(64*36))^2

2 per2 = cbind(per[1:32,18:2], per[1:32,1:18])

3 per3 = rbind(per2[32:2,],per2)

4 par(mar=c(1,2.5,0,0)+.1)

5 persp(-31:31/64, -17:17/36, per3, phi=30, theta=30, expand=.6,

ticktype="detailed", xlab="cycles/row", ylab="cycles/column",

zlab="Periodogram Ordinate")


Another application of two-dimensional spectral analysis of agricultural field trials is given in McBratney and Webster (1981), who used it to detect ridge and furrow patterns in yields. The requirement for regular, equally spaced samples on fairly large grids has tended to limit enthusiasm for strict two-dimensional spectral analysis. An exception is when a propagating signal from a given velocity and azimuth is present so predicting the wavenumber spectrum as a function of velocity and azimuth becomes feasible (see Shumway et al., 1999).

Problems

Section 4.2

4.1 Repeat the simulations and analyses in Examples 4.1 and 4.2 with the following changes:

(a) Change the sample size to n = 128 and generate and plot the same series as in Example 4.1:

x_{t1} = 2 cos(2π .06 t) + 3 sin(2π .06 t),
x_{t2} = 4 cos(2π .10 t) + 5 sin(2π .10 t),
x_{t3} = 6 cos(2π .40 t) + 7 sin(2π .40 t),
x_t = x_{t1} + x_{t2} + x_{t3}.

What is the major difference between these series and the series generated in Example 4.1? (Hint: The answer is fundamental. But if your answer is the series are longer, you may be punished severely.)

(b) As in Example 4.2, compute and plot the periodogram of the series, x_t, generated in (a) and comment.
(c) Repeat the analyses of (a) and (b) but with n = 100 (as in Example 4.1), and adding noise to x_t; that is

x_t = x_{t1} + x_{t2} + x_{t3} + w_t

where w_t ∼ iid N(0, 25). That is, you should simulate and plot the data, and then plot the periodogram of x_t and comment.

4.2 With reference to equations (4.1) and (4.2), let Z₁ = U₁ and Z₂ = −U₂ be independent, standard normal variables. Consider the polar coordinates of the point (Z₁, Z₂), that is,

A² = Z₁² + Z₂²   and   φ = tan^{−1}(Z₂/Z₁).

(a) Find the joint density of A² and φ, and from the result, conclude that A² and φ are independent random variables, where A² is a chi-squared random variable with 2 df, and φ is uniformly distributed on (−π, π).
(b) Going in reverse from polar coordinates to rectangular coordinates, suppose we assume that A² and φ are independent random variables, where A² is chi-squared with 2 df, and φ is uniformly distributed on (−π, π). With Z₁ = A cos(φ) and Z₂ = A sin(φ), where A is the positive square root of A², show that Z₁ and Z₂ are independent, standard normal random variables.

4.3 Verify (4.4).

Section 4.3

4.4 A time series was generated by first drawing the white noise series w_t from a normal distribution with mean zero and variance one. The observed series x_t was generated from

x_t = w_t − θw_{t−1},   t = 0, ±1, ±2, . . . ,

where θ is a parameter.

(a) Derive the theoretical mean value and autocovariance functions for the series x_t and w_t. Are the series x_t and w_t stationary? Give your reasons.
(b) Give a formula for the power spectrum of x_t, expressed in terms of θ and ω.

4.5 A first-order autoregressive model is generated from the white noise series w_t using the generating equations

x_t = φx_{t−1} + w_t,

where φ, for |φ| < 1, is a parameter and the w_t are independent random variables with mean zero and variance σ²_w.

(a) Show that the power spectrum of x_t is given by

f_x(ω) = σ²_w / (1 + φ² − 2φ cos(2πω)).

(b) Verify the autocovariance function of this process is

γ_x(h) = σ²_w φ^{|h|} / (1 − φ²),

h = 0, ±1, ±2, . . ., by showing that the inverse transform of γ_x(h) is the spectrum derived in part (a).

4.6 In applications, we will often observe series containing a signal that has been delayed by some unknown time D, i.e.,

x_t = s_t + As_{t−D} + n_t,

where s_t and n_t are stationary and independent with zero means and spectral densities f_s(ω) and f_n(ω), respectively. The delayed signal is multiplied by some unknown constant A.

[Fig. 4.31. Smoothed 12-month sunspot numbers (sunspotz) sampled twice per year.]

(a) Prove

f_x(ω) = [1 + A² + 2A cos(2πωD)] f_s(ω) + f_n(ω).

(b) How could the periodicity expected in the spectrum derived in (a) be used to estimate the delay D? (Hint: Consider the case where f_n(ω) = 0; i.e., there is no noise.)

4.7 Suppose x_t and y_t are stationary zero-mean time series with x_t independent of y_s for all s and t. Consider the product series

z_t = x_t y_t.

Prove the spectral density for z_t can be written as

f_z(ω) = ∫_{−1/2}^{1/2} f_x(ω − ν) f_y(ν) dν.

Section 4.4

4.8 Figure 4.31 shows the biyearly smoothed (12-month moving average) number of sunspots from June 1749 to December 1978 with n = 459 points that were taken twice per year; the data are contained in sunspotz. With Example 4.10 as a guide, perform a periodogram analysis identifying the predominant periods and obtaining confidence intervals for the identified periods. Interpret your findings.

4.9 The levels of salt concentration known to have occurred over rows, corresponding to the average temperature levels for the soil science data considered in Figures 1.15 and 1.16, are in salt and saltemp. Plot the series and then identify the dominant frequencies by performing separate spectral analyses on the two series. Include confidence intervals for the dominant frequencies and interpret your findings.

4.10 Let the observed series x_t be composed of a periodic signal and noise so it can be written as

x_t = β₁ cos(2πω_k t) + β₂ sin(2πω_k t) + w_t,

where w_t is a white noise process with variance σ²_w. The frequency ω_k is assumed to be known and of the form k/n in this problem. Suppose we consider estimating β₁, β₂ and σ²_w by least squares, or equivalently, by maximum likelihood if the w_t are assumed to be Gaussian.

(a) Prove, for a fixed ω_k, the minimum squared error is attained by

(β̂₁, β̂₂)′ = 2n^{−1/2} (d_c(ω_k), d_s(ω_k))′,

where the cosine and sine transforms (4.23) and (4.24) appear on the right-hand side.
(b) Prove that the error sum of squares can be written as

SSE = Σ_{t=1}^{n} x_t² − 2I_x(ω_k)

so that the value of ω_k that minimizes squared error is the same as the value that maximizes the periodogram I_x(ω_k) estimator (4.20).
(c) Under the Gaussian assumption and fixed ω_k, show that the F-test of no regression leads to an F-statistic that is a monotone function of I_x(ω_k).

4.11 Prove the convolution property of the DFT, namely,

Σ_{s=1}^{n} a_s x_{t−s} = Σ_{k=0}^{n−1} d_A(ω_k) d_x(ω_k) exp{2πiω_k t},

for t = 1, 2, . . . , n, where d_A(ω_k) and d_x(ω_k) are the discrete Fourier transforms of a_t and x_t, respectively, and we assume that x_t = x_{t+n} is periodic.

Section 4.5

4.12 Repeat Problem 4.8 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.13 Repeat Problem 4.9 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.14 The periodic behavior of a time series induced by echoes can also be observed in the spectrum of the series; this fact can be seen from the results stated in Problem 4.6(a). Using the notation of that problem, suppose we observe x_t = s_t + As_{t−D} + n_t, which implies the spectra satisfy f_x(ω) = [1 + A² + 2A cos(2πωD)] f_s(ω) + f_n(ω). If the noise is negligible (f_n(ω) ≈ 0) then log f_x(ω) is approximately the sum of a periodic component, log[1 + A² + 2A cos(2πωD)], and log f_s(ω). Bogart et al. (1962) proposed treating the detrended log spectrum as a pseudo time series and calculating its spectrum, or cepstrum, which should show a peak at a quefrency corresponding to 1/D. The cepstrum can be plotted as a function of quefrency, from which the delay D can be estimated.

For the speech series presented in Example 1.3, estimate the pitch period using cepstral analysis as follows. The data are in speech.

(a) Calculate and display the log-periodogram of the data. Is the periodogram periodic, as predicted?
(b) Perform a cepstral (spectral) analysis on the detrended logged periodogram, and use the results to estimate the delay D. How does your answer compare with the analysis of Example 1.24, which was based on the ACF?

4.15 Use Property 4.2 to verify (4.63). Then verify (4.66) and (4.67).

4.16 Consider two time series

x_t = w_t − w_{t−1},
y_t = ½(w_t + w_{t−1}),

formed from the white noise series w_t with variance σ²_w = 1.

(a) Are x_t and y_t jointly stationary? Recall the cross-covariance function must also be a function only of the lag h and cannot depend on time.
(b) Compute the spectra f_y(ω) and f_x(ω), and comment on the difference between the two results.
(c) Suppose sample spectral estimators f̂_y(.10) are computed for the series using L = 3. Find a and b such that

P{a ≤ f̂_y(.10) ≤ b} = .90.

This expression gives two points that will contain 90% of the sample spectral values. Put 5% of the area in each tail.


Section 4.6

4.17 Analyze the coherency between the temperature and salt data discussed in Problem 4.9. Discuss your findings.

4.18 Consider two processes

x_t = w_t   and   y_t = φx_{t−D} + v_t,

where w_t and v_t are independent white noise processes with common variance σ², φ is a constant, and D is a fixed integer delay.

(a) Compute the coherency between x_t and y_t.
(b) Simulate n = 1024 normal observations from x_t and y_t for φ = .9, σ² = 1, and D = 0. Then estimate and plot the coherency between the simulated series for the following values of L and comment:
(i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

Section 4.7

4.19 For the processes in Problem 4.18:

(a) Compute the phase between x_t and y_t.
(b) Simulate n = 1024 observations from x_t and y_t for φ = .9, σ² = 1, and D = 1. Then estimate and plot the phase between the simulated series for the following values of L and comment:
(i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

4.20 Consider the bivariate time series records containing monthly U.S. production as measured by the Federal Reserve Board Production Index and monthly unemployment as given in Figure 3.21.

(a) Compute the spectrum and the log spectrum for each series, and identify statistically significant peaks. Explain what might be generating the peaks. Compute the coherence, and explain what is meant when a high coherence is observed at a particular frequency.
(b) What would be the effect of applying the filter

u_t = x_t − x_{t−1}   followed by   v_t = u_t − u_{t−12}

to the series given above? Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference.
(c) Apply the filters successively to one of the two series and plot the output. Examine the output after taking a first difference and comment on whether stationarity is a reasonable assumption. Why or why not? Plot after taking the seasonal difference of the first difference. What can be noticed about the output that is consistent with what you have predicted from the frequency response? Verify by computing the spectrum of the output after filtering.


4.21 Determine the theoretical power spectrum of the series formed by combining the white noise series w_t to form

y_t = w_{t−2} + 4w_{t−1} + 6w_t + 4w_{t+1} + w_{t+2}.

Determine which frequencies are present by plotting the power spectrum.

4.22 Let x_t = cos(2πωt), and consider the output

y_t = Σ_{k=−∞}^{∞} a_k x_{t−k},

where Σ_k |a_k| < ∞. Show

y_t = |A(ω)| cos(2πωt + φ(ω)),

where |A(ω)| and φ(ω) are the amplitude and phase of the filter, respectively. Interpret the result in terms of the relationship between the input series, x_t, and the output series, y_t.

4.23 Suppose x_t is a stationary series, and we apply two filtering operations in succession, say,

y_t = Σ_r a_r x_{t−r}   then   z_t = Σ_s b_s y_{t−s}.

(a) Show the spectrum of the output is

f_z(ω) = |A(ω)|² |B(ω)|² f_x(ω),

where A(ω) and B(ω) are the Fourier transforms of the filter sequences a_t and b_t, respectively.
(b) What would be the effect of applying the filter

u_t = x_t − x_{t−1}   followed by   v_t = u_t − u_{t−12}

to a time series?
(c) Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference. Filters like these are called seasonal adjustment filters in economics because they tend to attenuate frequencies at multiples of the monthly periods. The difference filter tends to attenuate low-frequency trends.

4.24 Suppose we are given a stationary zero-mean series x_t with spectrum f_x(ω) and then construct the derived series

y_t = ay_{t−1} + x_t,   t = ±1, ±2, ... .

(a) Show how the theoretical f_y(ω) is related to f_x(ω).
(b) Plot the function that multiplies f_x(ω) in part (a) for a = .1 and for a = .8. This filter is called a recursive filter.


Section 4.8

4.25 Often, the periodicities in the sunspot series are investigated by fitting an autoregressive spectrum of sufficiently high order. The main periodicity is often stated to be in the neighborhood of 11 years. Fit an autoregressive spectral estimator to the sunspot data using a model selection method of your choice. Compare the result with a conventional nonparametric spectral estimator found in Problem 4.8.

4.26 Fit an autoregressive spectral estimator to the Recruitment series and compare it to the results of Example 4.13.

4.27 Suppose a sample time series with n = 256 points is available from the first-order autoregressive model. Furthermore, suppose a sample spectrum computed with L = 3 yields the estimated value f̂_x(1/8) = 2.25. Is this sample value consistent with σ²_w = 1, φ = .5? Repeat using L = 11 if we just happen to obtain the same sample value.

4.28 Suppose we wish to test the noise alone hypothesis H₀: x_t = n_t against the signal-plus-noise hypothesis H₁: x_t = s_t + n_t, where s_t and n_t are uncorrelated zero-mean stationary processes with spectra f_s(ω) and f_n(ω). Suppose that we want the test over a band of L = 2m + 1 frequencies of the form ω_{j:n} + k/n, for k = 0, ±1, ±2, . . . , ±m near some fixed frequency ω. Assume that both the signal and noise spectra are approximately constant over the interval.

(a) Prove the approximate likelihood-based test statistic for testing H₀ against H₁ is proportional to

T = Σ_k |d_x(ω_{j:n} + k/n)|² (1/f_n(ω) − 1/[f_s(ω) + f_n(ω)]).

(b) Find the approximate distributions of T under H₀ and H₁.
(c) Define the false alarm and signal detection probabilities as P_F = P{T > K | H₀} and P_d = P{T > K | H₁}, respectively. Express these probabilities in terms of the signal-to-noise ratio f_s(ω)/f_n(ω) and appropriate chi-squared integrals.

Section 4.9

4.29 Repeat the dynamic Fourier analysis of Example 4.21 on the remaining seven earthquakes and seven explosions in the data file eqexp. Do the conclusions about the difference between earthquakes and explosions stated in the example still seem valid?

4.30 Repeat the wavelet analyses of Examples 4.22 and 4.23 on all earthquake and explosion series in the data file eqexp. Do the conclusions about the difference between earthquakes and explosions stated in Examples 4.22 and 4.23 still seem valid?

4.31 Using Examples 4.21–4.23 as a guide, perform a dynamic Fourier analysis and wavelet analyses (dwt and waveshrink analysis) on the event of unknown origin that took place near the Russian nuclear test facility in Novaya Zemlya. State your conclusion about the nature of the event at Novaya Zemlya.

Section 4.10

4.32 Consider the problem of approximating the filter output

y_t = Σ_{k=−∞}^{∞} a_k x_{t−k},   Σ_{k=−∞}^{∞} |a_k| < ∞,

by

y_t^M = Σ_{|k|<M/2} a_k^M x_{t−k}

for t = M/2 − 1, M/2, . . . , n − M/2, where x_t is available for t = 1, . . . , n and

a_t^M = M^{−1} Σ_{k=0}^{M−1} A(ω_k) exp{2πiω_k t}

with ω_k = k/M. Prove

E{(y_t − y_t^M)²} ≤ 4γ_x(0) (Σ_{|k|≥M/2} |a_k|)².

4.33 Prove the squared coherence ρ²_{y·x}(ω) = 1 for all ω when

y_t = Σ_{r=−∞}^{∞} a_r x_{t−r},

that is, when x_t and y_t can be related exactly by a linear filter.

4.34 The data set climhyd contains 454 months of measured values for six climatic variables: (i) air temperature [Temp], (ii) dew point [DewPt], (iii) cloud cover [CldCvr], (iv) wind speed [WndSpd], (v) precipitation [Precip], and (vi) inflow [Inflow], at Lake Shasta in California; the data are displayed in Figure 7.3. We would like to look at possible relations among the weather factors and between the weather factors and the inflow to Lake Shasta.

(a) First transform the inflow and precipitation series as follows: I_t = log i_t, where i_t is inflow, and P_t = √p_t, where p_t is precipitation. Then, compute the square coherencies between all the weather variables and transformed inflow and argue that the strongest determinant of the inflow series is (transformed) precipitation. [Tip: If x contains multiple time series, then the easiest way to display all the squared coherencies is to first make an object of class spec; e.g., u = spectrum(x, span=c(7,7), plot=FALSE) and then plot the coherencies suppressing the confidence intervals, plot(u, ci=-1, plot.type="coh").]
(b) Fit a lagged regression model of the form

I_t = β₀ + Σ_{j=0}^{∞} β_j P_{t−j} + w_t,

using thresholding, and then comment on the predictive ability of precipitation for inflow.

Section 4.11

4.35 Consider the signal plus noise model

y_t = Σ_{r=−∞}^{∞} β_r x_{t−r} + v_t,

where the signal and noise series, x_t and v_t, are both stationary with spectra f_x(ω) and f_v(ω), respectively. Assuming that x_t and v_t are independent of each other for all t, verify (4.137) and (4.138).

4.36 Consider the model

y_t = x_t + v_t,

where

x_t = φx_{t−1} + w_t,

such that v_t is Gaussian white noise and independent of x_t with var(v_t) = σ²_v, and w_t is Gaussian white noise and independent of v_t, with var(w_t) = σ²_w, and |φ| < 1 and Ex₀ = 0. Prove that the spectrum of the observed series y_t is

f_y(ω) = σ²|1 − θe^{−2πiω}|² / |1 − φe^{−2πiω}|²,

where

θ = (c ± √(c² − 4))/2,   σ² = σ²_v φ/θ,

and

c = [σ²_w + σ²_v(1 + φ²)] / (σ²_v φ).

4.37 Consider the same model as in the preceding problem.

(a) Prove the optimal smoothed estimator of the form

x̂_t = Σ_{s=−∞}^{∞} a_s y_{t−s}

has

a_s = (σ²_w/σ²) θ^{|s|}/(1 − θ²).

(b) Show the mean square error is given by

E{(x_t − x̂_t)²} = σ²_v σ²_w / [σ²(1 − θ²)].

(c) Compare the mean square error of the estimator in part (b) with that of the optimal finite estimator of the form

x̂_t = a₁y_{t−1} + a₂y_{t−2}

when σ²_v = .053, σ²_w = .172, and φ₁ = .9.

Section 4.12

4.38 Consider the two-dimensional linear filter given as the output (4.149).

(a) Express the two-dimensional autocovariance function of the output, say, γ_y(h₁, h₂), in terms of an infinite sum involving the autocovariance function of x_s and the filter coefficients a_{s₁,s₂}.
(b) Use the expression derived in (a), combined with (4.148) and (4.151), to derive the spectrum of the filtered output (4.150).

The following problems require supplemental material from Appendix C

4.39 Let w_t be a Gaussian white noise series with variance σ²_w. Prove that the results of Theorem C.4 hold without error for the DFT of w_t.

4.40 Show that condition (4.40) implies (C.19) by showing

n^{−1/2} Σ_{h≥0} h|γ(h)| ≤ σ²_w Σ_{k≥0} |ψ_k| Σ_{j≥0} √j |ψ_j|.

4.41 Prove Lemma C.4.

4.42 Finish the proof of Theorem C.5.

4.43 For the zero-mean complex random vector z = x_c − ix_s, with cov(z) = Σ = C − iQ, with Σ = Σ*, define

w = 2Re(a*z),

where a = a_c − ia_s is an arbitrary non-zero complex vector. Prove

cov(w) = 2a*Σa.

Recall * denotes the complex conjugate transpose.


5

Additional Time Domain Topics

5.1 Introduction

In this chapter, we present material that may be considered special or advanced topics in the time domain. Chapter 6 is devoted to one of the most useful and interesting time domain topics, state-space models. Consequently, we do not cover state-space models or related topics—of which there are many—in this chapter. This chapter consists of sections of independent topics that may be read in any order. Most of the sections depend on a basic knowledge of ARMA models, forecasting and estimation, which is the material that is covered in Chapter 3, §3.1–§3.8. A few sections, for example the section on long memory models, require some knowledge of spectral analysis and related topics covered in Chapter 4. In addition to long memory, we discuss unit root testing, GARCH models, threshold models, regression with autocorrelated errors, lagged regression or transfer functions, and selected topics in multivariate ARMAX models.

5.2 Long Memory ARMA and Fractional Differencing

The conventional ARMA(p, q) process is often referred to as a short-memory process because the coefficients in the representation

x_t = Σ_{j=0}^{∞} ψ_j w_{t−j},

obtained by solving

φ(z)ψ(z) = θ(z),

are dominated by exponential decay. As pointed out in §3.3, this result implies the ACF of the short memory process ρ(h) → 0 exponentially fast as h → ∞. When the sample ACF of a time series decays slowly, the advice given in


Chapter 3 has been to difference the series until it seems stationary. Following this advice with the glacial varve series first presented in Example 3.32 leads to the first difference of the logarithms of the data being represented as a first-order moving average. In Example 3.40, further analysis of the residuals leads to fitting an ARIMA(1, 1, 1) model,

∇x_t = φ∇x_{t−1} + w_t + θw_{t−1},

where we understand x_t is the log-transformed varve series. In particular, the estimates of the parameters (and the standard errors) were φ̂ = .23 (.05), θ̂ = −.89 (.03), and σ̂²_w = .23. The use of the first difference ∇x_t = (1 − B)x_t can be too severe a modification in the sense that the nonstationary model might represent an overdifferencing of the original process.

Long memory (or persistent) time series were considered in Hosking (1981) and Granger and Joyeux (1980) as intermediate compromises between the short memory ARMA type models and the fully integrated nonstationary processes in the Box–Jenkins class. The easiest way to generate a long memory series is to think of using the difference operator (1 − B)^d for fractional values of d, say, 0 < d < .5, so a basic long memory series gets generated as

(1 − B)^d x_t = w_t,    (5.1)

where w_t still denotes white noise with variance σ²_w. The fractionally differenced series (5.1), for |d| < .5, is often called fractional noise (except when d is zero). Now, d becomes a parameter to be estimated along with σ²_w. Differencing the original process, as in the Box–Jenkins approach, may be thought of as simply assigning a value of d = 1. This idea has been extended to the class of fractionally integrated ARMA, or ARFIMA models, where −.5 < d < .5; when d is negative, the term antipersistent is used. Long memory processes occur in hydrology (see Hurst, 1951, and McLeod and Hipel, 1978) and in environmental series, such as the varve data we have previously analyzed, to mention a few examples. Long memory time series data tend to exhibit sample autocorrelations that are not necessarily large (as in the case of d = 1), but persist for a long time. Figure 5.1 shows the sample ACF, to lag 100, of the log-transformed varve series, which exhibits classic long memory behavior:

1 u = acf(log(varve), 100, plot=FALSE)

2 plot(u[1:100], ylim=c(-.1,1), main="log(varve)") # get rid of lag 0

To investigate its properties, we can use the binomial expansion (d > −1) to write

w_t = (1 − B)^d x_t = Σ_{j=0}^{∞} π_j B^j x_t = Σ_{j=0}^{∞} π_j x_{t−j}    (5.2)

where

π_j = Γ(j − d) / [Γ(j + 1)Γ(−d)]    (5.3)

[Fig. 5.1. Sample ACF of the log transformed varve series.]

with Γ(x + 1) = xΓ(x) being the gamma function. Similarly (d < 1), we can write

x_t = (1 − B)^{−d} w_t = Σ_{j=0}^{∞} ψ_j B^j w_t = Σ_{j=0}^{∞} ψ_j w_{t−j}    (5.4)

where

ψ_j = Γ(j + d) / [Γ(j + 1)Γ(d)].    (5.5)

When |d| < .5, the processes (5.2) and (5.4) are well-defined stationary processes (see Brockwell and Davis, 1991, for details). In the case of fractional differencing, however, the coefficients satisfy Σπ²_j < ∞ and Σψ²_j < ∞, as opposed to the absolute summability of the coefficients in ARMA processes.

Using the representation (5.4)–(5.5), and after some nontrivial manipulations, it can be shown that the ACF of x_t is

ρ(h) = Γ(h + d)Γ(1 − d) / [Γ(h − d + 1)Γ(d)] ∼ h^{2d−1}    (5.6)

for large h. From this we see that for 0 < d < .5,

Σ_{h=−∞}^{∞} |ρ(h)| = ∞,

and hence the term long memory.

In order to examine a series such as the varve series for a possible long memory pattern, it is convenient to look at ways of estimating d. Using (5.3) it is easy to derive the recursions


π_{j+1}(d) = (j − d)π_j(d) / (j + 1),    (5.7)

for j = 0, 1, . . ., with π₀(d) = 1. Maximizing the joint likelihood of the errors under normality, say, w_t(d), will involve minimizing the sum of squared errors

Q(d) = Σ w²_t(d).

The usual Gauss–Newton method, described in §3.6, leads to the expansion

w_t(d) = w_t(d₀) + w′_t(d₀)(d − d₀),

where

w′_t(d₀) = ∂w_t/∂d |_{d=d₀}

and d₀ is an initial estimate (guess) at the value of d. Setting up the usual regression leads to

d = d₀ − Σ_t w′_t(d₀) w_t(d₀) / Σ_t w′_t(d₀)².    (5.8)

The derivatives are computed recursively by differentiating (5.7) successively with respect to d: π′_{j+1}(d) = [(j − d)π′_j(d) − π_j(d)]/(j + 1), where π′₀(d) = 0. The errors are computed from an approximation to (5.2), namely,

w_t(d) = Σ_{j=0}^{t} π_j(d) x_{t−j}.    (5.9)

It is advisable to omit a number of initial terms from the computation and start the sum, (5.8), at some fairly large value of t to have a reasonable approximation.
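
A minimal sketch of this Gauss–Newton scheme is shown below; it is not the code used in Example 5.1, and the function name and arguments are ours. It assumes a mean-adjusted input series and drops the first burn terms from the sums in (5.8).

# A minimal sketch of the Gauss-Newton scheme (5.7)-(5.9); not the code used
# in Example 5.1. 'x' is a mean-adjusted series; 'burn' gives the number of
# initial terms omitted from the sums in (5.8).
gn.d = function(x, d0 = .1, burn = 30, niter = 10, tol = 1e-5) {
  n = length(x); d = d0
  for (iter in 1:niter) {
    p = dp = numeric(n)
    p[1] = 1; dp[1] = 0                              # pi_0(d) = 1, pi_0'(d) = 0
    for (j in 1:(n-1)) {                             # recursion (5.7) and its derivative
      p[j+1]  = (j - 1 - d)*p[j]/j
      dp[j+1] = ((j - 1 - d)*dp[j] - p[j])/j
    }
    w = dw = numeric(n)
    for (t in 1:n) {                                 # residuals (5.9) and their derivatives
      w[t]  = sum(p[1:t]*x[t:1])
      dw[t] = sum(dp[1:t]*x[t:1])
    }
    keep = (burn + 1):n                              # omit initial terms
    d.new = d - sum(dw[keep]*w[keep])/sum(dw[keep]^2)   # Gauss-Newton update (5.8)
    if (abs(d.new - d) < tol) { d = d.new; break }
    d = d.new
  }
  d
}
# e.g., gn.d(log(varve) - mean(log(varve)))  # should be close to .384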

Example 5.1 Long Memory Fitting of the Glacial Varve Series

We consider analyzing the glacial varve series discussed in Examples 2.6 and 3.32. Figure 2.6 shows the original and log-transformed series (which we denote by x_t). In Example 3.40, we noted that x_t could be modeled as an ARIMA(1, 1, 1) process. We fit the fractionally differenced model, (5.1), to the mean-adjusted series, x_t − x̄. Applying the Gauss–Newton iterative procedure previously described, starting with d = .1 and omitting the first 30 points from the computation, leads to a final value of d = .384, which implies the set of coefficients π_j(.384), as given in Figure 5.2 with π₀(.384) = 1. We can compare roughly the performance of the fractional difference operator with the ARIMA model by examining the autocorrelation functions of the two residual series as shown in Figure 5.3. The ACFs of the two residual series are roughly comparable with the white noise model.

To perform this analysis in R, first download and install the fracdiff package. Then use

[Fig. 5.2. Coefficients π_j(.384), j = 1, 2, . . . , 30 in the representation (5.7).]

1 library(fracdiff)

2 lvarve = log(varve)-mean(log(varve))

3 varve.fd = fracdiff(lvarve, nar=0, nma=0, M=30)

4 varve.fd$d # = 0.3841688

5 varve.fd$stderror.dpq # = 4.589514e-06 (questionable result!!)

6 p = rep(1,31)

7 for (k in 1:30){ p[k+1] = (k-varve.fd$d)*p[k]/(k+1) }

8 plot(1:30, p[-1], ylab=expression(pi(d)), xlab="Index", type="h")

9 res.fd = diffseries(log(varve), varve.fd$d) # frac diff resids

10 res.arima = resid(arima(log(varve), order=c(1,1,1))) # arima resids

11 par(mfrow=c(2,1))

12 acf(res.arima, 100, xlim=c(4,97), ylim=c(-.2,.2), main="")

13 acf(res.fd, 100, xlim=c(4,97), ylim=c(-.2,.2), main="")

The R package uses a truncated maximum likelihood procedure that was discussed in Haslett and Raftery (1989), which is a little more elaborate than simply zeroing out initial values. The default truncation value in R is M = 100. In the default case, the estimate is d̂ = .37 with approximately the same (questionable) standard error. The standard error is (supposedly) obtained from the Hessian as described in Example 3.29.

[Fig. 5.3. ACF of residuals from the ARIMA(1, 1, 1) fit to the logged varve series (top) and of the residuals from the long memory model fit, (1 − B)^d x_t = w_t, with d = .384 (bottom).]

Forecasting long memory processes is similar to forecasting ARIMA models. That is, (5.2) and (5.7) can be used to obtain the truncated forecasts

x̃^n_{n+m} = −Σ_{j=1}^{n} π_j(d̂) x̃^n_{n+m−j},    (5.10)

for m = 1, 2, . . . . Error bounds can be approximated by using


P̃^n_{n+m} = σ̂²_w Σ_{j=0}^{m−1} ψ²_j(d̂),    (5.11)

where, as in (5.7),

ψ_{j+1}(d) = (j + d)ψ_j(d) / (j + 1),    (5.12)

with ψ₀(d) = 1.
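
Given estimates d̂ and σ̂²_w, a minimal sketch (not from the text; the function name and arguments are ours) of the truncated forecasts (5.10) and error bounds (5.11) for the pure fractional-noise model is:

# A minimal sketch of the truncated forecasts (5.10) and error bounds (5.11);
# 'd' and 'sig2w' are assumed estimates, e.g., from the fits above.
fd.fore = function(x, d, sig2w, m.ahead = 10) {
  n = length(x)
  p = numeric(n + 1); p[1] = 1                        # pi_j(d), j = 0,...,n, via (5.7)
  for (j in 1:n) p[j+1] = (j - 1 - d)*p[j]/j
  psi = numeric(m.ahead); psi[1] = 1                  # psi_j(d), j = 0,...,m-1, via (5.12)
  if (m.ahead > 1) for (j in 1:(m.ahead-1)) psi[j+1] = (j - 1 + d)*psi[j]/j
  xx = c(x, numeric(m.ahead))                         # data followed by forecasts
  for (m in 1:m.ahead)                                # truncated forecasts (5.10)
    xx[n+m] = -sum(p[2:(n+1)]*xx[(n+m-1):m])
  list(pred = xx[n + 1:m.ahead],
       se = sqrt(sig2w*cumsum(psi^2)))                # error bounds (5.11)
}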

No obvious short memory ARMA-type component can be seen in the ACF of the residuals from the fractionally differenced varve series shown in Figure 5.3. It is natural, however, that cases will exist in which substantial short memory-type components will also be present in data that exhibits long memory. Hence, it is natural to define the general ARFIMA(p, d, q), −.5 < d < .5, process as

φ(B)∇^d(x_t − µ) = θ(B)w_t,    (5.13)

where φ(B) and θ(B) are as given in Chapter 3. Writing the model in the form

φ(B)π_d(B)(x_t − µ) = θ(B)w_t    (5.14)


makes it clear how we go about estimating the parameters for the more general model. Forecasting for the ARFIMA(p, d, q) series can be easily done, noting that we may equate coefficients in

φ(z)ψ(z) = (1 − z)^{−d} θ(z)    (5.15)

and

θ(z)π(z) = (1 − z)^{d} φ(z)    (5.16)

to obtain the representations

x_t = µ + Σ_{j=0}^{∞} ψ_j w_{t−j}

and

w_t = Σ_{j=0}^{∞} π_j (x_{t−j} − µ).

We then can proceed as discussed in (5.10) and (5.11).

Comprehensive treatments of long memory time series models are given in the texts by Beran (1994), Palma (2007), and Robinson (2003), and it should be noted that several other techniques for estimating the parameters, especially the long memory parameter, can be developed in the frequency domain. In this case, we may think of the equations as generated by an infinite order autoregressive series with coefficients π_j given by (5.7). Using the same approach as before, we obtain

f_x(ω) = σ²_w / |Σ_{k=0}^{∞} π_k e^{−2πikω}|² = σ²_w |1 − e^{−2πiω}|^{−2d} = σ²_w [4 sin²(πω)]^{−d}    (5.17)

as equivalent representations of the spectrum of a long memory process. The long memory spectrum approaches infinity as the frequency ω → 0.
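
The pole at zero is easy to visualize; the following minimal sketch (not from the text) plots the shape in (5.17) for the illustrative values d = .4 and σ²_w = 1.

# A minimal sketch: the long-memory spectral shape (5.17) with illustrative
# values d = .4 and sigma_w^2 = 1, showing the pole at frequency zero.
d = .4
w = seq(.001, .5, by = .001)
f = (4*sin(pi*w)^2)^(-d)
plot(w, log(f), type = "l", xlab = "frequency", ylab = "log(spectrum)")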

The main reason for defining the Whittle approximation to the log likelihood is to propose its use for estimating the parameter d in the long memory case as an alternative to the time domain method previously mentioned. The time domain approach is useful because of its simplicity and easily computed standard errors. One may also use an exact likelihood approach by developing an innovations form of the likelihood as in Brockwell and Davis (1991).

For the approximate approach using the Whittle likelihood (4.78), we consider using the approach of Fox and Taqqu (1986) who showed that maximizing the Whittle log likelihood leads to a consistent estimator with the usual asymptotic normal distribution that would be obtained by treating (4.78) as a conventional log likelihood (see also Dahlhaus, 1989; Robinson, 1995; Hurvich et al., 1998). Unfortunately, the periodogram ordinates are not asymptotically independent (Hurvich and Beltrao, 1993), although a quasi-likelihood in the form of the Whittle approximation works well and has good asymptotic properties.

To see how this would work for the purely long memory case, write the long memory spectrum as

f_x(ω_k; d, σ²_w) = σ²_w g_k^{−d},    (5.18)

where

g_k = 4 sin²(πω_k).    (5.19)

Then, differentiating the log likelihood, say,

ln L(x; d, σ²_w) ≈ −m ln σ²_w + d Σ_{k=1}^{m} ln g_k − (1/σ²_w) Σ_{k=1}^{m} g_k^d I(ω_k)    (5.20)

at m = n/2 − 1 frequencies and solving for σ²_w yields

σ²_w(d) = (1/m) Σ_{k=1}^{m} g_k^d I(ω_k)    (5.21)

as the approximate maximum likelihood estimator for the variance parameter. To estimate d, we can use a grid search of the concentrated log likelihood

ln L(x; d) ≈ −m ln σ²_w(d) + d Σ_{k=1}^{m} ln g_k − m    (5.22)

over the interval (−.5, .5), followed by a Newton–Raphson procedure to convergence.

Example 5.2 Long Memory Spectra for the Varve Series

In Example 5.1, we fit a long memory model to the glacial varve data via time domain methods. Fitting the same model using frequency domain methods and the Whittle approximation above gives d̂ = .380, with an estimated standard error of .028. The earlier time domain method gave d̂ = .384 with M = 30 and d̂ = .370 with M = 100. Both estimates obtained via time domain methods had a standard error of about 4.6 × 10⁻⁶, which seems implausible. The error variance estimate in this case is σ̂²_w = .2293; in Example 5.1, we could have used var(res.fd) as an estimate, in which case we obtain .2298. The R code to perform this analysis is

1 series = log(varve) # specify series to be analyzed

2 d0 = .1 # initial value of d

3 n.per = nextn(length(series))

4 m = (n.per)/2 - 1

5 per = abs(fft(series-mean(series))[-1])^2 # remove 0 freq

6 per = per/n.per # and scale the periodogram

7 g = 4*(sin(pi*((1:m)/n.per))^2)

[Fig. 5.4. Long memory (d = .380, solid line) and autoregressive AR(8) (dashed line) spectral estimators for the paleoclimatic glacial varve series.]

8 # Function to calculate -log.likelihood

9 whit.like = function(d){

10 g.d=g^d

11 sig2 = (sum(g.d*per[1:m])/m)

12 log.like = m*log(sig2) - d*sum(log(g)) + m

13 return(log.like) }

14 # Estimation (?optim for details - output not shown)

15 (est = optim(d0, whit.like, gr=NULL, method="L-BFGS-B",

hessian=TRUE, lower=-.5, upper=.5,

control=list(trace=1,REPORT=1)))

16 # Results: d.hat = .380, se(dhat) = .028, and sig2hat = .229

17 cat("d.hat =", est$par, "se(dhat) = ",1/sqrt(est$hessian),"\n")

18 g.dhat = g^est$par; sig2 = sum(g.dhat*per[1:m])/m

19 cat("sig2hat =",sig2,"\n")

One might also consider fitting an autoregressive model to these data using a procedure similar to that used in Example 4.15. Following this approach gave an autoregressive model with p = 8 and φ̂ = (.34, .11, .04, .09, .08, .08, .02, .09)′, with σ̂²_w = .2267 as the error variance. The two log spectra are plotted in Figure 5.4 for ω > 0, and we note that the long memory spectrum will eventually become infinite, whereas the AR(8) spectrum is finite at ω = 0. The R code used for this part of the example (assuming the previous values have been retained) is

1 u = spec.ar(log(varve), plot=FALSE) # produces AR(8)

2 g = 4*(sin(pi*((1:500)/2000))^2)

3 fhat = sig2*g^{-est$par} # long memory spectral estimate


4 plot(1:500/2000, log(fhat), type="l", ylab="log(spectrum)",

xlab="frequency")

5 lines(u$freq[1:250], log(u$spec[1:250]), lty="dashed")

6 ar.mle(log(varve)) # to get AR(8) estimates

Often, time series are not purely long memory. A common situation has the long memory component multiplied by a short memory component, leading to an alternate version of (5.18) of the form

f_x(ω_k; d, θ) = g_k^{−d} f₀(ω_k; θ),    (5.23)

where f₀(ω_k; θ) might be the spectrum of an autoregressive moving average process with vector parameter θ, or it might be unspecified. If the spectrum has a parametric form, the Whittle likelihood can be used. However, there is a substantial amount of semiparametric literature that develops the estimators when the underlying spectrum f₀(ω; θ) is unknown. A class of Gaussian semi-parametric estimators simply uses the same Whittle likelihood (5.22), evaluated over a sub-band of low frequencies, say m′ = √n. There is some latitude in selecting a band that is relatively free from low frequency interference due to the short memory component in (5.23).

Geweke and Porter–Hudak (1983) developed an approximate method for estimating d based on a regression model, derived from (5.22). Note that we may write a simple equation for the logarithm of the spectrum as

ln f_x(ω_k; d) = ln f₀(ω_k; θ) − d ln[4 sin²(πω_k)],    (5.24)

with the frequencies ω_k = k/n restricted to a range k = 1, 2, . . . , m′ near the zero frequency, with m′ = √n as the recommended value. Relationship (5.24) suggests using a simple linear regression model of the form

ln I(ω_k) = β₀ − d ln[4 sin²(πω_k)] + e_k    (5.25)

for the periodogram to estimate the parameters σ²_w and d. In this case, one performs least squares using ln I(ω_k) as the dependent variable, and ln[4 sin²(πω_k)] as the independent variable for k = 1, 2, . . . , m. The resulting slope estimate is then used as an estimate of −d. For a good discussion of various alternative methods for selecting m, see Hurvich and Deo (1999). The R package fracdiff also provides this method via the command fdGPH(); see the help file for further information.
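
A minimal sketch (not from the text; the function name is ours) of this regression, computed directly from the periodogram with m′ = √n low frequencies, is shown below; fracdiff::fdGPH() automates the same idea.

# A minimal sketch of the GPH regression (5.25) using m' = sqrt(n) frequencies;
# compare with fracdiff::fdGPH().
gph.d = function(x) {
  n = length(x); m = floor(sqrt(n))
  per = (Mod(fft(x - mean(x)))^2/n)[2:(m+1)]     # I(omega_k), omega_k = k/n, k = 1,...,m
  g = 4*sin(pi*(1:m)/n)^2
  -coef(lm(log(per) ~ log(g)))[2]                # minus the slope estimates d
}
# e.g., gph.d(log(varve))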

One of the above two procedures works well for estimating the long memory component but there will be cases (such as ARFIMA) where there will be a parameterized short memory component f₀(ω_k; θ) that needs to be estimated. If the spectrum is highly parameterized, one might estimate using the Whittle log likelihood (5.19) and

f_x(ω_k; θ) = g_k^{−d} f₀(ω_k; θ)

and jointly estimating the parameters d and θ using the Newton–Raphson method. If we are interested in a nonparametric estimator, using the conventional smoothed spectral estimator for the periodogram, adjusted for the long memory component, say g_k^d I(ω_k), might be a possible approach.

5.3 Unit Root Testing

As discussed in the previous section, the use of the first difference ∇x_t = (1 − B)x_t can be too severe a modification in the sense that the nonstationary model might represent an overdifferencing of the original process. For example, consider a causal AR(1) process (we assume throughout this section that the noise is Gaussian),

x_t = φx_{t−1} + w_t.    (5.26)

Applying (1 − B) to both sides shows that differencing, ∇x_t = φ∇x_{t−1} + ∇w_t, or

y_t = φy_{t−1} + w_t − w_{t−1},

where y_t = ∇x_t, introduces extraneous correlation and invertibility problems. That is, while x_t is a causal AR(1) process, working with the differenced process y_t will be problematic because it is a non-invertible ARMA(1, 1).

A unit root test provides a way to test whether (5.26) is a random walk (the null case) as opposed to a causal process (the alternative). That is, it provides a procedure for testing

H₀: φ = 1   versus   H₁: |φ| < 1.

An obvious test statistic would be to consider (φ̂ − 1), appropriately normalized, in the hope to develop an asymptotically normal test statistic, where φ̂ is one of the optimal estimators discussed in Chapter 3, §3.6. Unfortunately, the theory of §3.6 will not work in the null case because the process is nonstationary. Moreover, as seen in Example 3.35, estimation near the boundary of stationarity produces highly skewed sample distributions (see Figure 3.10) and this is a good indication that the problem will be atypical.

To examine the behavior of (φ̂ − 1) under the null hypothesis that φ = 1, or more precisely that the model is a random walk, x_t = Σ_{j=1}^{t} w_j, or x_t = x_{t−1} + w_t with x₀ = 0, consider the least squares estimator of φ. Noting that µ_x = 0, the least squares estimator can be written as

φ̂ = [n^{−1} Σ_{t=1}^{n} x_t x_{t−1}] / [n^{−1} Σ_{t=1}^{n} x²_{t−1}] = 1 + [n^{−1} Σ_{t=1}^{n} w_t x_{t−1}] / [n^{−1} Σ_{t=1}^{n} x²_{t−1}],    (5.27)

where we have written x_t = x_{t−1} + w_t in the numerator; recall that x_0 = 0 and, in the least squares setting, we are regressing x_t on x_{t−1} for t = 1, . . . , n. Hence, under H_0, we have that


φ̂ − 1 = [ (nσ_w^2)^{-1} ∑_{t=1}^n w_t x_{t−1} ] / [ (nσ_w^2)^{-1} ∑_{t=1}^n x_{t−1}^2 ].   (5.28)

Consider the numerator of (5.28). Note first that by squaring both sides of x_t = x_{t−1} + w_t, we obtain x_t^2 = x_{t−1}^2 + 2x_{t−1}w_t + w_t^2, so that

x_{t−1}w_t = (1/2)(x_t^2 − x_{t−1}^2 − w_t^2),

and summing,

(nσ_w^2)^{-1} ∑_{t=1}^n x_{t−1}w_t = (1/2)[ x_n^2/(nσ_w^2) − (nσ_w^2)^{-1} ∑_{t=1}^n w_t^2 ].

Because x_n = ∑_{t=1}^n w_t, we have that x_n ∼ N(0, nσ_w^2), so that (nσ_w^2)^{-1} x_n^2 ∼ χ_1^2, the chi-squared distribution with one degree of freedom. Moreover, because w_t is white Gaussian noise, n^{-1} ∑_{t=1}^n w_t^2 →_p σ_w^2, or (nσ_w^2)^{-1} ∑_{t=1}^n w_t^2 →_p 1. Consequently, as n → ∞,

(nσ_w^2)^{-1} ∑_{t=1}^n x_{t−1}w_t →_d (1/2)(χ_1^2 − 1).   (5.29)

Next we focus on the denominator of (5.28). First, we introduce standard Brownian motion.

Definition 5.1 A continuous time process {W(t); t ≥ 0} is called standard Brownian motion if it satisfies the following conditions:

(i) W(0) = 0;
(ii) {W(t_2) − W(t_1), W(t_3) − W(t_2), . . . , W(t_n) − W(t_{n−1})} are independent for any collection of points 0 ≤ t_1 < t_2 < · · · < t_n and integer n > 2;
(iii) W(t + ∆t) − W(t) ∼ N(0, ∆t) for ∆t > 0.

The result for the denominator uses the functional central limit theorem, which can be found in Billingsley (1999, §2.8). In particular, if ξ_1, . . . , ξ_n is a sequence of iid random variables with mean 0 and variance 1, then, for 0 ≤ t ≤ 1, the continuous time process

S_n(t) = n^{-1/2} ∑_{j=1}^{[[nt]]} ξ_j →_d W(t),   (5.30)

as n → ∞, where [[ ]] is the greatest integer function and W(t) is standard Brownian motion on [0, 1]. Note that under the null hypothesis, x_s = w_1 + · · · + w_s ∼ N(0, sσ_w^2), and based on (5.30), we have x_s/(σ_w√n) →_d W(s). From this fact, we can show that, as n → ∞,

n^{-1} ∑_{t=1}^n [ x_{t−1}/(σ_w√n) ]^2 →_d ∫_0^1 W^2(t) dt.   (5.31)


The denominator in (5.28) is off from the left side of (5.31) by a factor of n^{-1}, and we adjust accordingly to finally obtain, as n → ∞,

n(φ̂ − 1) = [ (nσ_w^2)^{-1} ∑_{t=1}^n w_t x_{t−1} ] / [ (n^2σ_w^2)^{-1} ∑_{t=1}^n x_{t−1}^2 ] →_d [ (1/2)(χ_1^2 − 1) ] / [ ∫_0^1 W^2(t) dt ].   (5.32)

The test statistic n(φ̂ − 1) is known as the unit root or Dickey–Fuller (DF) statistic (see Fuller, 1996), although the actual DF test statistic is normalized a little differently. Because the distribution of the test statistic does not have a closed form, quantiles of the distribution must be computed by numerical approximation or by simulation. The R package tseries provides this test along with more general tests that we mention briefly.
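As an illustration of how such quantiles can be obtained by simulation, the following minimal R sketch generates random walks under the null and evaluates n(φ̂ − 1) as in (5.27); the sample size and number of replications are arbitrary choices for the illustration.

set.seed(666)
n <- 500; nrep <- 1000
stat <- replicate(nrep, {
  x <- cumsum(rnorm(n))              # random walk under H0 (sigma_w = 1)
  num <- sum(x[-n] * diff(x)) / n    # n^{-1} sum_t w_t x_{t-1}
  den <- sum(x[-n]^2) / n^2          # n^{-2} sum_t x_{t-1}^2
  num / den                          # n*(phi.hat - 1)
})
quantile(stat, c(.01, .05, .10))     # approximate lower null quantiles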

Toward a more general model, we note that the DF test was established by noting that if x_t = φx_{t−1} + w_t, then ∇x_t = (φ − 1)x_{t−1} + w_t = γx_{t−1} + w_t, and one could test H_0: γ = 0 by regressing ∇x_t on x_{t−1}. They formed a Wald statistic and derived its limiting distribution [the previous derivation based on Brownian motion is due to Phillips (1987)]. The test was extended to accommodate AR(p) models, x_t = ∑_{j=1}^p φ_j x_{t−j} + w_t, as follows. Subtract x_{t−1} from the model to obtain

∇x_t = γx_{t−1} + ∑_{j=1}^{p−1} ψ_j ∇x_{t−j} + w_t,   (5.33)

where γ = ∑_{j=1}^p φ_j − 1 and ψ_j = −∑_{i=j}^p φ_i for j = 2, . . . , p. For a quick check of (5.33) when p = 2, note that x_t = (φ_1 + φ_2)x_{t−1} − φ_2(x_{t−1} − x_{t−2}) + w_t; now subtract x_{t−1} from both sides. To test the hypothesis that the process has a unit root at 1 (i.e., the AR polynomial φ(z) = 0 when z = 1), we can test H_0: γ = 0 by estimating γ in the regression of ∇x_t on x_{t−1}, ∇x_{t−1}, . . . , ∇x_{t−p+1}, and forming a Wald test based on t_γ = γ̂/se(γ̂). This test leads to the so-called augmented Dickey–Fuller test (ADF). While the calculations for obtaining the asymptotic null distribution change, the basic ideas and machinery remain the same as in the simple case. The choice of p is crucial, and we will discuss some suggestions in the example. For ARMA(p, q) models, the ADF test can be used by assuming p is large enough to capture the essential correlation structure; another alternative is the Phillips–Perron (PP) test, which differs from the ADF test mainly in how it deals with serial correlation and heteroskedasticity in the errors.

One can extend the model to include a constant, or even non-stochastic trend. For example, consider the model

x_t = β_0 + β_1 t + φx_{t−1} + w_t.

If we assume β_1 = 0, then under the null hypothesis, φ = 1, the process is a random walk with drift β_0. Under the alternative hypothesis, the process is a causal AR(1) with mean µ_x = β_0/(1 − φ). If we cannot assume β_1 = 0, then the


interest here is testing the null that (β_1, φ) = (0, 1), simultaneously, versus the alternative that β_1 ≠ 0 and |φ| < 1. In this case, the null hypothesis is that the process is a random walk with drift, versus the alternative hypothesis that the process is stationary around a global trend (consider the global temperature series examined in Example 2.1).

Example 5.3 Testing Unit Roots in the Glacial Varve Series

In this example we use the R package tseries to test the null hypothesis that the log of the glacial varve series has a unit root, versus the alternative hypothesis that the process is stationary. We test the null hypothesis using the available DF, ADF and PP tests; note that in each case, the general regression equation incorporates a constant and a linear trend. In the ADF test, the default number of AR components included in the model, say k, is [[(n − 1)^{1/3}]], which corresponds to the suggested upper bound on the rate at which the number of lags, k, should be made to grow with the sample size for the general ARMA(p, q) setup. For the PP test, the default value of k is [[.04 n^{1/4}]].

library(tseries)
adf.test(log(varve), k=0)  # DF test
  Dickey-Fuller = -12.8572, Lag order = 0, p-value < 0.01
  alternative hypothesis: stationary
adf.test(log(varve))  # ADF test
  Dickey-Fuller = -3.5166, Lag order = 8, p-value = 0.04071
  alternative hypothesis: stationary
pp.test(log(varve))  # PP test
  Dickey-Fuller Z(alpha) = -304.5376,
  Truncation lag parameter = 6, p-value < 0.01
  alternative hypothesis: stationary

In each test, we reject the null hypothesis that the logged varve series has a unit root. The conclusion of these tests supports the conclusion of the previous section that the logged varve series is long memory rather than integrated.

5.4 GARCH Models

Recent problems in finance have motivated the study of the volatility, or variability, of a time series. Although ARMA models assume a constant variance, models such as the autoregressive conditionally heteroscedastic or ARCH model, first introduced by Engle (1982), were developed to model changes in volatility. These models were later extended to generalized ARCH, or GARCH models by Bollerslev (1986).

In §3.8, we discussed the return or growth rate of a series. For example, if x_t is the value of a stock at time t, then the return or relative gain, y_t, of the stock at time t is


y_t = (x_t − x_{t−1}) / x_{t−1}.   (5.34)

Definition (5.34) implies that x_t = (1 + y_t)x_{t−1}. Thus, based on the discussion in §3.8, if the return represents a small (in magnitude) percentage change, then

∇[log(x_t)] ≈ y_t.   (5.35)

Either value, ∇[log(x_t)] or (x_t − x_{t−1})/x_{t−1}, will be called the return, and will be denoted by y_t. It is the study of y_t that is the focus of ARCH, GARCH, and other volatility models. Recently there has been interest in stochastic volatility models and we will discuss these models in Chapter 6 because they are state-space models.
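For a price series stored in a numeric vector x (a hypothetical name), both versions of the return can be computed directly in R; the two agree closely when the changes are small in magnitude.

y.log <- diff(log(x))            # nabla log(x_t), as in (5.35)
y.rel <- diff(x)/x[-length(x)]   # (x_t - x_{t-1})/x_{t-1}, as in (5.34)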

Typically, for financial series, the return y_t does not have a constant conditional variance, and highly volatile periods tend to be clustered together. In other words, there is a strong dependence of sudden bursts of variability in a return on the series' own past. For example, Figure 1.4 shows the daily returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. In this case, as is typical, the return y_t is fairly stable, except for short-term bursts of high volatility.

The simplest ARCH model, the ARCH(1), models the return as

y_t = σ_t ε_t   (5.36)
σ_t^2 = α_0 + α_1 y_{t−1}^2,   (5.37)

where ε_t is standard Gaussian white noise; that is, ε_t ∼ iid N(0, 1). As with ARMA models, we must impose some constraints on the model parameters to obtain desirable properties. One obvious constraint is that α_1 must not be negative, or else σ_t^2 may be negative.

As we shall see, the ARCH(1) models the return as a white noise process with nonconstant conditional variance, and that conditional variance depends on the previous return. First, notice that the conditional distribution of y_t given y_{t−1} is Gaussian:

y_t | y_{t−1} ∼ N(0, α_0 + α_1 y_{t−1}^2).   (5.38)

In addition, it is possible to write the ARCH(1) model as a non-Gaussian AR(1) model in the square of the returns y_t^2. First, rewrite (5.36)-(5.37) as

y_t^2 = σ_t^2 ε_t^2
α_0 + α_1 y_{t−1}^2 = σ_t^2,

and subtract the two equations to obtain

y_t^2 − (α_0 + α_1 y_{t−1}^2) = σ_t^2 ε_t^2 − σ_t^2.

Now, write this equation as

y_t^2 = α_0 + α_1 y_{t−1}^2 + v_t,   (5.39)


where v_t = σ_t^2(ε_t^2 − 1). Because ε_t^2 is the square of a N(0, 1) random variable, ε_t^2 − 1 is a shifted (to have mean zero) χ_1^2 random variable.

To explore the properties of ARCH, we define Y_s = {y_s, y_{s−1}, . . .}. Then, using (5.38), we immediately see that y_t has a zero mean:

E(y_t) = E E(y_t | Y_{t−1}) = E E(y_t | y_{t−1}) = 0.   (5.40)

Because E(y_t | Y_{t−1}) = 0, the process y_t is said to be a martingale difference.

Because y_t is a martingale difference, it is also an uncorrelated sequence. For example, with h > 0,

cov(y_{t+h}, y_t) = E(y_t y_{t+h}) = E E(y_t y_{t+h} | Y_{t+h−1}) = E{ y_t E(y_{t+h} | Y_{t+h−1}) } = 0.   (5.41)

The last line of (5.41) follows because y_t belongs to the information set Y_{t+h−1} for h > 0, and E(y_{t+h} | Y_{t+h−1}) = 0, as determined in (5.40).

An argument similar to (5.40) and (5.41) will establish the fact that the error process v_t in (5.39) is also a martingale difference and, consequently, an uncorrelated sequence. If the variance of v_t is finite and constant with respect to time, and 0 ≤ α_1 < 1, then based on Property 3.1, (5.39) specifies a causal AR(1) process for y_t^2. Therefore, E(y_t^2) and var(y_t^2) must be constant with respect to time t. This implies that

E(y_t^2) = var(y_t) = α_0 / (1 − α_1)   (5.42)

and, after some manipulations,

E(y_t^4) = [ 3α_0^2 / (1 − α_1)^2 ] · [ (1 − α_1^2) / (1 − 3α_1^2) ],   (5.43)

provided 3α_1^2 < 1. These results imply that the kurtosis, κ, of y_t is

κ = E(y_t^4) / [E(y_t^2)]^2 = 3 (1 − α_1^2) / (1 − 3α_1^2),   (5.44)

which is always larger than 3 (unless α_1 = 0), the kurtosis of the normal distribution. Thus, the marginal distribution of the returns, y_t, is leptokurtic, or has “fat tails.”

In summary, an ARCH(1) process, y_t, as given by (5.36)-(5.37), or equivalently (5.38), is characterized by the following properties.

• If 0 ≤ α_1 < 1, the process y_t itself is white noise and its unconditional distribution is symmetrically distributed around zero; this distribution is leptokurtic.
• If, in addition, 3α_1^2 < 1, the square of the process, y_t^2, follows a causal AR(1) model with ACF given by ρ_{y^2}(h) = α_1^h ≥ 0, for all h > 0. If 3α_1^2 ≥ 1, but α_1 < 1, then y_t^2 is strictly stationary with infinite variance.


Estimation of the parameters α_0 and α_1 of the ARCH(1) model is typically accomplished by conditional MLE. The conditional likelihood of the data y_2, . . . , y_n given y_1 is given by

L(α_0, α_1 | y_1) = ∏_{t=2}^n f_{α_0,α_1}(y_t | y_{t−1}),   (5.45)

where the density f_{α_0,α_1}(y_t | y_{t−1}) is the normal density specified in (5.38). Hence, the criterion function to be minimized, l(α_0, α_1) ∝ − ln L(α_0, α_1 | y_1), is given by

l(α_0, α_1) = (1/2) ∑_{t=2}^n ln(α_0 + α_1 y_{t−1}^2) + (1/2) ∑_{t=2}^n [ y_t^2 / (α_0 + α_1 y_{t−1}^2) ].   (5.46)

Estimation is accomplished by numerical methods, as described in §3.6. In this case, analytic expressions for the gradient vector, l^{(1)}(α_0, α_1), and Hessian matrix, l^{(2)}(α_0, α_1), as described in Example 3.29, can be obtained by straightforward calculations. For example, the 2 × 1 gradient vector, l^{(1)}(α_0, α_1), is given by

(∂l/∂α_0, ∂l/∂α_1)′ = ∑_{t=2}^n (1, y_{t−1}^2)′ × [ α_0 + α_1 y_{t−1}^2 − y_t^2 ] / [ 2(α_0 + α_1 y_{t−1}^2)^2 ].

The calculation of the Hessian matrix is left as an exercise (Problem 5.9). The likelihood of the ARCH model tends to be flat unless n is very large. A discussion of this problem can be found in Shephard (1996).
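As a rough numerical illustration of minimizing (5.46), one can code the criterion directly and hand it to a general-purpose optimizer; the returns are assumed to be in a vector y (a hypothetical name), and the starting values and bounds are arbitrary choices. A packaged routine such as fGarch::garchFit() would normally be used instead.

arch1.nll <- function(par, y) {               # criterion (5.46)
  a0 <- par[1]; a1 <- par[2]
  sig2 <- a0 + a1 * y[-length(y)]^2           # sigma_t^2 for t = 2, ..., n
  .5 * sum(log(sig2) + y[-1]^2 / sig2)
}
fit <- optim(c(.1, .1), arch1.nll, y = y, method = "L-BFGS-B",
             lower = c(1e-6, 0), upper = c(Inf, .99))
fit$par                                       # estimates of alpha0 and alpha1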

It is also possible to combine a regression or an ARMA model for the mean with an ARCH model for the errors. For example, a regression with ARCH(1) errors model would have the observations x_t as a linear function of p regressors, z_t = (z_{t1}, . . . , z_{tp})′, and ARCH(1) noise y_t, say,

x_t = β′z_t + y_t,

where y_t satisfies (5.36)-(5.37), but, in this case, is unobserved. Similarly, for example, an AR(1) model for data x_t exhibiting ARCH(1) errors would be

x_t = φ_0 + φ_1 x_{t−1} + y_t.

These types of models were explored by Weiss (1984).

Example 5.4 Analysis of U.S. GNP

In Example 3.38, we fit an MA(2) model and an AR(1) model to the U.S. GNP series and we concluded that the residuals from both fits appeared to behave like a white noise process. In Example 3.42 we concluded that the AR(1) is probably the better model in this case. It has been suggested that the U.S. GNP series has ARCH errors, and in this example, we will investigate this claim. If the GNP noise term is ARCH, the squares of


Fig. 5.5. ACF and PACF of the squares of the residuals from the AR(1) fit on U.S. GNP.

the residuals from the fit should behave like a non-Gaussian AR(1) process, as pointed out in (5.39). Figure 5.5 shows the ACF and PACF of the squared residuals; it appears that there may be some dependence, albeit small, left in the residuals. The figure was generated in R as follows.

gnpgr = diff(log(gnp))  # get the growth rate
sarima(gnpgr, 1, 0, 0)  # fit an AR(1)
acf2(innov^2, 24)       # get (p)acf of the squared residuals

We used the R package fGarch to fit an AR(1)-ARCH(1) model to the U.S. GNP returns with the following results. A partial output is shown; we note that garch(1,0) specifies an ARCH(1) in the code below (details later).

library(fGarch)
summary(garchFit(~arma(1,0)+garch(1,0), gnpgr))
          Estimate  Std. Error  t value  Pr(>|t|)
mu       5.278e-03   8.996e-04    5.867  4.44e-09
ar1      3.666e-01   7.514e-02    4.878  1.07e-06
omega    7.331e-05   9.011e-06    8.135  4.44e-16
alpha1   1.945e-01   9.554e-02    2.035    0.0418


Standardised Residuals Tests: Statistic p-Value

Jarque-Bera Test R Chi^2 9.118036 0.01047234

Shapiro-Wilk Test R W 0.9842407 0.01433690

Ljung-Box Test R Q(10) 9.874326 0.4515875

Ljung-Box Test R Q(15) 17.55855 0.2865844

Ljung-Box Test R Q(20) 23.41363 0.2689437

Ljung-Box Test R^2 Q(10) 19.2821 0.03682246

Ljung-Box Test R^2 Q(15) 33.23648 0.004352736

Ljung-Box Test R^2 Q(20) 37.74259 0.009518992

LM Arch Test R TR^2 25.41625 0.01296901

In this example, we obtain φ̂_0 = .005 (called mu in the output) and φ̂_1 = .367 (called ar1) for the AR(1) parameter estimates; in Example 3.38 the values were .005 and .347, respectively. The ARCH(1) parameter estimates are α̂_0 = 0 (called omega) for the constant and α̂_1 = .195, which is significant with a p-value of about .04. There are a number of tests that are performed on the residuals [R] or the squared residuals [R^2]. For example, the Jarque–Bera statistic tests the residuals of the fit for normality based on the observed skewness and kurtosis, and it appears that the residuals have some non-normal skewness and kurtosis. The Shapiro–Wilk statistic tests the residuals of the fit for normality based on the empirical order statistics. The other tests, primarily based on the Q-statistic, are used on the residuals and their squares.
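These diagnostics can also be reproduced directly from the fitted object; for example, assuming the fitted model has been saved under a name, say fit (a hypothetical name), a Ljung–Box check on the squared standardized residuals is

r <- residuals(fit, standardize = TRUE)   # standardized residuals
Box.test(r^2, lag = 10, type = "Ljung")   # compare with the R^2 Q(10) line above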

The ARCH(1) model can be extended to the general ARCH(m) model in an obvious way. That is, (5.36), y_t = σ_t ε_t, is retained, but (5.37) is extended to

σ_t^2 = α_0 + α_1 y_{t−1}^2 + · · · + α_m y_{t−m}^2.   (5.47)

Estimation for ARCH(m) also follows in an obvious way from the discussion of estimation for ARCH(1) models. That is, the conditional likelihood of the data y_{m+1}, . . . , y_n given y_1, . . . , y_m is given by

L(α | y_1, . . . , y_m) = ∏_{t=m+1}^n f_α(y_t | y_{t−1}, . . . , y_{t−m}),   (5.48)

where α = (α_0, α_1, . . . , α_m) and the conditional densities f_α(·|·) in (5.48) are normal densities; that is, for t > m,

y_t | y_{t−1}, . . . , y_{t−m} ∼ N(0, α_0 + α_1 y_{t−1}^2 + · · · + α_m y_{t−m}^2).

Another extension of ARCH is the generalized ARCH or GARCH model developed by Bollerslev (1986). For example, a GARCH(1, 1) model retains (5.36), y_t = σ_t ε_t, but extends (5.37) as follows:

σ_t^2 = α_0 + α_1 y_{t−1}^2 + β_1 σ_{t−1}^2.   (5.49)

Under the condition that α_1 + β_1 < 1, using similar manipulations as in (5.39), the GARCH(1, 1) model, (5.36) and (5.49), admits a non-Gaussian ARMA(1, 1) model for the squared process


y_t^2 = α_0 + (α_1 + β_1) y_{t−1}^2 + v_t − β_1 v_{t−1},   (5.50)

where v_t is as defined in (5.39). Representation (5.50) follows by writing (5.36) as

y_t^2 − σ_t^2 = σ_t^2(ε_t^2 − 1)
β_1(y_{t−1}^2 − σ_{t−1}^2) = β_1 σ_{t−1}^2(ε_{t−1}^2 − 1),

subtracting the second equation from the first, and using the fact that, from (5.49), σ_t^2 − β_1 σ_{t−1}^2 = α_0 + α_1 y_{t−1}^2, on the left-hand side of the result. The GARCH(m, r) model retains (5.36) and extends (5.49) to

σ_t^2 = α_0 + ∑_{j=1}^m α_j y_{t−j}^2 + ∑_{j=1}^r β_j σ_{t−j}^2.   (5.51)

Conditional maximum likelihood estimation of the GARCH(m, r) model parameters is similar to the ARCH(m) case, wherein the conditional likelihood, (5.48), is the product of N(0, σ_t^2) densities with σ_t^2 given by (5.51) and where the conditioning is on the first max(m, r) observations, with σ_1^2 = · · · = σ_r^2 = 0. Once the parameter estimates are obtained, the model can be used to obtain one-step-ahead forecasts of the volatility, say σ̂_{t+1}^2, given by

σ̂_{t+1}^2 = α̂_0 + ∑_{j=1}^m α̂_j y_{t+1−j}^2 + ∑_{j=1}^r β̂_j σ̂_{t+1−j}^2.   (5.52)
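For the GARCH(1, 1) case, the recursion (5.51) and the forecast (5.52) are simple enough to compute by hand; the sketch below assumes returns y and parameter estimates a0, a1, b1 are available (hypothetical names) and initializes σ_1^2 = 0, as in the conditioning described above.

garch11.forecast <- function(y, a0, a1, b1) {
  n <- length(y)
  sig2 <- numeric(n)
  sig2[1] <- 0                                        # conditioning: sigma_1^2 = 0
  for (t in 2:n) sig2[t] <- a0 + a1*y[t-1]^2 + b1*sig2[t-1]
  a0 + a1*y[n]^2 + b1*sig2[n]                         # one-step-ahead forecast (5.52)
}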

We explore these concepts in the following example.

Example 5.5 GARCH Analysis of the NYSE Returns

As previously mentioned, the daily returns of the NYSE shown in Figure 1.4 exhibit classic GARCH features. We used the R fGarch package to fit a GARCH(1, 1) model to the series with the following results:

library(fGarch)
summary(nyse.g <- garchFit(~garch(1,1), nyse))
          Estimate  Std. Error  t value  Pr(>|t|)
mu       7.369e-04   1.786e-04    4.126  3.69e-05
omega    6.542e-06   1.455e-06    4.495  6.94e-06
alpha1   1.141e-01   1.604e-02    7.114  1.13e-12
beta1    8.061e-01   2.973e-02   27.112   < 2e-16

Standardised Residuals Tests:
                                Statistic       p-Value
Jarque-Bera Test   R   Chi^2     3628.415             0
Shapiro-Wilk Test  R   W        0.9515562             0
Ljung-Box Test     R   Q(10)     29.69242  0.0009616813
Ljung-Box Test     R   Q(15)     30.50938    0.01021164
Ljung-Box Test     R   Q(20)     32.81143    0.03538324


Fig. 5.6. GARCH predictions of the NYSE volatility, ±2σ̂_t, displayed as dashed lines.

Ljung-Box Test R^2 Q(10) 3.510505 0.9667405

Ljung-Box Test R^2 Q(15) 4.408852 0.9960585

Ljung-Box Test R^2 Q(20) 6.68935 0.9975864

LM Arch Test R TR^2 3.967784 0.9840107

To explore the GARCH predictions of volatility, we calculated and plotted the 100 observations from the middle of the data (which includes the October 19, 1987 crash) along with the one-step-ahead predictions of the corresponding volatility, σ̂_t^2. The results are displayed as the data ±2σ̂_t as a dashed line surrounding the data in Figure 5.6.

u = nyse.g@sigma.t
plot(window(nyse, start=900, end=1000), ylim=c(-.22,.2), ylab="NYSE Returns")
lines(window(nyse-2*u, start=900, end=1000), lty=2, col=4)
lines(window(nyse+2*u, start=900, end=1000), lty=2, col=4)

Some key points can be gleaned from the examples of this section. First, it is apparent that the conditional distribution of the returns is rarely normal. fGarch allows for various distributions to be fit to the data; see the help file for information. Some drawbacks of the GARCH model are: (i) the model assumes positive and negative returns have the same effect because volatility depends on squared returns; (ii) the model is restrictive because of the tight constraints on the model parameters (e.g., for an ARCH(1), 0 ≤ α_1^2 < 1/3); (iii) the likelihood is flat unless n is very large; (iv) the model tends to overpredict volatility because it responds slowly to large isolated returns.


Various extensions to the original model have been proposed to overcome some of the shortcomings we have just mentioned. For example, we have already discussed the fact that fGarch will fit some non-normal, albeit symmetric, distributions. For asymmetric return dynamics, one can use the EGARCH (exponential GARCH) model, which is a complex model that has different components for positive returns and for negative returns. In the case of persistence in volatility, the integrated GARCH (IGARCH) model may be used. Recall (5.50), where we showed the GARCH(1, 1) model can be written as

y_t^2 = α_0 + (α_1 + β_1) y_{t−1}^2 + v_t − β_1 v_{t−1}

and y_t^2 is stationary if α_1 + β_1 < 1. The IGARCH model sets α_1 + β_1 = 1, in which case the IGARCH(1, 1) model is

y_t = σ_t ε_t  and  σ_t^2 = α_0 + (1 − β_1) y_{t−1}^2 + β_1 σ_{t−1}^2.

There are many different extensions to the basic ARCH model that were developed to handle the various situations noticed in practice. Interested readers might find the general discussions in Engle et al. (1994) and Shephard (1996) worthwhile reading. Also, Gourieroux (1997) gives a detailed presentation of ARCH and related models with financial applications and contains an extensive bibliography. Two excellent texts on financial time series analysis are Chan (2002) and Tsay (2002).

Finally, we briefly discuss stochastic volatility models; a detailed treatment of these models is given in Chapter 6. The volatility component, σ_t^2, in the GARCH model is conditionally nonstochastic. In the ARCH(1) model, for example, any time the previous return is zero, i.e., y_{t−1} = 0, it must be the case that σ_t^2 = α_0, and so on. This assumption seems a bit unrealistic. The stochastic volatility model adds a stochastic component to the volatility in the following way. In the GARCH model, a return, say y_t, is

y_t = σ_t ε_t  ⇒  log y_t^2 = log σ_t^2 + log ε_t^2.   (5.53)

Thus, the observations log y_t^2 are generated by two components, the unobserved volatility log σ_t^2 and the unobserved noise log ε_t^2. While, for example, the GARCH(1, 1) models volatility without error, σ_{t+1}^2 = α_0 + α_1 y_t^2 + β_1 σ_t^2, the basic stochastic volatility model assumes the latent variable is an autoregressive process,

log σ_{t+1}^2 = φ_0 + φ_1 log σ_t^2 + w_t,   (5.54)

where w_t ∼ iid N(0, σ_w^2). The introduction of the noise term w_t makes the latent volatility process stochastic. Together (5.53) and (5.54) comprise the stochastic volatility model. Given n observations, the goals are to estimate the parameters φ_0, φ_1 and σ_w^2, and then predict future observations log y_{n+m}^2. Details are provided in §6.10.
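A quick way to get a feel for (5.53)-(5.54) is to simulate from the model; the parameter values below are arbitrary choices for the illustration.

set.seed(90210)
n <- 1000; phi0 <- -.2; phi1 <- .95; sw <- .3
h <- numeric(n); h[1] <- phi0/(1 - phi1)    # log sigma_t^2, started at its mean
for (t in 2:n) h[t] <- phi0 + phi1*h[t-1] + rnorm(1, 0, sw)
y <- exp(h/2) * rnorm(n)                    # returns y_t = sigma_t * epsilon_t
plot.ts(y)                                  # exhibits volatility clustering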


Fig. 5.7. U.S. monthly pneumonia and influenza deaths per 10,000.

5.5 Threshold Models

In §3.5 we discussed the fact that, for a stationary time series, best linear prediction forward in time is the same as best linear prediction backward in time. This result followed from the fact that the variance–covariance matrix of x_{1:n} = (x_1, x_2, . . . , x_n)′, say, Γ = {γ(i − j)}_{i,j=1}^n, is the same as the variance–covariance matrix of x_{n:1} = (x_n, x_{n−1}, . . . , x_1)′. In addition, if the process is Gaussian, the distributions of x_{1:n} and x_{n:1} are identical. In this case, a time plot of x_{1:n} (that is, the data plotted forward in time) should look similar to a time plot of x_{n:1} (that is, the data plotted backward in time).

There are, however, many series that do not fit into this category. For example, Figure 5.7 shows a plot of monthly pneumonia and influenza deaths per 10,000 in the U.S. for 11 years, 1968 to 1978. Typically, the number of deaths tends to increase more slowly than it decreases. Thus, if the data were plotted backward in time, the backward series would tend to increase faster than it decreases. Also, if monthly pneumonia and influenza deaths followed a linear Gaussian process, we would not expect to see such large bursts of positive and negative changes that occur periodically in this series. Moreover, although the number of deaths is typically largest during the winter months, the data are not perfectly seasonal. That is, although the peak of the series often occurs in January, in other years, the peak occurs in February or in March.
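The time-irreversibility just described is easy to inspect graphically; for example, for the flu series analyzed below, one can compare the data plotted forward and backward in time.

par(mfrow = c(2,1))
plot(flu, main = "forward")
plot(ts(rev(flu), start = start(flu), frequency = frequency(flu)),
     ylab = "flu", main = "reversed")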


Many approaches to modeling nonlinear series exist that could be used (see Priestley, 1988); here, we focus on the class of threshold autoregressive models presented in Tong (1983, 1990). The basic idea of these models is that of fitting local linear AR(p) models, and their appeal is that we can use the intuition from fitting global linear AR(p) models. Suppose we know p, and given the vectors x_{t−1} = (x_{t−1}, . . . , x_{t−p})′, we can identify r mutually exclusive and exhaustive regions for x_{t−1}, say, R_1, . . . , R_r, where the dynamics of the system changes. The threshold model is then written as r AR(p) models,

x_t = α^{(j)} + φ_1^{(j)} x_{t−1} + · · · + φ_p^{(j)} x_{t−p} + w_t^{(j)},   x_{t−1} ∈ R_j,   (5.55)

for j = 1, . . . , r. In (5.55), the w_t^{(j)} are independent white noise series, each with variance σ_j^2, for j = 1, . . . , r. Model estimation, identification, and diagnostics proceed as in the case in which r = 1.

Example 5.6 Threshold Modeling of the Influenza Series

As previously discussed, examination of Figure 5.7 leads us to believe that the monthly pneumonia and influenza deaths time series, say flu_t, is not linear. It is also evident from Figure 5.7 that there is a slight negative trend in the data. We have found that the most convenient way to fit a threshold model to this data set, while removing the trend, is to work with the first difference of the data. The differenced data,

x_t = flu_t − flu_{t−1},

is exhibited in Figure 5.8 as the dark solid line with circles representing observations. The dashed line with squares in Figure 5.8 are the one-month-ahead predictions, and we will discuss this series later.

The nonlinearity of the data is more pronounced in the plot of the first differences, x_t. Clearly, the change in the number of deaths, x_t, slowly rises for some months and, then, sometime in the winter, has a possibility of jumping to a large number once x_t exceeds about .05. If the process does make a large jump, then a subsequent significant decrease occurs in flu deaths. As an initial analysis, we fit the following threshold model

x_t = α^{(1)} + ∑_{j=1}^p φ_j^{(1)} x_{t−j} + w_t^{(1)},   x_{t−1} < .05,

x_t = α^{(2)} + ∑_{j=1}^p φ_j^{(2)} x_{t−j} + w_t^{(2)},   x_{t−1} ≥ .05,   (5.56)

with p = 6, assuming this would be larger than necessary.

Model (5.56) is easy to fit using two linear regression runs. That is, let δ_t^{(1)} = 1 if x_{t−1} < .05, and zero otherwise, and let δ_t^{(2)} = 1 if x_{t−1} ≥ .05, and zero otherwise. Then, using the notation of §2.2, for t = p + 1, . . . , n, either equation in (5.56) can be written as


Fig. 5.8. First differenced U.S. monthly pneumonia and influenza deaths (triangles); one-month-ahead predictions (dashed line) with ±2 prediction error bounds (solid line).

y_t = β′z_t + w_t

where, for i = 1, 2,

y_t = δ_t^{(i)} x_t,   z_t′ = δ_t^{(i)}(1, x_{t−1}, . . . , x_{t−p}),   w_t = δ_t^{(i)} w_t^{(i)},

and

β′ = (α^{(i)}, φ_1^{(i)}, φ_2^{(i)}, . . . , φ_p^{(i)}).

Parameter estimates can then be obtained using the regression techniques of §2.2 twice, once for i = 1 and again for i = 2.

An order p = 4 was finally selected for each part of the model. The final model was

x_t = .51_{(.08)} x_{t−1} − .20_{(.06)} x_{t−2} + .12_{(.05)} x_{t−3} − .11_{(.05)} x_{t−4} + w_t^{(1)},  when x_{t−1} < .05,

x_t = .40 − .75_{(.17)} x_{t−1} − 1.03_{(.21)} x_{t−2} − 2.05_{(1.05)} x_{t−3} − 6.71_{(1.25)} x_{t−4} + w_t^{(2)},  when x_{t−1} ≥ .05,


where σ̂_1 = .05 and σ̂_2 = .07. The threshold of .05 was exceeded 17 times. Using the final model, one-month-ahead predictions can be made, and these are shown in Figure 5.8 as a dashed line with squares. The model does extremely well at predicting a flu epidemic; the peak at 1976, however, was missed by this model. When we fit a model with a smaller threshold of .04, flu epidemics were somewhat underestimated, but the flu epidemic in the eighth year was predicted one month early. We chose the model with a threshold of .05 because the residual diagnostics showed no obvious departure from the model assumption (except for one outlier at 1976); the model with a threshold of .04 still had some correlation left in the residuals and there was more than one outlier. Finally, prediction beyond one-month-ahead for this model is very complicated, but some approximate techniques exist (see Tong, 1983).

The following commands can be used to perform this analysis in R. The first few commands generate Figure 5.7.

# Plot data with month initials as points
plot(flu, type="c")
Months = c("J","F","M","A","M","J","J","A","S","O","N","D")
points(flu, pch=Months, cex=.8, font=2)
# Start analysis
dflu = diff(flu)
thrsh = .05  # threshold
Z = ts.intersect(dflu, lag(dflu,-1), lag(dflu,-2), lag(dflu,-3), lag(dflu,-4))
ind1 = ifelse(Z[,2] < thrsh, 1, NA)  # indicator < thrsh
ind2 = ifelse(Z[,2] < thrsh, NA, 1)  # indicator >= thrsh
X1 = Z[,1]*ind1
X2 = Z[,1]*ind2
summary(fit1 <- lm(X1~Z[,2:5]))  # case 1
summary(fit2 <- lm(X2~Z[,2:5]))  # case 2
D = cbind(rep(1, nrow(Z)), Z[,2:5])  # get predictions
b1 = fit1$coef
b2 = fit2$coef
p1 = D%*%b1
p2 = D%*%b2
prd = ifelse(Z[,2] < thrsh, p1, p2)
plot(dflu, type="p", pch=2, ylim=c(-.5,.5))
lines(prd, lty=4)
prde1 = sqrt(sum(resid(fit1)^2)/df.residual(fit1))
prde2 = sqrt(sum(resid(fit2)^2)/df.residual(fit2))
prde = ifelse(Z[,2] < thrsh, prde1, prde2)
lines(prd + 2*prde)
lines(prd - 2*prde)


5.6 Regression with Autocorrelated Errors

In §2.2, we covered the classical regression model with uncorrelated errors w_t. In this section, we discuss the modifications that might be considered when the errors are correlated. That is, consider the regression model

y_t = β′z_t + x_t,   (5.57)

where β is an r × 1 vector of regression parameters and z_t is an r × 1 vector of regressors (fixed inputs), for t = 1, . . . , n, and x_t is a process with some covariance function γ(s, t). Then, we have the matrix form

y = Zβ + x,   (5.58)

where x = (x_1, . . . , x_n)′ is an n × 1 vector with n × n covariance matrix Γ = {γ(s, t)}, assumed to be positive definite. Note that Z = [z_1, z_2, . . . , z_n]′ is the n × r matrix of input variables, as before. If we know the covariance matrix Γ, it is possible to find a transformation matrix A such that AΓA′ = I, where I denotes the n × n identity matrix. Then, the underlying model can be transformed into

Ay = AZβ + Ax = Uβ + w,

where U = AZ and w is a white noise vector with covariance matrix I. Then, applying least squares or maximum likelihood to the vector Ay gives what is typically called the weighted least squares estimator,

β̂_w = (U′U)^{-1}U′Ay = (Z′A′AZ)^{-1}Z′A′Ay = (Z′Γ^{-1}Z)^{-1}Z′Γ^{-1}y,   (5.59)

because Γ^{-1} = A′A; the variance-covariance matrix of the estimator is

var(β̂_w) = (Z′Γ^{-1}Z)^{-1}.
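In matrix form, (5.59) is immediate to compute when Γ is known; the sketch below assumes y, Z, and Gamma are an n × 1 vector, an n × r matrix, and the n × n error covariance matrix, respectively (all hypothetical objects).

Ginv <- solve(Gamma)
beta.w <- solve(t(Z) %*% Ginv %*% Z, t(Z) %*% Ginv %*% y)  # (5.59)
var.beta.w <- solve(t(Z) %*% Ginv %*% Z)                   # var-cov matrix of beta.w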

The difficulty in applying (5.59) is that we do not know the form of the matrix Γ. It may be possible, however, in the time series case, to assume a stationary covariance structure for the error process x_t that corresponds to a linear process and try to find an ARMA representation for x_t. For example, if we have a pure AR(p) error, then

φ(B)x_t = w_t,

and φ(B) is the linear transformation that, when applied to the error process, produces the white noise w_t. Regarding this transformation as the appropriate matrix A of the preceding paragraph produces the transformed regression equation

φ(B)y_t = β′φ(B)z_t + w_t,


and we are back to the same model as before. Defining u_t = φ(B)y_t and v_t = φ(B)z_t leads to the simple regression problem

u_t = β′v_t + w_t   (5.60)

considered before. The preceding discussion suggests an algorithm, due to Cochrane and Orcutt (1949), for fitting a regression model with autocorrelated errors.

(i) First, run an ordinary regression of y_t on z_t (acting as if the errors are uncorrelated). Retain the residuals.
(ii) Fit an ARMA model to the residuals x̂_t = y_t − β̂′z_t, say,

φ(B)x̂_t = θ(B)w_t.   (5.61)

(iii) Then, apply the ARMA transformation to both sides of (5.57), that is,

u_t = [φ̂(B)/θ̂(B)] y_t  and  v_t = [φ̂(B)/θ̂(B)] z_t,

to obtain the transformed regression model (5.60).
(iv) Run an ordinary least squares regression assuming uncorrelated errors on the transformed regression model (5.60), obtaining

β̂_w = (V′V)^{-1}V′u,   (5.62)

where V = [v_1, . . . , v_n]′ and u = (u_1, . . . , u_n)′ are the corresponding transformed components.

The above procedure can be repeated until convergence and will approach the maximum likelihood solution under normality of the errors (for details, see Sargan, 1964).
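A minimal sketch of one such iteration, for the special case of AR(1) errors and a single regressor, is given below; y and z are hypothetical data vectors, and in practice one would iterate or simply use the maximum likelihood approach described next.

fit0 <- lm(y ~ z)                                      # step (i): ordinary regression
phi  <- arima(resid(fit0), order = c(1,0,0),
              include.mean = FALSE)$coef[1]            # step (ii): AR(1) fit to residuals
u <- y[-1] - phi * y[-length(y)]                       # step (iii): apply (1 - phi B)
v <- z[-1] - phi * z[-length(z)]
fit1 <- lm(u ~ v)                                      # step (iv): regression on transformed data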

A more modern approach to the problem is to follow steps (i) and (ii) of the above procedure to identify a model, then fit the model using maximum likelihood. For maximum likelihood estimation, note that from (5.57), x_t(β) = y_t − β′z_t is ARMA(p, q), so we substitute x_t(β) into the innovations likelihood given in (3.116), then (numerically) find the MLEs of all the regression parameters, {β_1, . . . , β_r, φ_1, . . . , φ_p, θ_1, . . . , θ_q}, and σ_w^2, simultaneously. Finally, inspect the residuals for whiteness, and adjust the model if necessary.

Example 5.7 Mortality, Temperature and Pollution

We consider the analyses presented in Example 2.2, relating mean adjusted temperature T_t and particulate levels P_t to cardiovascular mortality M_t. We consider the regression model

M_t = β_1 + β_2 t + β_3 T_t + β_4 T_t^2 + β_5 P_t + x_t,   (5.63)


Fig. 5.9. Sample ACF and PACF of the mortality residuals indicating an AR(2) process.

where, for now, we assume that x_t is white noise. The sample ACF and PACF of the residuals from the ordinary least squares fit of (5.63) are shown in Figure 5.9, and the results suggest an AR(2) model for the residuals.

Our next step is to fit the correlated error model

M_t = β_1 + β_2 t + β_3 T_t + β_4 T_t^2 + β_5 P_t + x_t,   (5.64)

where x_t is AR(2),

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + w_t,

and w_t is white noise. The model can be fit using the arima function as follows (partial output shown).

trend = time(cmort)
temp = tempr - mean(tempr)
temp2 = temp^2
fit = lm(cmort~trend + temp + temp2 + part, na.action=NULL)
acf2(resid(fit), 52)  # implies AR2
(fit2 = arima(cmort, order=c(2,0,0), xreg=cbind(trend, temp, temp2, part)))
Coefficients:
         ar1     ar2  intercept    trend     temp   temp2    part
      0.3848  0.4326    80.2116  -1.5165  -0.0190  0.0154  0.1545
s.e.  0.0436  0.0400     1.8072   0.4226   0.0495  0.0020  0.0272
sigma^2 estimated as 26.01:  log likelihood = -1549.04,  aic = 3114.07


acf2(resid(fit2), 52)  # no obvious departures from whiteness
(Q = Box.test(resid(fit2), 12, type="Ljung")$statistic)  # = 6.91
pchisq(as.numeric(Q), df=(12-2), lower=FALSE)  # p-value = .73

The sample ACF and PACF of the residuals (not shown) show no obvious departure from whiteness. The Q-statistic supports the whiteness of the residuals (note the residuals are from an AR(2) fit, and this is used in the calculation of the p-value). We also note that the trend is no longer significant, and we may wish to rerun the analysis with the trend removed.

5.7 Lagged Regression: Transfer Function Modeling

In §3.9, we considered lagged regression in a frequency domain approach based on coherency. In this section we focus on a time domain approach to the same problem. In the previous section, we looked at autocorrelated errors but still regarded the input series z_t as being fixed unknown functions of time. This consideration made sense for the time argument t, but was less satisfactory for the other inputs, which are probably stochastic processes. For example, consider the SOI and Recruitment series that were presented in Example 1.5. The series are displayed in Figure 1.5. In this case, the interest is in predicting the output Recruitment series, say, y_t, from the input SOI, say x_t. We might consider the lagged regression model

y_t = ∑_{j=0}^∞ α_j x_{t−j} + η_t = α(B)x_t + η_t,   (5.65)

where ∑_j |α_j| < ∞. We assume the input process x_t and noise process η_t in (5.65) are both stationary and mutually independent. The coefficients α_0, α_1, . . . describe the weights assigned to past values of x_t used in predicting y_t, and we have used the notation

α(B) = ∑_{j=0}^∞ α_j B^j.   (5.66)

In the Box and Jenkins (1970) formulation, we assign ARIMA models, say, ARIMA(p, d, q) and ARIMA(p_η, d_η, q_η), to the series x_t and η_t, respectively. The components of (5.65) in backshift notation, for the case of simple ARMA(p, q) modeling of the input and noise, would have the representation

φ(B)x_t = θ(B)w_t   (5.67)

and

φ_η(B)η_t = θ_η(B)z_t,   (5.68)

where w_t and z_t are independent white noise processes with variances σ_w^2 and σ_z^2, respectively. Box and Jenkins (1970) proposed that systematic patterns


often observed in the coefficients α_j, for j = 1, 2, . . ., could often be expressed as a ratio of polynomials involving a small number of coefficients, along with a specified delay, d, so

α(B) = δ(B)B^d / ω(B),   (5.69)

where

ω(B) = 1 − ω_1 B − ω_2 B^2 − · · · − ω_r B^r   (5.70)

and

δ(B) = δ_0 + δ_1 B + · · · + δ_s B^s   (5.71)

are the indicated operators; in this section, we find it convenient to represent the inverse of an operator, say, [ω(B)]^{-1}, as 1/ω(B).

Determining a parsimonious model involving a simple form for α(B) and estimating all of the parameters in the above model are the main tasks in the transfer function methodology. Because of the large number of parameters, it is necessary to develop a sequential methodology. Suppose we focus first on finding the ARIMA model for the input x_t and apply this operator to both sides of (5.65), obtaining the new model

ỹ_t = [φ(B)/θ(B)] y_t = α(B)w_t + [φ(B)/θ(B)] η_t = α(B)w_t + η̃_t,

where w_t and the transformed noise η̃_t are independent.

The series w_t is a prewhitened version of the input series, and its cross-correlation with the transformed output series ỹ_t will be just

γ_{ỹw}(h) = E[ỹ_{t+h} w_t] = E[ ∑_{j=0}^∞ α_j w_{t+h−j} w_t ] = σ_w^2 α_h,   (5.72)

because the autocovariance function of white noise will be zero except when j = h in (5.72). Hence, computing the cross-correlation between the prewhitened input series and the transformed output series should yield a rough estimate of the behavior of α(B).

Example 5.8 Relating the Prewhitened SOI to the TransformedRecruitment Series

We give a simple example of the suggested procedure for the SOI and the Recruitment series. Figure 5.10 shows the sample ACF and PACF of the detrended SOI, and it is clear, from the PACF, that an autoregressive series with p = 1 will do a reasonable job. Fitting the series gave φ̂ = .588 with σ̂_w^2 = .092, and we applied the operator (1 − .588B) to both x_t and y_t and computed the cross-correlation function, which is shown in Figure 5.11. Noting the apparent shift of d = 5 months and the decrease thereafter, it seems plausible to hypothesize a model of the form


Fig. 5.10. Sample ACF and PACF of SOI.

α(B) = δ_0 B^5 (1 + ω_1 B + ω_1^2 B^2 + · · · ) = δ_0 B^5 / (1 − ω_1 B)

for the transfer function. In this case, we would expect ω_1 to be negative. The following R code was used for this example.

acf2(soi)
(fit = arima(soi, xreg=time(soi), order=c(1, 0, 0)))
Coefficients:     ar1  intercept  time(soi)
               0.5875    13.7507    -0.0069
s.e.           0.0379     6.1775     0.0031
sigma^2 estimated as 0.0918
ar1 = as.numeric(fit$coef[1])  # = 0.5875387
soi.pw = resid(fit)
rec.d = resid(lm(rec~time(rec), na.action=NULL))
rec.fil = filter(rec.d, filter=c(1, -ar1), method="conv", sides=1)
ccf(soi.pw, rec.fil, main="", ylab="CCF", na.action=na.omit)

In the code above, soi.pw is the prewhitened detrended SOI series, rec.d is the detrended Recruitment series, and rec.fil is the filtered, detrended Recruitment series. In the ccf calculation, na.action=na.omit is used because rec.fil[1] is NA.

In some cases, we may postulate the form of the separate components δ(B) and ω(B), so we might write the equation

y_t = [δ(B)B^d / ω(B)] x_t + η_t


Fig. 5.11. Sample CCF of the prewhitened, detrended SOI and the similarly transformed Recruitment series; negative lags indicate that SOI leads Recruitment.

as

ω(B)y_t = δ(B)B^d x_t + ω(B)η_t,

or in regression form

y_t = ∑_{k=1}^r ω_k y_{t−k} + ∑_{k=0}^s δ_k x_{t−d−k} + u_t,   (5.73)

where

u_t = ω(B)η_t.   (5.74)

The form of (5.73) suggests doing a regression on the lagged versions of both the input and output series to obtain β̂, the estimate of the (r + s + 1) × 1 regression vector

β = (ω_1, . . . , ω_r, δ_0, δ_1, . . . , δ_s)′.

The residuals from the regression above, say,

û_t = y_t − β̂′z_t,

where

z_t = (y_{t−1}, . . . , y_{t−r}, x_{t−d}, . . . , x_{t−d−s})′

denotes the usual vector of independent variables, could be used to approximate the best ARMA model for the noise process η_t, because we can compute an estimator for that process from (5.74), using û_t and ω̂(B), and applying the moving average operator to get η̂_t. Fitting an ARMA(p_η, q_η) model to this estimated noise then completes the specification. The preceding suggests the following sequential procedure for fitting the transfer function model to data.


(i) Fit an ARMA model to the input series x_t to estimate the parameters φ_1, . . . , φ_p, θ_1, . . . , θ_q, σ_w^2 in the specification (5.67). Retain the ARMA coefficients for use in step (ii) and the fitted residuals ŵ_t for use in step (iii).
(ii) Apply the operator determined in step (i), that is,

ỹ_t = [φ̂(B)/θ̂(B)] y_t,

to determine the transformed output series ỹ_t.
(iii) Use the cross-correlation function between ỹ_t and ŵ_t in steps (i) and (ii) to suggest a form for the components of the polynomial

α(B) = δ(B)B^d / ω(B)

and the estimated time delay d.
(iv) Obtain β̂ = (ω̂_1, . . . , ω̂_r, δ̂_0, δ̂_1, . . . , δ̂_s)′ by fitting a linear regression of the form (5.73). Retain the residuals û_t for use in step (v).
(v) Apply the moving average transformation (5.74) to the residuals û_t to find the noise series η̂_t, and fit an ARMA model to the noise, obtaining the estimated coefficients in φ̂_η(B) and θ̂_η(B).

The above procedure is fairly reasonable, but does not have any recognizable overall optimality. Simultaneous least squares estimation, based on the observed x_t and y_t, can be accomplished by noting that the transfer function model can be written as

y_t = [δ(B)B^d / ω(B)] x_t + [θ_η(B) / φ_η(B)] z_t,

which can be put in the form

ω(B)φ_η(B)y_t = φ_η(B)δ(B)B^d x_t + ω(B)θ_η(B)z_t,   (5.75)

and it is clear that we may use least squares to minimize ∑_t z_t^2, as in earlier sections. We may also express the transfer function in state-space form (see Brockwell and Davis, 1991, Chapter 12). It is often easier to fit a transfer function model in the spectral domain as presented in §4.10.

Example 5.9 Transfer Function Model for SOI and Recruitment

We illustrate the procedure for fitting a transfer function model of the form suggested in Example 5.8 to the detrended SOI series (x_t) and the detrended Recruitment series (y_t). The results reported here can be compared with the results obtained from the frequency domain approach used in Example 4.24. Note first that steps (i)–(iii) have already been applied to determine the ARMA model

(1 − .588B)x_t = w_t,


where σ̂_w^2 = .092. Using the model determined in Example 5.8, we run the regression

y_t = ω_1 y_{t−1} + δ_0 x_{t−5} + u_t,

yielding ω̂_1 = .853, δ̂_0 = −18.86, where the residuals satisfy

û_t = (1 − .853B)η̂_t.

This completes step (iv). To complete the specification, we apply the moving average operator above to estimate the original noise series η_t and fit a third-order autoregressive model, based on the ACF and PACF shown in Figure 5.12. We obtain

(1 − 1.426B + 1.30B^2 − .356B^3)η̂_t = ẑ_t,

with σ̂_z^2 = 50.85 as the estimated error variance.

The following R code was used for this example.

rec.d = resid(lm(rec~time(rec), na.action=NULL))
soi.d = resid(lm(soi~time(soi), na.action=NULL))
fish = ts.intersect(rec.d, rec.d1=lag(rec.d,-1), soi.d5=lag(soi,-5), dframe=TRUE)
summary(fish.fit <- lm(rec.d~0+rec.d1+soi.d5, data=fish))
          Estimate  Std Error  t value  Pr(>|t|)
rec.d1      0.8531     0.0144    59.23    <2e-16
soi.d5    -18.8627     1.0038   -18.79    <2e-16
Residual standard error: 8.02 on 446 degrees of freedom
om1 = as.numeric(fish.fit$coef[1])
eta.hat = filter(resid(fish.fit), filter=c(1,-om1), method="recur", sides=1)
acf2(eta.hat)
(eta.fit <- arima(eta.hat, order=c(3,0,0)))
Coefficients:     ar1      ar2     ar3  intercept
               1.4261  -1.3017  0.3557     1.6015
s.e.           0.0441   0.0517  0.0441     0.6483
sigma^2 estimated as 50.85

5.8 Multivariate ARMAX Models

To understand multivariate time series models and their capabilities, we first present an introduction to multivariate time series regression techniques. A useful extension of the basic univariate regression model presented in §2.2 is the case in which we have more than one output series, that is, multivariate regression analysis. Suppose, instead of a single output variable y_t, a collection of k output variables y_{t1}, y_{t2}, . . . , y_{tk} exist that are related to the inputs as


Fig. 5.12. ACF and PACF of the estimated noise η̂_t departures from the transfer function model.

y_{ti} = β_{i1}z_{t1} + β_{i2}z_{t2} + · · · + β_{ir}z_{tr} + w_{ti}   (5.76)

for each of the i = 1, 2, . . . , k output variables. We assume the w_{ti} variables are correlated over the variable identifier i, but are still independent over time. Formally, we assume cov{w_{si}, w_{tj}} = σ_{ij} for s = t and is zero otherwise. Then, writing (5.76) in matrix notation, with y_t = (y_{t1}, y_{t2}, . . . , y_{tk})′ being the vector of outputs, and B = {β_{ij}}, i = 1, . . . , k, j = 1, . . . , r, being a k × r matrix containing the regression coefficients, leads to the simple looking form

y_t = Bz_t + w_t.   (5.77)

Here, the k × 1 vector process w_t is assumed to be a collection of independent vectors with common covariance matrix E{w_t w_t′} = Σ_w, the k × k matrix containing the covariances σ_{ij}. The maximum likelihood estimator, under the assumption of normality, for the regression matrix in this case is

B̂ = Y′Z(Z′Z)^{-1},   (5.78)

where Z′ = [z_1, z_2, . . . , z_n] is as before and Y′ = [y_1, y_2, . . . , y_n]. The error covariance matrix Σ_w is estimated by

Σ̂_w = (1/(n − r)) ∑_{t=1}^n (y_t − B̂z_t)(y_t − B̂z_t)′.   (5.79)
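In R, the estimators (5.78) and (5.79) can be computed directly from an n × k output matrix Y and an n × r input matrix Z (hypothetical objects whose rows are y_t′ and z_t′):

B.hat <- t(Y) %*% Z %*% solve(t(Z) %*% Z)      # (5.78), a k x r matrix
E <- Y - Z %*% t(B.hat)                        # residual rows y_t' - (B.hat z_t)'
Sigma.hat <- t(E) %*% E / (nrow(Y) - ncol(Z))  # (5.79)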

The uncertainty in the estimators can be evaluated from


se(β̂_{ij}) = √(c_{ii} σ̂_{jj}),   (5.80)

for i = 1, . . . , r, j = 1, . . . , k, where se denotes estimated standard error, σ̂_{jj} is the j-th diagonal element of Σ̂_w, and c_{ii} is the i-th diagonal element of (∑_{t=1}^n z_t z_t′)^{-1}.

Also, the information theoretic criterion changes to

AIC = ln|Σ̂_w| + (2/n)[ kr + k(k + 1)/2 ],   (5.81)

and BIC replaces the second term in (5.81) by K ln n/n where K = kr + k(k + 1)/2. Bedrick and Tsai (1994) have given a corrected form for AIC in the multivariate case as

AICc = ln|Σ̂_w| + k(r + n)/(n − k − r − 1).   (5.82)
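Continuing with the same hypothetical objects Y, Z, and Sigma.hat, the criteria (5.81), its BIC counterpart, and (5.82) are simple functions of Σ̂_w:

n <- nrow(Y); k <- ncol(Y); r <- ncol(Z)
K <- k*r + k*(k + 1)/2
AIC  <- log(det(Sigma.hat)) + 2*K/n                       # (5.81)
BIC  <- log(det(Sigma.hat)) + K*log(n)/n                  # second term of (5.81) replaced
AICc <- log(det(Sigma.hat)) + k*(r + n)/(n - k - r - 1)   # (5.82)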

Many data sets involve more than one time series, and we are often interested in the possible dynamics relating all series. In this situation, we are interested in modeling and forecasting k × 1 vector-valued time series x_t = (x_{t1}, . . . , x_{tk})′, t = 0, ±1, ±2, . . .. Unfortunately, extending univariate ARMA models to the multivariate case is not so simple. The multivariate autoregressive model, however, is a straightforward extension of the univariate AR model.

For the first-order vector autoregressive model, VAR(1), we take

x_t = α + Φx_{t−1} + w_t,   (5.83)

where Φ is a k × k transition matrix that expresses the dependence of x_t on x_{t−1}. The vector white noise process w_t is assumed to be multivariate normal with mean zero and covariance matrix

E(w_t w_t′) = Σ_w.   (5.84)

The vector α = (α_1, α_2, . . . , α_k)′ appears as the constant in the regression setting. If E(x_t) = µ, then α = (I − Φ)µ.

Note the similarity between the VAR model and the multivariate linear regression model (5.77). The regression formulas carry over, and we can, on observing x_1, . . . , x_n, set up the model (5.83) with y_t = x_t, B = (α, Φ) and z_t = (1, x_{t−1}′)′. Then, write the solution as (5.78) with the conditional maximum likelihood estimator for the covariance matrix given by

Σ̂_w = (n − 1)^{-1} ∑_{t=2}^n (x_t − α̂ − Φ̂x_{t−1})(x_t − α̂ − Φ̂x_{t−1})′.   (5.85)

The special form assumed for the constant component, α, of the vector AR model in (5.83) can be generalized to include a fixed r × 1 vector of inputs, u_t. That is, we could have proposed the vector ARX model,


x_t = Γu_t + ∑_{j=1}^p Φ_j x_{t−j} + w_t,   (5.86)

where Γ is a k × r parameter matrix. The X in ARX refers to the exogenous vector process we have denoted here by u_t. The introduction of exogenous variables through replacing α by Γu_t does not present any special problems in making inferences, and we will often drop the X for being superfluous.

Example 5.10 Pollution, Weather, and Mortality

For example, for the three-dimensional series composed of cardiovascular mortality x_{t1}, temperature x_{t2}, and particulate levels x_{t3}, introduced in Example 2.2, take x_t = (x_{t1}, x_{t2}, x_{t3})′ as a vector of dimension k = 3. We might envision dynamic relations among the three series defined as the first order relation,

x_{t1} = α_1 + β_1 t + φ_{11}x_{t−1,1} + φ_{12}x_{t−1,2} + φ_{13}x_{t−1,3} + w_{t1},

which expresses the current value of mortality as a linear combination of trend and its immediate past value and the past values of temperature and particulate levels. Similarly,

x_{t2} = α_2 + β_2 t + φ_{21}x_{t−1,1} + φ_{22}x_{t−1,2} + φ_{23}x_{t−1,3} + w_{t2}

and

x_{t3} = α_3 + β_3 t + φ_{31}x_{t−1,1} + φ_{32}x_{t−1,2} + φ_{33}x_{t−1,3} + w_{t3}

express the dependence of temperature and particulate levels on the other series. Of course, methods for the preliminary identification of these models exist, and we will discuss these methods shortly. The model in the form of (5.86) is

x_t = Γu_t + Φx_{t−1} + w_t,

where, in obvious notation, Γ = [α | β] is 3 × 2 and u_t = (1, t)′ is 2 × 1.

Throughout much of this section we will use the R package vars to fit vector AR models via least squares. For this particular example, we have (partial output shown):

library(vars)
x = cbind(cmort, tempr, part)
summary(VAR(x, p=1, type="both"))  # "both" fits constant + trend

Estimation results for equation cmort: (other equations not shown)
cmort = cmort.l1 + tempr.l1 + part.l1 + const + trend
            Estimate  Std. Error  t value  Pr(>|t|)
cmort.l1    0.464824    0.036729   12.656   < 2e-16 ***
tempr.l1   -0.360888    0.032188  -11.212   < 2e-16 ***
part.l1     0.099415    0.019178    5.184  3.16e-07 ***
const      73.227292    4.834004   15.148   < 2e-16 ***
trend      -0.014459    0.001978   -7.308  1.07e-12 ***


Residual standard error: 5.583 on 502 degrees of freedom

Multiple R-Squared: 0.6908, Adjusted R-squared: 0.6883

F-statistic: 280.3 on 4 and 502 DF, p-value: < 2.2e-16

Covariance matrix of residuals:

cmort tempr part

cmort 31.172 5.975 16.65

tempr 5.975 40.965 42.32

part 16.654 42.323 144.26

Note that t here is time(cmort), which is 1970 + ∆(t − 1), where ∆ = 1/52, for t = 1, . . . , 508, ending at 1979.75. For this particular case, we obtain

α̂ = (73.23, 67.59, 67.46)′,   β̂ = (−0.014, −0.007, −0.005)′,

Φ̂ = (  .46 (.04)   −.36 (.03)    .10 (.02)
       −.24 (.04)    .49 (.04)   −.13 (.02)
       −.12 (.08)   −.48 (.07)    .58 (.04) ),

where the standard errors, computed as in (5.80), are given in parentheses. The estimate of Σ_w is seen to be

Σ̂_w = (  31.17    5.98    16.65
           5.98   40.965   42.32
          16.65   42.32   144.26 ).

For the vector (x_{t1}, x_{t2}, x_{t3}) = (M_t, T_t, P_t), with M_t, T_t and P_t denoting mortality, temperature, and particulate level, respectively, we obtain the prediction equation for mortality,

M̂_t = 73.23 − .014 t + .46 M_{t−1} − .36 T_{t−1} + .10 P_{t−1}.

Comparing observed and predicted mortality with this model leads to an R² of about .69.
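A quick way to check the quoted R² of about .69 from the fit itself is sketched below; it assumes the data matrix x and the vars package from the code above, and that the residual columns returned by resid() carry the series name cmort (column naming can vary across versions of the vars package, so treat this as an assumption).

fit1 = VAR(x, p = 1, type = "both")
r = resid(fit1)[, "cmort"]                   # one-step-ahead residuals for mortality
M = as.numeric(x[-1, "cmort"])               # observed mortality (first value lost to the lag)
1 - sum(r^2)/sum((M - mean(M))^2)            # should be roughly .69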

It is easy to extend the VAR(1) process to higher orders, VAR(p). To do this, we use the notation of (5.77) and write the vector of regressors as

z_t = (1, x_{t−1}′, x_{t−2}′, . . . , x_{t−p}′)′

and the regression matrix as B = (α, Φ_1, Φ_2, . . . , Φ_p). Then, this regression model can be written as

x_t = α + Σ_{j=1}^{p} Φ_j x_{t−j} + w_t   (5.87)

for t = p + 1, . . . , n. The k × k error sum of products matrix becomes

SSE = Σ_{t=p+1}^{n} (x_t − B̂ z_t)(x_t − B̂ z_t)′,   (5.88)

so that the conditional maximum likelihood estimator for the error covariance matrix Σ_w is

Σ̂_w = SSE/(n − p),   (5.89)

as in the multivariate regression case, except now only n − p residuals exist in (5.88). For the multivariate case, we have found that the Schwarz criterion

BIC = log |Σ̂_w| + k² p ln n / n,   (5.90)

gives more reasonable classifications than either AIC or the corrected version AICc. The result is consistent with those reported in simulations by Lütkepohl (1985).
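To make (5.89) and (5.90) concrete, here is a small sketch that computes the BIC of (5.90) by hand for several VAR orders, assuming the data matrix x and the vars package from the examples in this section; the values will track, but not exactly equal, the SC(n) column of VARselect below, since the package's criteria also account for the constant and trend terms.

k = 3; n = nrow(x)
for (p in 1:5) {
  fit = VAR(x, p = p, type = "both")
  Sw  = crossprod(resid(fit))/(n - p)        # residual covariance as in (5.89)
  cat("p =", p, " BIC =", log(det(Sw)) + k^2*p*log(n)/n, "\n")
}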

Example 5.11 Pollution, Weather, and Mortality (cont)

We used the R package first to select a VAR(p) model and then fit the model. The selection criteria used in the package are AIC, Hannan–Quinn (HQ; Hannan & Quinn, 1979), BIC (SC), and Final Prediction Error (FPE). The Hannan–Quinn procedure is similar to BIC, but with ln n replaced by 2 ln(ln(n)) in the penalty term. FPE finds the model that minimizes the approximate mean squared one-step-ahead prediction error (see Akaike, 1969 for details); it is rarely used.

1 VARselect(x, lag.max=10, type="both")

$selection

AIC(n) HQ(n) SC(n) FPE(n)

9 5 2 9

$criteria

1 2 3 4

AIC(n) 11.73780 11.30185 11.26788 11.23030

HQ(n) 11.78758 11.38149 11.37738 11.36967

SC(n) 11.86463 11.50477 11.54689 11.58541

FPE(n) 125216.91717 80972.28678 78268.19568 75383.73647

5 6 7 8

AIC(n) 11.17634 11.15266 11.15247 11.12878

HQ(n) 11.34557 11.35176 11.38144 11.38760

SC(n) 11.60755 11.65996 11.73587 11.78827

FPE(n) 71426.10041 69758.25113 69749.89175 68122.40518

9 10

AIC(n) 11.11915 11.12019

HQ(n) 11.40784 11.43874

SC(n) 11.85473 11.93187

FPE(n) 67476.96374 67556.45243

Note that BIC picks the order p = 2 model while AIC and FPE pick an order p = 9 model and Hannan–Quinn selects an order p = 5 model.

Fitting the model selected by BIC we obtain

α̂ = (56.1, 49.9, 59.6)′,   β̂ = (−0.011, −0.005, −0.008)′,

Φ̂_1 = (  .30 (.04)   −.20 (.04)    .04 (.02)
         −.11 (.05)    .26 (.05)   −.05 (.03)
          .08 (.09)   −.39 (.09)    .39 (.05) ),

Φ̂_2 = (  .28 (.04)   −.08 (.03)    .07 (.03)
         −.04 (.05)    .36 (.05)   −.10 (.03)
         −.33 (.09)    .05 (.09)    .38 (.05) ),

where the standard errors are given in parentheses. The estimate of Σ_w is

Σ̂_w = (  28.03    7.08   16.33
           7.08   37.63   40.88
          16.33   40.88  123.45 ).

To fit the model using the vars package use the following line (partial results displayed):

2 summary(fit <- VAR(x, p=2, type="both"))

cmort = cmort.l1 + tempr.l1 + part.l1 + cmort.l2 + tempr.l2

+ part.l2 + const + trend

Estimate Std. Error t value Pr(>|t|)

cmort.l1 0.297059 0.043734 6.792 3.15e-11 ***

tempr.l1 -0.199510 0.044274 -4.506 8.23e-06 ***

part.l1 0.042523 0.024034 1.769 0.07745 .

cmort.l2 0.276194 0.041938 6.586 1.15e-10 ***

tempr.l2 -0.079337 0.044679 -1.776 0.07639 .

part.l2 0.068082 0.025286 2.692 0.00733 **

const 56.098652 5.916618 9.482 < 2e-16 ***

trend -0.011042 0.001992 -5.543 4.84e-08 ***

Covariance matrix of residuals:

cmort tempr part

cmort 28.034 7.076 16.33

tempr 7.076 37.627 40.88

part 16.325 40.880 123.45

Using the notation of the previous example, the prediction model for cardiovascular mortality is estimated to be

M̂_t = 56 − .01 t + .3 M_{t−1} − .2 T_{t−1} + .04 P_{t−1} + .28 M_{t−2} − .08 T_{t−2} + .07 P_{t−2}.

To examine the residuals, we can plot the cross-correlations of the residuals and examine the multivariate version of the Q-test as follows:

3 acf(resid(fit), 52)

4 serial.test(fit, lags.pt=12, type="PT.adjusted")

Portmanteau Test (adjusted)

data: Residuals of VAR object fit

Chi-squared = 162.3502, df = 90, p-value = 4.602e-06


Fig. 5.13. ACFs (diagonals) and CCFs (off-diagonals) for the residuals of the three-dimensional VAR(2) fit to the LA mortality – pollution data set. On the off-diagonals, the second-named series is the one that leads.

The cross-correlation matrix is shown in Figure 5.13. The figure shows the ACFs of the individual residual series along the diagonal. For example, the first diagonal graph is the ACF of M_t − M̂_t, and so on. The off-diagonals display the CCFs between pairs of residual series. If the title of the off-diagonal plot is x & y, then y leads in the graphic; that is, on the upper-diagonal, the plot shows corr[x(t+Lag), y(t)], whereas on the lower-diagonal, if the title is x & y, you get a plot of corr[x(t+Lag), y(t)] (yes, it is the same thing, but the lags are negative in the lower diagonal). The graphic is labeled in a strange way; just remember that the second-named series is the one that leads. In Figure 5.13 we notice that most of the correlations in the residual series are negligible; however, the zero-order correlation of mortality with temperature residuals is about .22 and that of mortality with particulate residuals is about .28 (type acf(resid(fit),52)$acf to see the actual values). This means that the AR model is not capturing the concurrent effect of temperature and


pollution on mortality (recall the data evolves over a week). It is possible to fit simultaneous models; see Reinsel (1997) for further details. Thus, not unexpectedly, the Q-test rejects the null hypothesis that the noise is white. The Q-test statistic is given by

Q = n² Σ_{h=1}^{H} (1/(n − h)) tr[ Γ̂_w(h) Γ̂_w(0)^{−1} Γ̂_w(h) Γ̂_w(0)^{−1} ],   (5.91)

where

Γ̂_w(h) = n^{−1} Σ_{t=1}^{n−h} ŵ_{t+h} ŵ_t′,

and ŵ_t is the residual process. Under the null that w_t is white noise, (5.91) has an asymptotic χ² distribution with k²(H − p) degrees of freedom.
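For readers who want to see where the serial.test value comes from, the following sketch computes (5.91) directly from the VAR(2) residuals of this example (assuming the object fit from the code above). It is a direct transcription of the display above; implementations such as serial.test may differ in small details (for instance, a transpose on the first autocovariance factor in the trace and the exact sample-size conventions), so expect a similar but not identical value.

w = resid(fit)                        # VAR(2) residuals, (n - p) x k
n = nrow(w); k = ncol(w); H = 12; p = 2
G0inv = solve(crossprod(w)/n)         # Gamma_w(0)^{-1}
Q = 0
for (h in 1:H) {
  Gh = crossprod(w[(h+1):n, ], w[1:(n-h), ])/n    # Gamma_w(h)
  Q  = Q + sum(diag(Gh %*% G0inv %*% Gh %*% G0inv))/(n - h)
}
Q = n^2 * Q
pchisq(Q, df = k^2*(H - p), lower.tail = FALSE)   # approximate p-value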

Finally, prediction follows in a straightforward manner from the univariate case. Using the R package vars, use the predict command and the fanchart command, which produces a nice graphic:

5 (fit.pr = predict(fit, n.ahead = 24, ci = 0.95)) # 4 weeks ahead

6 fanchart(fit.pr) # plot prediction + error

The results are displayed in Figure 5.14; we note that the package stripped time when plotting the fanchart and the horizontal axis is labeled 1, 2, 3, . . . .

For pure VAR(p) models, the autocovariance structure leads to the multivariate version of the Yule–Walker equations:

Γ(h) = Σ_{j=1}^{p} Φ_j Γ(h − j),   h = 1, 2, . . . ,   (5.92)

Γ(0) = Σ_{j=1}^{p} Φ_j Γ(−j) + Σ_w,   (5.93)

where Γ(h) = cov(x_{t+h}, x_t) is a k × k matrix and Γ(−h) = Γ(h)′.

Estimation of the autocovariance matrix is similar to the univariate case, that is, with x̄ = n^{−1} Σ_{t=1}^{n} x_t as an estimate of µ = E x_t,

Γ̂(h) = n^{−1} Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)′,   h = 0, 1, 2, . . . , n − 1,   (5.94)

and Γ̂(−h) = Γ̂(h)′. If γ̂_{i,j}(h) denotes the element in the i-th row and j-th column of Γ̂(h), the cross-correlation functions (CCF), as discussed in (1.35), are estimated by

ρ̂_{i,j}(h) = γ̂_{i,j}(h) / ( √γ̂_{i,i}(0) √γ̂_{j,j}(0) ),   h = 0, 1, 2, . . . , n − 1.   (5.95)


Fig. 5.14. Predictions from a VAR(2) fit to the LA mortality – pollution data.

When i = j in (5.95), we get the estimated autocorrelation function (ACF) of the individual series.
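In R, the sample autocovariance matrices (5.94) and cross-correlations (5.95) are available directly from acf() applied to a multivariate series; a minimal sketch using the data matrix x from the earlier examples (see ?acf for the exact lead–lag convention used in the cross terms):

G = acf(x, lag.max = 10, type = "covariance", plot = FALSE)$acf   # array: (lag+1) x k x k
G[1, , ]      # Gamma-hat(0)
G[2, , ]      # Gamma-hat(1)
rho = acf(x, lag.max = 10, plot = FALSE)$acf                      # cross-correlations as in (5.95)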

Although least squares estimation was used in Examples 5.10 and 5.11, we could have also used Yule–Walker estimation, conditional or unconditional maximum likelihood estimation. As in the univariate case, the Yule–Walker estimators, the maximum likelihood estimators, and the least squares estimators are asymptotically equivalent. To exhibit the asymptotic distribution of the autoregression parameter estimators, we write

φ = vec(Φ_1, . . . , Φ_p),

where the vec operator stacks the columns of a matrix into a vector. For example, for a bivariate AR(2) model,

φ = vec(Φ_1, Φ_2) = (Φ1_{11}, Φ1_{21}, Φ1_{12}, Φ1_{22}, Φ2_{11}, Φ2_{21}, Φ2_{12}, Φ2_{22})′,

where Φℓ_{ij} is the ij-th element of Φ_ℓ, ℓ = 1, 2. Because (Φ_1, . . . , Φ_p) is a k × kp matrix, φ is a k²p × 1 vector. We now state the following property.


Property 5.1 Large-Sample Distribution of the Vector Autoregression Estimators
Let φ̂ denote the vector of parameter estimators (obtained via Yule–Walker, least squares, or maximum likelihood) for a k-dimensional AR(p) model. Then,

√n (φ̂ − φ) ∼ AN(0, Σ_w ⊗ Γ_{pp}^{−1}),   (5.96)

where Γ_{pp} = {Γ(i − j)}_{i,j=1}^{p} is a kp × kp matrix and Σ_w ⊗ Γ_{pp}^{−1} = {σ_{ij} Γ_{pp}^{−1}}_{i,j=1}^{k} is a k²p × k²p matrix with σ_{ij} denoting the ij-th element of Σ_w.

The variance–covariance matrix of the estimator φ̂ is approximated by replacing Σ_w by Σ̂_w, and replacing Γ(h) by Γ̂(h) in Γ_{pp}. The square root of the diagonal elements of Σ̂_w ⊗ Γ̂_{pp}^{−1} divided by √n gives the individual standard errors. For the mortality data example, the estimated standard errors for the VAR(2) fit are listed in Example 5.11; although those standard errors were taken from a regression run, they could have also been calculated using Property 5.1.

A k × 1 vector-valued time series x_t, for t = 0, ±1, ±2, . . ., is said to be VARMA(p, q) if x_t is stationary and

x_t = α + Φ_1 x_{t−1} + · · · + Φ_p x_{t−p} + w_t + Θ_1 w_{t−1} + · · · + Θ_q w_{t−q},   (5.97)

with Φ_p ≠ 0, Θ_q ≠ 0, and Σ_w > 0 (that is, Σ_w is positive definite). The coefficient matrices Φ_j, j = 1, . . . , p, and Θ_j, j = 1, . . . , q, are, of course, k × k matrices. If x_t has mean µ then α = (I − Φ_1 − · · · − Φ_p)µ. As in the univariate case, we will have to place a number of conditions on the multivariate ARMA model to ensure the model is unique and has desirable properties such as causality. These conditions will be discussed shortly.

As in the VAR model, the special form assumed for the constant component can be generalized to include a fixed r × 1 vector of inputs, u_t. That is, we could have proposed the vector ARMAX model,

x_t = Γ u_t + Σ_{j=1}^{p} Φ_j x_{t−j} + Σ_{k=1}^{q} Θ_k w_{t−k} + w_t,   (5.98)

where Γ is a k × r parameter matrix.

While extending univariate AR (or pure MA) models to the vector case is fairly easy, extending univariate ARMA models to the multivariate case is not a simple matter. Our discussion will be brief, but interested readers can get more details in Lütkepohl (1993), Reinsel (1997), and Tiao and Tsay (1989).

In the multivariate case, the autoregressive operator is

Φ(B) = I − Φ_1 B − · · · − Φ_p B^p,   (5.99)

and the moving average operator is


Θ(B) = I + Θ_1 B + · · · + Θ_q B^q.   (5.100)

The zero-mean VARMA(p, q) model is then written in the concise form as

Φ(B) x_t = Θ(B) w_t.   (5.101)

The model is said to be causal if the roots of |Φ(z)| (where | · | denotes determinant) are outside the unit circle, |z| > 1; that is, |Φ(z)| ≠ 0 for any value z such that |z| ≤ 1. In this case, we can write

x_t = Ψ(B) w_t,

where Ψ(B) = Σ_{j=0}^{∞} Ψ_j B^j, Ψ_0 = I, and Σ_{j=0}^{∞} ||Ψ_j|| < ∞. The model is said to be invertible if the roots of |Θ(z)| lie outside the unit circle. Then, we can write

w_t = Π(B) x_t,

where Π(B) = Σ_{j=0}^{∞} Π_j B^j, Π_0 = I, and Σ_{j=0}^{∞} ||Π_j|| < ∞. Analogous to the univariate case, we can determine the matrices Ψ_j by solving Ψ(z) = Φ(z)^{−1} Θ(z), |z| ≤ 1, and the matrices Π_j by solving Π(z) = Θ(z)^{−1} Φ(z), |z| ≤ 1.
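Numerically, the Ψ_j may be obtained by matching coefficients in Φ(z)Ψ(z) = Θ(z), which gives the recursion Ψ_j = Θ_j + Σ_{i=1}^{min(j,p)} Φ_i Ψ_{j−i} with Θ_j = 0 for j > q. A small R sketch of this recursion (the function name psi.varma and its argument layout are ours, purely for illustration, not part of any package):

psi.varma <- function(Phi, Theta, J = 10) {
  # Phi: list of p (k x k) AR matrices; Theta: list of q (k x k) MA matrices
  k <- nrow(Phi[[1]]); p <- length(Phi); q <- length(Theta)
  Psi <- vector("list", J + 1); Psi[[1]] <- diag(k)          # Psi_0 = I
  for (j in 1:J) {
    Pj <- if (j <= q) Theta[[j]] else matrix(0, k, k)        # Theta_j (zero for j > q)
    for (i in 1:min(j, p)) Pj <- Pj + Phi[[i]] %*% Psi[[j - i + 1]]
    Psi[[j + 1]] <- Pj
  }
  Psi                                                        # list: Psi_0, ..., Psi_J
}
# example: the bivariate AR(1) of (5.104) with phi = .8 (MA part given as a zero matrix)
psi.varma(list(matrix(c(0, 0, .8, 0), 2, 2)), list(matrix(0, 2, 2)), J = 3)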

For a causal model, we can write x_t = Ψ(B) w_t so the general autocovariance structure of an ARMA(p, q) model is

Γ(h) = cov(x_{t+h}, x_t) = Σ_{j=0}^{∞} Ψ_{j+h} Σ_w Ψ_j′,   (5.102)

and Γ(−h) = Γ(h)′. For pure MA(q) processes, (5.102) becomes

Γ(h) = Σ_{j=0}^{q−h} Θ_{j+h} Σ_w Θ_j′,   (5.103)

where Θ_0 = I. Of course, (5.103) implies Γ(h) = 0 for h > q.

As in the univariate case, we will need conditions for model uniqueness.

These conditions are similar to the condition in the univariate case that the autoregressive and moving average polynomials have no common factors. To explore the uniqueness problems that we encounter with multivariate ARMA models, consider a bivariate AR(1) process, x_t = (x_{t,1}, x_{t,2})′, given by

x_{t,1} = φ x_{t−1,2} + w_{t,1},
x_{t,2} = w_{t,2},

where w_{t,1} and w_{t,2} are independent white noise processes and |φ| < 1. Both processes, x_{t,1} and x_{t,2}, are causal and invertible. Moreover, the processes are jointly stationary because cov(x_{t+h,1}, x_{t,2}) = φ cov(x_{t+h−1,2}, x_{t,2}) ≡ φ γ_{2,2}(h − 1) = φ σ²_{w2} δ_h^1 does not depend on t; note, δ_h^1 = 1 when h = 1, otherwise, δ_h^1 = 0. In matrix notation, we can write this model as


x_t = Φ x_{t−1} + w_t,   where   Φ = [ 0  φ ; 0  0 ].   (5.104)

We can write (5.104) in operator notation as

Φ(B) x_t = w_t   where   Φ(z) = [ 1  −φz ; 0  1 ].

In addition, model (5.104) can be written as a bivariate ARMA(1,1) model

x_t = Φ_1 x_{t−1} + Θ_1 w_{t−1} + w_t,   (5.105)

where

Φ_1 = [ 0  φ + θ ; 0  0 ]   and   Θ_1 = [ 0  −θ ; 0  0 ],

and θ is arbitrary. To verify this, we write (5.105) as Φ_1(B) x_t = Θ_1(B) w_t, or

Θ_1(B)^{−1} Φ_1(B) x_t = w_t,

where

Φ_1(z) = [ 1  −(φ + θ)z ; 0  1 ]   and   Θ_1(z) = [ 1  −θz ; 0  1 ].

Then,

Θ_1(z)^{−1} Φ_1(z) = [ 1  θz ; 0  1 ] [ 1  −(φ + θ)z ; 0  1 ] = [ 1  −φz ; 0  1 ] = Φ(z),

where Φ(z) is the polynomial associated with the bivariate AR(1) model in (5.104). Because θ is arbitrary, the parameters of the ARMA(1,1) model given in (5.105) are not identifiable. No problem exists, however, in fitting the AR(1) model given in (5.104).

The problem in the previous discussion was caused by the fact that both Θ(B) and Θ(B)^{−1} are finite; such a matrix operator is called unimodular. If U(B) is unimodular, |U(z)| is constant. It is also possible for two seemingly different multivariate ARMA(p, q) models, say, Φ(B) x_t = Θ(B) w_t and Φ*(B) x_t = Θ*(B) w_t, to be related through a unimodular operator, U(B), as Φ*(B) = U(B) Φ(B) and Θ*(B) = U(B) Θ(B), in such a way that the orders of Φ(B) and Θ(B) are the same as the orders of Φ*(B) and Θ*(B), respectively. For example, consider the bivariate ARMA(1,1) models given by

Φ(B) x_t ≡ [ 1  −φB ; 0  1 ] x_t = [ 1  θB ; 0  1 ] w_t ≡ Θ(B) w_t

and

Φ*(B) x_t ≡ [ 1  (α − φ)B ; 0  1 ] x_t = [ 1  (α + θ)B ; 0  1 ] w_t ≡ Θ*(B) w_t,


where α, φ, and θ are arbitrary constants. Note,

Φ*(B) ≡ [ 1  (α − φ)B ; 0  1 ] = [ 1  αB ; 0  1 ] [ 1  −φB ; 0  1 ] ≡ U(B) Φ(B)

and

Θ*(B) ≡ [ 1  (α + θ)B ; 0  1 ] = [ 1  αB ; 0  1 ] [ 1  θB ; 0  1 ] ≡ U(B) Θ(B).

In this case, both models have the same infinite MA representation x_t = Ψ(B) w_t, where

Ψ(B) = Φ(B)^{−1} Θ(B) = Φ(B)^{−1} U(B)^{−1} U(B) Θ(B) = Φ*(B)^{−1} Θ*(B).

This result implies the two models have the same autocovariance function Γ(h). Two such ARMA(p, q) models are said to be observationally equivalent.

As previously mentioned, in addition to requiring causality and invertibility, we will need some additional assumptions in the multivariate case to make sure that the model is unique. To ensure the identifiability of the parameters of the multivariate ARMA(p, q) model, we need the following two additional conditions: (i) the matrix operators Φ(B) and Θ(B) have no common left factors other than unimodular ones [that is, if Φ(B) = U(B)Φ*(B) and Θ(B) = U(B)Θ*(B), the common factor must be unimodular] and (ii) with q as small as possible and p as small as possible for that q, the matrix [Φ_p, Θ_q] must be of full rank, k. One suggestion for avoiding most of the aforementioned problems is to fit only vector AR(p) models in multivariate situations. Although this suggestion might be reasonable for many situations, this philosophy is not in accordance with the law of parsimony because we might have to fit a large number of parameters to describe the dynamics of a process.

Asymptotic inference for the general case of vector ARMA models is more complicated than for pure AR models; details can be found in Reinsel (1997) or Lütkepohl (1993), for example. We also note that estimation for VARMA models can be recast into the problem of estimation for state-space models that will be discussed in Chapter 6.

A simple algorithm for fitting vector ARMA models from Spliid (1983) is worth mentioning because it repeatedly uses the multivariate regression equations. Consider a general ARMA(p, q) model for a time series with a nonzero mean

x_t = α + Φ_1 x_{t−1} + · · · + Φ_p x_{t−p} + w_t + Θ_1 w_{t−1} + · · · + Θ_q w_{t−q}.   (5.106)

If µ = E x_t, then α = (I − Φ_1 − · · · − Φ_p)µ. If w_{t−1}, . . . , w_{t−q} were observed, we could rearrange (5.106) as a multivariate regression model

x_t = B z_t + w_t,   (5.107)

with

z_t = (1, x_{t−1}′, . . . , x_{t−p}′, w_{t−1}′, . . . , w_{t−q}′)′   (5.108)

and

B = [α, Φ_1, . . . , Φ_p, Θ_1, . . . , Θ_q],   (5.109)

for t = p + 1, . . . , n. Given an initial estimator B̂_0 of B, we can reconstruct {w_{t−1}, . . . , w_{t−q}} by setting

w_{t−j} = x_{t−j} − B̂_0 z_{t−j},   t = p + 1, . . . , n,   j = 1, . . . , q,   (5.110)

where, if q > p, we put w_{t−j} = 0 for t − j ≤ 0. The new values of {w_{t−1}, . . . , w_{t−q}} are then put into the regressors z_t and a new estimate, say, B̂_1, is obtained. The initial value, B̂_0, can be computed by fitting a pure autoregression of order p or higher, and taking Θ_1 = · · · = Θ_q = 0. The procedure is then iterated until the parameter estimates stabilize. The algorithm often converges, but not to the maximum likelihood estimators. Experience suggests the estimators can be reasonably close to the maximum likelihood estimators. The algorithm can be considered as a quick and easy way to fit an initial VARMA model as a starting point to using maximum likelihood estimation, which is best done via state-space models covered in the next chapter.
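A minimal sketch of this iteration for a VARMA(1,1), written only to make the algorithm concrete (the function name spliid11 is ours; x is an n × k data matrix):

spliid11 <- function(x, n.iter = 20) {
  n <- nrow(x); k <- ncol(x)
  # step 0: pure AR(1) fit gives initial residuals (Theta_1 = 0)
  w <- rbind(matrix(0, 1, k), resid(lm(x[2:n, ] ~ x[1:(n-1), ])))
  for (i in 1:n.iter) {                                  # iterate the regression step
    fit <- lm(x[2:n, ] ~ x[1:(n-1), ] + w[1:(n-1), ])    # x_t on x_{t-1} and current w_{t-1}
    w <- rbind(matrix(0, 1, k), resid(fit))              # rebuild the w's as in (5.110)
  }
  list(B = t(coef(fit)), Sigma.w = crossprod(resid(fit))/(n - 1))
}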

Problems

Section 5.2

5.1 The data set arf is 1000 simulated observations from an ARFIMA(1, 1, 0) model with φ = .75 and d = .4.

(a) Plot the data and comment.
(b) Plot the ACF and PACF of the data and comment.
(c) Estimate the parameters and test for the significance of the estimates φ̂ and d̂.
(d) Explain why, using the results of parts (a) and (b), it would seem reasonable to difference the data prior to the analysis. That is, if x_t represents the data, explain why we might choose to fit an ARMA model to ∇x_t.
(e) Plot the ACF and PACF of ∇x_t and comment.
(f) Fit an ARMA model to ∇x_t and comment.

5.2 Compute the sample ACF of the absolute values of the NYSE returns displayed in Figure 1.4 up to lag 200, and comment on whether the ACF indicates long memory. Fit an ARFIMA model to the absolute values and comment.


Section 5.3

5.3 Plot the global temperature series, gtemp, and then test whether there is a unit root versus the alternative that the process is stationary using the three tests, DF, ADF, and PP, discussed in Example 5.3. Comment.

5.4 Plot the GNP series, gnp, and then test for a unit root against the alternative that the process is explosive. State your conclusion.

5.5 Verify (5.33).

Section 5.4

5.6 Investigate whether the quarterly growth rate of GNP exhibits ARCH behavior. If so, fit an appropriate model to the growth rate. The actual values are in gnp; also, see Example 3.38.

5.7 Weekly crude oil spot prices in dollars per barrel are in oil; see Problem 2.11 and Appendix R for more details. Investigate whether the growth rate of the weekly oil price exhibits GARCH behavior. If so, fit an appropriate model to the growth rate.

5.8 The stats package of R contains the daily closing prices of four major European stock indices; type help(EuStockMarkets) for details. Fit a GARCH model to the returns of one of these series and discuss your findings. (Note: The data set contains actual values, and not returns. Hence, the data must be transformed prior to the model fitting.)

5.9 The 2 × 1 gradient vector, l^{(1)}(α_0, α_1), given for an ARCH(1) model was displayed in (5.47). Verify (5.47) and then use the result to calculate the 2 × 2 Hessian matrix

l^{(2)}(α_0, α_1) = ( ∂²l/∂α_0²        ∂²l/∂α_0∂α_1
                      ∂²l/∂α_0∂α_1    ∂²l/∂α_1²    ).

Section 5.5

5.10 The sunspot data (sunspotz) are plotted in Chapter 4, Figure 4.31. From a time plot of the data, discuss why it is reasonable to fit a threshold model to the data, and then fit a threshold model.

Section 5.6

5.11 Let S_t represent the monthly sales data in sales (n = 150), and let L_t be the leading indicator in lead. Fit the regression model ∇S_t = β_0 + β_1 ∇L_{t−3} + x_t, where x_t is an ARMA process.


5.12 Consider the correlated regression model, defined in the text by (5.58), say,

y = Zβ + x,

where x has mean zero and covariance matrix Γ. In this case, we know that the weighted least squares estimator is (5.59), namely,

β̂_w = (Z′Γ^{−1}Z)^{−1} Z′Γ^{−1} y.

Now, a problem of interest in spatial series can be formulated in terms of this basic model. Suppose y_i = y(σ_i), i = 1, 2, . . . , n, is a function of the spatial vector coordinates σ_i = (s_{i1}, s_{i2}, . . . , s_{ir})′, the error is x_i = x(σ_i), and the rows of Z are defined as z(σ_i)′, i = 1, 2, . . . , n. The Kriging estimator is defined as the best spatial predictor of y_0 = z_0′β + x_0 using the estimator

ŷ_0 = a′y,

subject to the unbiased condition E ŷ_0 = E y_0, and such that the mean square prediction error

MSE = E[(ŷ_0 − y_0)²]

is minimized.

(a) Prove the estimator is unbiased when Z′a = z_0.
(b) Show the MSE is minimized by solving the equations

Γa + Zλ = γ_0   and   Z′a = z_0,

where γ_0 = E[x x_0] represents the vector of covariances between the error vector of the observed data and the error of the new point, and the vector λ is a q × 1 vector of Lagrangian multipliers.
(c) Show the predicted value can be expressed as

ŷ_0 = z_0′β̂_w + γ_0′Γ^{−1}(y − Z β̂_w),

so the optimal prediction is a linear combination of the usual predictor and the least squares residuals.

Section 5.7

5.13 The data in climhyd have 454 months of measured values for the climatic variables air temperature, dew point, cloud cover, wind speed, precipitation (p_t), and inflow (i_t), at Lake Shasta; the data are displayed in Figure 7.3. We would like to look at possible relations between the weather factors and the inflow to Lake Shasta.

(a) Fit ARIMA(0, 0, 0) × (0, 1, 1)_{12} models to (i) transformed precipitation P_t = √p_t and (ii) transformed inflow I_t = log i_t.
(b) Apply the ARIMA model fitted in part (a) for transformed precipitation to the flow series to generate the prewhitened flow residuals assuming the precipitation model. Compute the cross-correlation between the flow residuals using the precipitation ARIMA model and the precipitation residuals using the precipitation model and interpret. Use the coefficients from the ARIMA model to construct the transformed flow residuals.

5.14 For the climhyd data set, consider predicting the transformed flows I_t = log i_t from transformed precipitation values P_t = √p_t using a transfer function model of the form

(1 − B^{12}) I_t = α(B)(1 − B^{12}) P_t + n_t,

where we assume that seasonal differencing is a reasonable thing to do. You may think of it as fitting

y_t = α(B) x_t + n_t,

where y_t and x_t are the seasonally differenced transformed flows and precipitations.

(a) Argue that x_t can be fitted by a first-order seasonal moving average, and use the transformation obtained to prewhiten the series x_t.
(b) Apply the transformation applied in (a) to the series y_t, and compute the cross-correlation function relating the prewhitened series to the transformed series. Argue for a transfer function of the form

α(B) = δ_0 / (1 − ω_1 B).

(c) Write the overall model obtained in regression form to estimate δ_0 and ω_1. Note that you will be minimizing the sums of squared residuals for the transformed noise series (1 − ω_1 B)n_t. Retain the residuals for further modeling involving the noise n_t. The observed residual is u_t = (1 − ω_1 B)n_t.
(d) Fit the noise residuals obtained in (c) with an ARMA model, and give the final form suggested by your analysis in the previous parts.
(e) Discuss the problem of forecasting y_{t+m} using the infinite past of y_t and the present and infinite past of x_t. Determine the predicted value and the forecast variance.

Section 5.8

5.15 Consider the data set econ5 containing quarterly U.S. unemployment, GNP, consumption, and government and private investment from 1948-III to 1988-II. The seasonal component has been removed from the data. Concentrating on unemployment (U_t), GNP (G_t), and consumption (C_t), fit a vector ARMA model to the data after first logging each series, and then removing the linear trend. That is, fit a vector ARMA model to x_t = (x_{1t}, x_{2t}, x_{3t})′, where, for example, x_{1t} = log(U_t) − β̂_0 − β̂_1 t, where β̂_0 and β̂_1 are the least squares estimates for the regression of log(U_t) on time, t. Run a complete set of diagnostics on the residuals.


6

State-Space Models

6.1 Introduction

A very general model that seems to subsume a whole class of special cases of interest in much the same way that linear regression does is the state-space model or the dynamic linear model, which was introduced in Kalman (1960) and Kalman and Bucy (1961). Although the model was originally introduced as a method primarily for use in aerospace-related research, it has been applied to modeling data from economics (Harrison and Stevens, 1976; Harvey and Pierse, 1984; Harvey and Todd, 1983; Kitagawa and Gersch, 1984; Shumway and Stoffer, 1982), medicine (Jones, 1984) and the soil sciences (Shumway, 1988, §3.4.5). An excellent modern treatment of time series analysis based on the state space model is the text by Durbin and Koopman (2001).

The state-space model or dynamic linear model (DLM), in its basic form, employs an order one, vector autoregression as the state equation,

x_t = Φ x_{t−1} + w_t,   (6.1)

where the state equation determines the rule for the generation of the p × 1 state vector x_t from the past p × 1 state x_{t−1}, for time points t = 1, . . . , n. We assume the w_t are p × 1 independent and identically distributed, zero-mean normal vectors with covariance matrix Q. In the DLM, we assume the process starts with a normal vector x_0 that has mean µ_0 and p × p covariance matrix Σ_0.

The DLM, however, adds an additional component to the model in assuming we do not observe the state vector x_t directly, but only a linear transformed version of it with noise added, say

y_t = A_t x_t + v_t,   (6.2)

where A_t is a q × p measurement or observation matrix; equation (6.2) is called the observation equation. The model arose originally in the space tracking setting, where the state equation defines the motion equations for the position



Fig. 6.1. Longitudinal series of blood parameter levels monitored, log(white blood count) [WBC; top], log(platelet) [PLT; middle], and hematocrit [HCT; bottom], after a bone marrow transplant (n = 91 days).

or state of a spacecraft with location x_t and y_t reflects information that can be observed from a tracking device such as velocity and azimuth. The observed data vector, y_t, is q-dimensional, which can be larger than or smaller than p, the state dimension. The additive observation noise v_t is assumed to be white and Gaussian with q × q covariance matrix R. In addition, we initially assume, for simplicity, x_0, {w_t} and {v_t} are uncorrelated; this assumption is not necessary, but it helps in the explanation of first concepts. The case of correlated errors is discussed in §6.6.

As in the ARMAX model of §5.8, exogenous variables, or fixed inputs, may enter into the states or into the observations. In this case, we suppose we have an r × 1 vector of inputs u_t, and write the model as

x_t = Φ x_{t−1} + Υ u_t + w_t   (6.3)

y_t = A_t x_t + Γ u_t + v_t   (6.4)

where Υ is p × r and Γ is q × r.

Example 6.1 A Biomedical Example

Suppose we consider the problem of monitoring the level of several biomedical markers after a cancer patient undergoes a bone marrow transplant. The data in Figure 6.1, used by Jones (1984), are measurements made for 91 days on three variables, log(white blood count) [WBC], log(platelet) [PLT], and hematocrit [HCT], denoted y_{t1}, y_{t2}, and y_{t3}, respectively. Approximately 40% of the values are missing, with missing values occurring primarily after the 35th day. The main objectives are to model the three variables using the state-space approach, and to estimate the missing values. According to Jones, “Platelet count at about 100 days post transplant has previously been shown to be a good indicator of subsequent long term survival.” For this particular situation, we model the three variables in terms of the state equation (6.1); that is,

( x_{t1} )     ( φ_{11} φ_{12} φ_{13} ) ( x_{t−1,1} )     ( w_{t1} )
( x_{t2} )  =  ( φ_{21} φ_{22} φ_{23} ) ( x_{t−1,2} )  +  ( w_{t2} ).   (6.5)
( x_{t3} )     ( φ_{31} φ_{32} φ_{33} ) ( x_{t−1,3} )     ( w_{t3} )

The 3 × 3 observation matrix, A_t, is either the identity matrix, or the identity matrix with all zeros in a row when that variable is missing. The covariance matrices R and Q are 3 × 3 matrices with R = diag{r_{11}, r_{22}, r_{33}}, a diagonal matrix, required for a simple approach when data are missing.

The following R code was used to produce Figure 6.1. These data have zero as the missing data code; to produce a cleaner graphic, we set the zeros to NA.

1 blood = cbind(WBC, PLT, HCT)

2 blood = replace(blood, blood==0, NA)

3 plot(blood, type="o", pch=19, xlab="day", main="")
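Although the fitting itself is deferred to later sections, the time-varying observation matrices A_t of this example are easy to set up in R; a small sketch, assuming the blood matrix from the code above (the array A and its layout are our own choice for illustration, not a package requirement):

n = nrow(blood)
A = array(0, dim = c(3, 3, n))                         # one 3 x 3 matrix A_t per day
for (t in 1:n) A[, , t] = diag(ifelse(is.na(blood[t, ]), 0, 1))
# days with a missing variable get a zeroed row in the corresponding A_t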

The model given in (6.1) involving only a single lag is not unduly restrictive. A multivariate model with m lags, such as the VAR(m) discussed in §5.8, could be developed by replacing the p × 1 state vector, x_t, by the pm × 1 state vector X_t = (x_t′, x_{t−1}′, . . . , x_{t−m+1}′)′ and the transition matrix by

      ( Φ_1  Φ_2  · · ·  Φ_{m−1}  Φ_m )
      (  I    0   · · ·    0       0  )
Φ  =  (  0    I   · · ·    0       0  ) .   (6.6)
      (  ⋮    ⋮    ⋱       ⋮       ⋮  )
      (  0    0   · · ·    I       0  )

Letting W_t = (w_t′, 0′, . . . , 0′)′ be the new pm × 1 state error vector, the new state equation will be

X_t = Φ X_{t−1} + W_t,   (6.7)

where var(W_t) is a pm × pm matrix with Q in the upper left-hand corner and zeros elsewhere. The observation equation can then be written as

y_t = [ A_t | 0 | · · · | 0 ] X_t + v_t.   (6.8)

This simple recoding shows one way of handling higher order lags within the context of the single lag structure. It is not necessary and often not desirable


Fig. 6.2. Annual global temperature deviation series, measured in degrees centigrade, 1880–2009. The solid line is the land-marine series (gtemp) whereas the dashed line shows the land-based series (gtemp2).

to have a singular W_t process in the state equation, (6.7). Further discussion of this topic is given in §6.6.
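A small R sketch of the recoding in (6.6)–(6.7), building the pm × pm transition matrix from a list of p × p coefficient matrices; the helper name companion and the example coefficients are ours, purely for illustration:

companion <- function(Phi.list) {
  m <- length(Phi.list); p <- nrow(Phi.list[[1]])
  top    <- do.call(cbind, Phi.list)                          # [ Phi_1 ... Phi_m ]
  bottom <- cbind(diag(p*(m - 1)), matrix(0, p*(m - 1), p))   # shifted identities, zeros
  rbind(top, bottom)
}
# example: two hypothetical 2 x 2 coefficient matrices (m = 2, p = 2)
companion(list(matrix(c(.5, .1, 0, .3), 2, 2), matrix(c(.2, 0, .1, -.1), 2, 2)))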

The real advantages of the state-space formulation, however, do not really come through in the simple example given above. The special forms that can be developed for various versions of the matrix A_t and for the transition scheme defined by the matrix Φ allow fitting more parsimonious structures with fewer parameters needed to describe a multivariate time series. We will give some examples of structural models in §6.5, but the simple example shown below is instructive.

Example 6.2 Global Warming

Figure 6.2 shows two different estimators for the global temperature series from 1880 to 2009. The solid line is gtemp, which was considered in the first chapter; these are the global mean land-ocean temperature index data. The second series, gtemp2, are the surface air temperature index data using only meteorological station data. Precise details may be obtained from http://data.giss.nasa.gov/gistemp/graphs/. Conceptually, both series should be measuring the same underlying climatic signal, and we may consider the problem of extracting this underlying signal. The R code to generate the figure is

1 ts.plot(gtemp, gtemp2, lty=1:2, ylab="Temperature Deviations")

We suppose both series are observing the same signal with different noises; that is,

y_{t1} = x_t + v_{t1}   and   y_{t2} = x_t + v_{t2},

or more compactly as

( y_{t1} )     ( 1 )         ( v_{t1} )
( y_{t2} )  =  ( 1 ) x_t  +  ( v_{t2} ),   (6.9)

where

R = var ( v_{t1} )  =  ( r_{11}  r_{12} )
        ( v_{t2} )     ( r_{21}  r_{22} ).

It is reasonable to suppose that the unknown common signal, x_t, can be modeled as a random walk with drift of the form

x_t = δ + x_{t−1} + w_t,   (6.10)

with Q = var(w_t). In this example, p = 1, q = 2, Φ = 1, and Υ = δ with u_t ≡ 1.
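To get a feel for the model (6.9)–(6.10) before any estimation, one can simulate it directly; a minimal sketch with arbitrary, illustrative parameter values (the drift and noise standard deviations below are not estimates from the temperature data):

set.seed(1)
n = 130; delta = .01
x  = cumsum(delta + rnorm(n, 0, .1))   # common signal: random walk with drift, as in (6.10)
y1 = x + rnorm(n, 0, .2)               # two noisy observations of the same signal, as in (6.9)
y2 = x + rnorm(n, 0, .2)
ts.plot(cbind(y1, y2, x), lty = c(2, 2, 1))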

The introduction of the state-space approach as a tool for modeling data in the social and biological sciences requires model identification and parameter estimation because there is rarely a well-defined differential equation describing the state transition. The questions of general interest for the dynamic linear model (6.3) and (6.4) relate to estimating the unknown parameters contained in Φ, Υ, Q, Γ, A_t, and R, that define the particular model, and estimating or forecasting values of the underlying unobserved process x_t. The advantages of the state-space formulation are in the ease with which we can treat various missing data configurations and in the incredible array of models that can be generated from (6.1) and (6.2). The analogy between the observation matrix A_t and the design matrix in the usual regression and analysis of variance setting is a useful one. We can generate fixed and random effect structures that are either constant or vary over time simply by making appropriate choices for the matrix A_t and the transition structure Φ. We give a few examples in this chapter; for further examples, see Durbin and Koopman (2001), Harvey (1993), Jones (1993), or Shumway (1988) to mention a few.

Before continuing our investigation of the more complex model, it is instructive to consider a simple univariate state-space model wherein an AR(1) process is observed using a noisy instrument.

Example 6.3 An AR(1) Process with Observational Noise

Consider a univariate state-space model where the observations are noisy,

yt = xt + vt, (6.11)

and the signal (state) is an AR(1) process,

xt = φxt−1 + wt, (6.12)

for t = 1, 2, . . . , n, where v_t ∼ iid N(0, σ_v²), w_t ∼ iid N(0, σ_w²), and x_0 ∼ N(0, σ_w²/(1 − φ²)); {v_t}, {w_t}, x_0 are independent.

In Chapter 3, we investigated the properties of the state, x_t, because it is a stationary AR(1) process (recall Problem 3.2e). For example, we know the autocovariance function of x_t is

γ_x(h) = ( σ_w² / (1 − φ²) ) φ^h,   h = 0, 1, 2, . . . .   (6.13)

But here, we must investigate how the addition of observation noise affects the dynamics. Although it is not a necessary assumption, we have assumed in this example that x_t is stationary. In this case, the observations are also stationary because y_t is the sum of two independent stationary components x_t and v_t. We have

γ_y(0) = var(y_t) = var(x_t + v_t) = σ_w² / (1 − φ²) + σ_v²,   (6.14)

and, when h ≥ 1,

γ_y(h) = cov(y_t, y_{t−h}) = cov(x_t + v_t, x_{t−h} + v_{t−h}) = γ_x(h).   (6.15)

Consequently, for h ≥ 1, the ACF of the observations is

ρ_y(h) = γ_y(h) / γ_y(0) = [ 1 + (σ_v² / σ_w²)(1 − φ²) ]^{−1} φ^h.   (6.16)

It should be clear from the correlation structure given by (6.16) that the observations, y_t, are not AR(1) unless σ_v² = 0. In addition, the autocorrelation structure of y_t is identical to the autocorrelation structure of an ARMA(1,1) process, as presented in Example 3.13. Thus, the observations can also be written in an ARMA(1,1) form,

y_t = φ y_{t−1} + θ u_{t−1} + u_t,

where u_t is Gaussian white noise with variance σ_u², and with θ and σ_u² suitably chosen. We leave the specifics of this problem alone for now and defer the discussion to §6.6; in particular, see Example 6.11.
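A quick simulation check of (6.16): generate an AR(1) signal plus observation noise and compare the sample ACF of y_t with the theoretical values; the parameter choices below are arbitrary illustrations.

set.seed(1)
n = 10000; phi = .8; sw = 1; sv = 1
x = arima.sim(list(ar = phi), n, sd = sw)     # state: AR(1)
y = x + rnorm(n, 0, sv)                       # observations with noise
h = 1:5
rbind(sample = acf(y, 5, plot = FALSE)$acf[h + 1],
      theory = (1 + (sv^2/sw^2)*(1 - phi^2))^(-1) * phi^h)   # per (6.16)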

Although an equivalence exists between stationary ARMA models and stationary state-space models (see §6.6), it is sometimes easier to work with one form than another. As previously mentioned, in the case of missing data, complex multivariate systems, mixed effects, and certain types of nonstationarity, it is easier to work in the framework of state-space models; in this chapter, we explore some of these situations.


6.2 Filtering, Smoothing, and Forecasting

From a practical view, the primary aims of any analysis involving the state-space model as defined by (6.1)–(6.2), or by (6.3)–(6.4), would be to produce estimators for the underlying unobserved signal x_t, given the data Y_s = {y_1, . . . , y_s}, to time s. When s < t, the problem is called forecasting or prediction. When s = t, the problem is called filtering, and when s > t, the problem is called smoothing. In addition to these estimates, we would also want to measure their precision. The solution to these problems is accomplished via the Kalman filter and smoother and is the focus of this section.

Throughout this chapter, we will use the following definitions:

x_t^s = E(x_t | Y_s)   (6.17)

and

P_{t1,t2}^s = E{ (x_{t1} − x_{t1}^s)(x_{t2} − x_{t2}^s)′ }.   (6.18)

When t_1 = t_2 (= t, say) in (6.18), we will write P_t^s for convenience.

In obtaining the filtering and smoothing equations, we will rely heavily on the Gaussian assumption. Some knowledge of the material covered in Appendix B, §B.1, will be helpful in understanding the details of this section (although these details may be skipped on a casual reading of the material). Even in the non-Gaussian case, the estimators we obtain are the minimum mean-squared error estimators within the class of linear estimators. That is, we can think of E in (6.17) as the projection operator in the sense of §B.1 rather than expectation and Y_s as the space of linear combinations of {y_1, . . . , y_s}; in this case, P_t^s is the corresponding mean-squared error. When we assume, as in this section, the processes are Gaussian, (6.18) is also the conditional error covariance; that is,

P_{t1,t2}^s = E{ (x_{t1} − x_{t1}^s)(x_{t2} − x_{t2}^s)′ | Y_s }.

This fact can be seen, for example, by noting the covariance matrix between (x_t − x_t^s) and Y_s, for any t and s, is zero; we could say they are orthogonal in the sense of §B.1. This result implies that (x_t − x_t^s) and Y_s are independent (because of the normality), and hence, the conditional distribution of (x_t − x_t^s) given Y_s is the unconditional distribution of (x_t − x_t^s). Derivations of the filtering and smoothing equations from a Bayesian perspective are given in Meinhold and Singpurwalla (1983); more traditional approaches based on the concept of projection and on multivariate normal distribution theory are given in Jazwinski (1970) and Anderson and Moore (1979).

First, we present the Kalman filter, which gives the filtering and forecasting equations. The name filter comes from the fact that x_t^t is a linear filter of the observations y_1, . . . , y_t; that is, x_t^t = Σ_{s=1}^{t} B_s y_s for suitably chosen p × q matrices B_s. The advantage of the Kalman filter is that it specifies how to update the filter from x_{t−1}^{t−1} to x_t^t once a new observation y_t is obtained, without having to reprocess the entire data set y_1, . . . , y_t.


Property 6.1 The Kalman Filter
For the state-space model specified in (6.3) and (6.4), with initial conditions x_0^0 = µ_0 and P_0^0 = Σ_0, for t = 1, . . . , n,

x_t^{t−1} = Φ x_{t−1}^{t−1} + Υ u_t,   (6.19)

P_t^{t−1} = Φ P_{t−1}^{t−1} Φ′ + Q,   (6.20)

with

x_t^t = x_t^{t−1} + K_t (y_t − A_t x_t^{t−1} − Γ u_t),   (6.21)

P_t^t = [I − K_t A_t] P_t^{t−1},   (6.22)

where

K_t = P_t^{t−1} A_t′ [A_t P_t^{t−1} A_t′ + R]^{−1}   (6.23)

is called the Kalman gain. Prediction for t > n is accomplished via (6.19) and (6.20) with initial conditions x_n^n and P_n^n. Important byproducts of the filter are the innovations (prediction errors)

ε_t = y_t − E(y_t | Y_{t−1}) = y_t − A_t x_t^{t−1} − Γ u_t,   (6.24)

and the corresponding variance-covariance matrices

Σ_t := var(ε_t) = var[A_t (x_t − x_t^{t−1}) + v_t] = A_t P_t^{t−1} A_t′ + R   (6.25)

for t = 1, . . . , n.

Proof. The derivations of (6.19) and (6.20) follow from straightforward calculations, because from (6.3) we have

x_t^{t−1} = E(x_t | Y_{t−1}) = E(Φ x_{t−1} + Υ u_t + w_t | Y_{t−1}) = Φ x_{t−1}^{t−1} + Υ u_t,

and thus

P_t^{t−1} = E{ (x_t − x_t^{t−1})(x_t − x_t^{t−1})′ }
          = E{ [Φ(x_{t−1} − x_{t−1}^{t−1}) + w_t][Φ(x_{t−1} − x_{t−1}^{t−1}) + w_t]′ }
          = Φ P_{t−1}^{t−1} Φ′ + Q.

To derive (6.21), we note that E(ε_t y_s′) = 0 for s < t, which, in view of the fact that the innovation sequence is a Gaussian process, implies that the innovations are independent of the past observations. Furthermore, the conditional covariance between x_t and ε_t given Y_{t−1} is

cov(x_t, ε_t | Y_{t−1}) = cov(x_t, y_t − A_t x_t^{t−1} − Γ u_t | Y_{t−1})
                       = cov(x_t − x_t^{t−1}, y_t − A_t x_t^{t−1} − Γ u_t | Y_{t−1})
                       = cov[x_t − x_t^{t−1}, A_t (x_t − x_t^{t−1}) + v_t]
                       = P_t^{t−1} A_t′.   (6.26)

Using these results we have that the joint conditional distribution of x_t and ε_t given Y_{t−1} is normal

( x_t )  | Y_{t−1}  ∼  N( [ x_t^{t−1} ] , [ P_t^{t−1}       P_t^{t−1} A_t′ ] ).   (6.27)
( ε_t )                   [    0      ]   [ A_t P_t^{t−1}   Σ_t            ]

Thus, using (B.9) of Appendix B, we can write

x_t^t = E(x_t | y_1, . . . , y_{t−1}, y_t) = E(x_t | Y_{t−1}, ε_t) = x_t^{t−1} + K_t ε_t,   (6.28)

where

K_t = P_t^{t−1} A_t′ Σ_t^{−1} = P_t^{t−1} A_t′ (A_t P_t^{t−1} A_t′ + R)^{−1}.

The evaluation of P_t^t is easily computed from (6.27) [see (B.10)] as

P_t^t = cov(x_t | Y_{t−1}, ε_t) = P_t^{t−1} − P_t^{t−1} A_t′ Σ_t^{−1} A_t P_t^{t−1},

which simplifies to (6.22). □

Nothing in the proof of Property 6.1 precludes the cases where some or all of the parameters vary with time, or where the observation dimension changes with time, which leads to the following corollary.

Corollary 6.1 Kalman Filter: The Time-Varying Case
If, in the DLM (6.3) and (6.4), any or all of the parameters are time dependent, Φ = Φ_t, Υ = Υ_t, Q = Q_t in the state equation or Γ = Γ_t, R = R_t in the observation equation, or the dimension of the observational equation is time dependent, q = q_t, Property 6.1 holds with the appropriate substitutions.

Next, we explore the model, prediction, and filtering from a density point of view. For the sake of brevity, consider the Gaussian DLM without inputs, as described in (6.1) and (6.2); that is,

x_t = Φ x_{t−1} + w_t   and   y_t = A_t x_t + v_t.

Recall w_t and v_t are independent, white Gaussian sequences, and the initial state is normal, say, x_0 ∼ N(µ_0, Σ_0); we will denote the initial p-variate state normal density by f_0(x_0). Now, letting p_Θ(·) denote a generic density function with parameters represented by Θ, we could describe the state relationship as

p_Θ(x_t | x_{t−1}, x_{t−2}, . . . , x_0) = p_Θ(x_t | x_{t−1}) = f_w(x_t − Φ x_{t−1}),   (6.29)

where f_w(·) denotes the p-variate normal density with mean zero and variance-covariance matrix Q. In (6.29), we are stating the process is Markovian, linear, and Gaussian. The relationship of the observations to the state process is written as

p_Θ(y_t | x_t, Y_{t−1}) = p_Θ(y_t | x_t) = f_v(y_t − A_t x_t),   (6.30)

where f_v(·) denotes the q-variate normal density with mean zero and variance-covariance matrix R. In (6.30), we are stating the observations are conditionally independent given the state, and the observations are linear and Gaussian.


Note, (6.29), (6.30), and the initial density, f_0(·), completely specify the model in terms of densities, namely,

p_Θ(x_0, x_1, . . . , x_n, y_1, . . . , y_n) = f_0(x_0) Π_{t=1}^{n} f_w(x_t − Φ x_{t−1}) f_v(y_t − A_t x_t),   (6.31)

where Θ = {µ_0, Σ_0, Φ, Q, R}.

Given the data, Y_{t−1} = {y_1, . . . , y_{t−1}}, and the current filter density, p_Θ(x_{t−1} | Y_{t−1}), Property 6.1 tells us, via conditional means and variances, how to recursively generate the Gaussian forecast density, p_Θ(x_t | Y_{t−1}), and how to update the density given the current observation, y_t, to obtain the Gaussian filter density, p_Θ(x_t | Y_t). In terms of densities, the Kalman filter can be seen as a simple Bayesian updating scheme, where, to determine the forecast and filter densities, we have

p_Θ(x_t | Y_{t−1}) = ∫_{R^p} p_Θ(x_t, x_{t−1} | Y_{t−1}) dx_{t−1}
                  = ∫_{R^p} p_Θ(x_t | x_{t−1}) p_Θ(x_{t−1} | Y_{t−1}) dx_{t−1}
                  = ∫_{R^p} f_w(x_t − Φ x_{t−1}) p_Θ(x_{t−1} | Y_{t−1}) dx_{t−1},   (6.32)

which simplifies to the p-variate N(x_t^{t−1}, P_t^{t−1}) density, and

p_Θ(x_t | Y_t) = p_Θ(x_t | y_t, Y_{t−1})
              ∝ p_Θ(y_t | x_t) p_Θ(x_t | Y_{t−1})
              = f_v(y_t − A_t x_t) p_Θ(x_t | Y_{t−1}),   (6.33)

from which we can deduce p_Θ(x_t | Y_t) is the p-variate N(x_t^t, P_t^t) density. These statements are true for t = 1, . . . , n, with initial condition p_Θ(x_0 | Y_0) = f_0(x_0). The prediction and filter recursions of Property 6.1 could also have been calculated directly from the density relationships (6.32) and (6.33) using multivariate normal distribution theory. The following example illustrates the Bayesian updating scheme.

Example 6.4 Bayesian Analysis of a Local Level Model

In this example, we suppose that we observe a univariate series y_t that consists of a trend component, µ_t, and a noise component, v_t, where

y_t = µ_t + v_t   (6.34)

and v_t ∼ iid N(0, σ_v²). In particular, we assume the trend is a random walk given by

µ_t = µ_{t−1} + w_t   (6.35)


where w_t ∼ iid N(0, σ_w²) is independent of {v_t}. Recall Example 6.2, where we suggested this type of trend model for the global temperature series.

The model is, of course, a state-space model with (6.34) being the observation equation, and (6.35) being the state equation. For forecasting, we seek the posterior density p(µ_t | Y_{t−1}). We will use the following notation introduced in Blight (1974) for the multivariate case. Let

{x; µ, σ²} = exp{ −(1/(2σ²)) (x − µ)² },   (6.36)

then simple manipulation shows

{x; µ, σ²} = {µ; x, σ²}   (6.37)

and

{x; µ_1, σ_1²}{x; µ_2, σ_2²} = { x; (µ_1/σ_1² + µ_2/σ_2²)/(1/σ_1² + 1/σ_2²), (1/σ_1² + 1/σ_2²)^{−1} } × { µ_1; µ_2, σ_1² + σ_2² }.   (6.38)

Thus, using (6.32), (6.37) and (6.38) we have

p(µ_t | Y_{t−1}) ∝ ∫ {µ_t; µ_{t−1}, σ_w²} {µ_{t−1}; µ_{t−1}^{t−1}, P_{t−1}^{t−1}} dµ_{t−1}
                = ∫ {µ_{t−1}; µ_t, σ_w²} {µ_{t−1}; µ_{t−1}^{t−1}, P_{t−1}^{t−1}} dµ_{t−1}
                = {µ_t; µ_{t−1}^{t−1}, P_{t−1}^{t−1} + σ_w²}.   (6.39)

From (6.39) we conclude that

µ_t | Y_{t−1} ∼ N(µ_t^{t−1}, P_t^{t−1})   (6.40)

where

µ_t^{t−1} = µ_{t−1}^{t−1}   and   P_t^{t−1} = P_{t−1}^{t−1} + σ_w²,   (6.41)

which agrees with the first part of Property 6.1.

To derive the filter density using (6.33) and (6.37) we have

p(µ_t | Y_t) ∝ {y_t; µ_t, σ_v²} {µ_t; µ_t^{t−1}, P_t^{t−1}}
            = {µ_t; y_t, σ_v²} {µ_t; µ_t^{t−1}, P_t^{t−1}}.   (6.42)

An application of (6.38) gives

µ_t | Y_t ∼ N(µ_t^t, P_t^t)   (6.43)

with

µ_t^t = σ_v² µ_t^{t−1} / (P_t^{t−1} + σ_v²) + P_t^{t−1} y_t / (P_t^{t−1} + σ_v²) = µ_t^{t−1} + K_t (y_t − µ_t^{t−1}),   (6.44)


where we have defined

K_t = P_t^{t−1} / (P_t^{t−1} + σ_v²),   (6.45)

and

P_t^t = (1/σ_v² + 1/P_t^{t−1})^{−1} = σ_v² P_t^{t−1} / (P_t^{t−1} + σ_v²) = (1 − K_t) P_t^{t−1}.   (6.46)

The filter for this specific case, of course, agrees with Property 6.1.
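The scalar recursions (6.41) and (6.44)–(6.46) are easy to code directly; a minimal sketch of the local level filter (the function name local.level.filter is ours), whose output can be compared with that of the script Ksmooth0 used in Example 6.5 below:

local.level.filter <- function(y, sw2, sv2, mu0 = 0, P0 = 1) {
  n  = length(y)
  mp = Pp = mf = Pf = numeric(n)            # predictions and filters
  m  = mu0; P = P0
  for (t in 1:n) {
    mp[t] = m;            Pp[t] = P + sw2   # forecast step, (6.41)
    K     = Pp[t]/(Pp[t] + sv2)             # gain, (6.45)
    mf[t] = mp[t] + K*(y[t] - mp[t])        # filter mean, (6.44)
    Pf[t] = (1 - K)*Pp[t]                   # filter variance, (6.46)
    m = mf[t]; P = Pf[t]
  }
  list(pred = mp, Ppred = Pp, filt = mf, Pfilt = Pf)
}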

Next, we consider the problem of obtaining estimators for x_t based on the entire data sample y_1, . . . , y_n, where t ≤ n, namely, x_t^n. These estimators are called smoothers because a time plot of the sequence {x_t^n; t = 1, . . . , n} is typically smoother than the forecasts {x_t^{t−1}; t = 1, . . . , n} or the filters {x_t^t; t = 1, . . . , n}. As is obvious from the above remarks, smoothing implies that each estimated value is a function of the present, future, and past, whereas the filtered estimator depends on the present and past. The forecast depends only on the past, as usual.

Property 6.2 The Kalman Smoother
For the state-space model specified in (6.3) and (6.4), with initial conditions x_n^n and P_n^n obtained via Property 6.1, for t = n, n − 1, . . . , 1,

x_{t−1}^n = x_{t−1}^{t−1} + J_{t−1} (x_t^n − x_t^{t−1}),   (6.47)

P_{t−1}^n = P_{t−1}^{t−1} + J_{t−1} (P_t^n − P_t^{t−1}) J_{t−1}′,   (6.48)

where

J_{t−1} = P_{t−1}^{t−1} Φ′ [P_t^{t−1}]^{−1}.   (6.49)

Proof. The smoother can be derived in many ways. Here we provide a proof that was given in Ansley and Kohn (1982). First, for 1 ≤ t ≤ n, define

Y_{t−1} = {y_1, . . . , y_{t−1}}   and   η_t = {v_t, . . . , v_n, w_{t+1}, . . . , w_n},

with Y_0 being empty, and let

q_{t−1} = E{x_{t−1} | Y_{t−1}, x_t − x_t^{t−1}, η_t}.

Then, because Y_{t−1}, {x_t − x_t^{t−1}}, and η_t are mutually independent, and x_{t−1} and η_t are independent, using (B.9) we have

q_{t−1} = x_{t−1}^{t−1} + J_{t−1} (x_t − x_t^{t−1}),   (6.50)

where

J_{t−1} = cov(x_{t−1}, x_t − x_t^{t−1}) [P_t^{t−1}]^{−1} = P_{t−1}^{t−1} Φ′ [P_t^{t−1}]^{−1}.

Finally, because Y_{t−1}, x_t − x_t^{t−1}, and η_t generate Y_n = {y_1, . . . , y_n},

x_{t−1}^n = E{x_{t−1} | Y_n} = E{q_{t−1} | Y_n} = x_{t−1}^{t−1} + J_{t−1} (x_t^n − x_t^{t−1}),

which establishes (6.47).

The recursion for the error covariance, P_{t−1}^n, is obtained by straightforward calculation. Using (6.47) we obtain

x_{t−1} − x_{t−1}^n = x_{t−1} − x_{t−1}^{t−1} − J_{t−1} (x_t^n − Φ x_{t−1}^{t−1}),

or

(x_{t−1} − x_{t−1}^n) + J_{t−1} x_t^n = (x_{t−1} − x_{t−1}^{t−1}) + J_{t−1} Φ x_{t−1}^{t−1}.   (6.51)

Multiplying each side of (6.51) by the transpose of itself and taking expectation, we have

P_{t−1}^n + J_{t−1} E(x_t^n x_t^{n}′) J_{t−1}′ = P_{t−1}^{t−1} + J_{t−1} Φ E(x_{t−1}^{t−1} x_{t−1}^{t−1}′) Φ′ J_{t−1}′,   (6.52)

using the fact the cross-product terms are zero. But,

E(x_t^n x_t^{n}′) = E(x_t x_t′) − P_t^n = Φ E(x_{t−1} x_{t−1}′) Φ′ + Q − P_t^n,

and

E(x_{t−1}^{t−1} x_{t−1}^{t−1}′) = E(x_{t−1} x_{t−1}′) − P_{t−1}^{t−1},

so (6.52) simplifies to (6.48). □

Example 6.5 Prediction, Filtering and Smoothing for the Local Level Model

For this example, we simulated n = 50 observations from the local level trend model discussed in Example 6.4. We generated a random walk

µ_t = µ_{t−1} + w_t   (6.53)

with w_t ∼ iid N(0, 1) and µ_0 ∼ N(0, 1). We then supposed that we observe a univariate series y_t consisting of the trend component, µ_t, and a noise component, v_t ∼ iid N(0, 1), where

y_t = µ_t + v_t.   (6.54)

The sequences {w_t}, {v_t} and µ_0 were generated independently. We then ran the Kalman filter and smoother, Properties 6.1 and 6.2, using the actual parameters. The top panel of Figure 6.3 shows the actual values of µ_t as points, and the predictions µ_t^{t−1}, for t = 1, 2, . . . , 50, superimposed on the graph as a line. In addition, we display µ_t^{t−1} ± 2√P_t^{t−1} as dashed lines on the plot. The middle panel displays the filter, µ_t^t, for t = 1, . . . , 50, as a line with µ_t^t ± 2√P_t^t as dashed lines. The bottom panel of Figure 6.3 shows a similar plot for the smoother µ_t^n.

Table 6.1 shows the first 10 observations as well as the corresponding state values, the predictions, filters and smoothers. Note that in Table 6.1,


[Figure 6.3 about here: three panels (Prediction, Filter, Smoother) showing $\mu$ versus Time, $t = 1, \ldots, 50$; see the caption below.]

Fig. 6.3. Displays for Example 6.5. The simulated values of $\mu_t$, for $t = 1, \ldots, 50$, given by (6.53) are shown as points. Top: The predictions $\mu_t^{t-1}$ obtained via the Kalman filter are shown as a line. Error bounds, $\mu_t^{t-1} \pm 2\sqrt{P_t^{t-1}}$, are shown as dashed lines. Middle: The filters $\mu_t^t$ obtained via the Kalman filter are shown as a line. Error bounds, $\mu_t^t \pm 2\sqrt{P_t^t}$, are shown as dashed lines. Bottom: The smoothers $\mu_t^n$ obtained via the Kalman smoother are shown as a line. Error bounds, $\mu_t^n \pm 2\sqrt{P_t^n}$, are shown as dashed lines.

Note that in Table 6.1, the one-step-ahead prediction is more uncertain than the corresponding filtered value, which, in turn, is more uncertain than the corresponding smoother value (that is, $P_t^{t-1} \ge P_t^t \ge P_t^n$). Also, in each case, the error variances stabilize quickly.

The R code for this example is as follows. In the example we use Ksmooth0, which calls Kfilter0 for the filtering part (see Appendix R). In the returned values from Ksmooth0, the letters p, f, s denote prediction, filter, and smooth, respectively (e.g., xp is $x_t^{t-1}$, xf is $x_t^t$, xs is $x_t^n$, and so on).


Table 6.1. Forecasts, Filters, and Smoothers for Example 6.5

  t     y_t     µ_t   µ_t^{t-1}  P_t^{t-1}   µ_t^t   P_t^t   µ_t^n   P_t^n
  0      —     −.63      —          —         .00    1.00    −.32     .62
  1    −1.05   −.44     .00        2.00      −.70     .67    −.65     .47
  2     −.94  −1.28    −.70        1.67      −.85     .63    −.57     .45
  3     −.81    .32    −.85        1.63      −.83     .62    −.11     .45
  4     2.08    .65    −.83        1.62       .97     .62    1.04     .45
  5     1.81   −.17     .97        1.62      1.49     .62    1.16     .45
  6     −.05    .31    1.49        1.62       .53     .62     .63     .45
  7      .01   1.05     .53        1.62       .21     .62     .78     .45
  8     2.20   1.63     .21        1.62      1.44     .62    1.70     .45
  9     1.19   1.32    1.44        1.62      1.28     .62    2.12     .45
 10     5.24   2.83    1.28        1.62      3.73     .62    3.48     .45

These scripts use the Cholesky decomposition¹ of Q and R; they are denoted by cQ and cR. Practically, the scripts only require that Q or R may be reconstructed as t(cQ)%*%(cQ) or t(cR)%*%(cR), respectively, which allows more flexibility. For example, the model (6.7)–(6.8) does not pose a problem even though the state noise covariance matrix is not positive definite.

# generate data
set.seed(1); num = 50
w = rnorm(num+1,0,1); v = rnorm(num,0,1)
mu = cumsum(w)       # state: mu[0], mu[1],..., mu[50]
y = mu[-1] + v       # obs: y[1],..., y[50]
# filter and smooth
mu0 = 0; sigma0 = 1; phi = 1; cQ = 1; cR = 1
ks = Ksmooth0(num, y, 1, mu0, sigma0, phi, cQ, cR)
# start figure
par(mfrow=c(3,1)); Time = 1:num
plot(Time, mu[-1], main="Prediction", ylim=c(-5,10))
lines(ks$xp)
lines(ks$xp+2*sqrt(ks$Pp), lty="dashed", col="blue")
lines(ks$xp-2*sqrt(ks$Pp), lty="dashed", col="blue")
plot(Time, mu[-1], main="Filter", ylim=c(-5,10))
lines(ks$xf)
lines(ks$xf+2*sqrt(ks$Pf), lty="dashed", col="blue")
lines(ks$xf-2*sqrt(ks$Pf), lty="dashed", col="blue")
plot(Time, mu[-1], main="Smoother", ylim=c(-5,10))
lines(ks$xs)
lines(ks$xs+2*sqrt(ks$Ps), lty="dashed", col="blue")
lines(ks$xs-2*sqrt(ks$Ps), lty="dashed", col="blue")
mu[1]; ks$x0n; sqrt(ks$P0n)   # initial value info

¹ Given a positive definite matrix A, its Cholesky decomposition is an upper triangular matrix U with strictly positive diagonal entries such that A = U'U. In R, use chol(A). For the univariate case, it is simply the positive square root of A.
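As a quick check of the convention assumed by the scripts, the lines below verify that chol() returns the upper triangular factor U with A = U'U, so that a covariance matrix can be recovered from the argument passed as cQ or cR; the matrix here is an arbitrary illustrative choice.

# verify the Cholesky convention A = t(U) %*% U used for cQ and cR
A = matrix(c(2, 1, 1, 2), 2, 2)   # an arbitrary positive definite matrix
U = chol(A)                       # upper triangular factor
all.equal(t(U) %*% U, A)          # should return TRUE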


When we discuss maximum likelihood estimation via the EM algorithm in the next section, we will need a set of recursions for obtaining $P_{t,t-1}^n$, as defined in (6.18). We give the necessary recursions in the following property.

Property 6.3 The Lag-One Covariance Smoother
For the state-space model specified in (6.3) and (6.4), with $K_t$, $J_t$ ($t = 1, \ldots, n$), and $P_n^n$ obtained from Properties 6.1 and 6.2, and with initial condition
$$P_{n,n-1}^n = (I - K_n A_n)\Phi P_{n-1}^{n-1}, \tag{6.55}$$
for $t = n, n-1, \ldots, 2$,
$$P_{t-1,t-2}^n = P_{t-1}^{t-1}J_{t-2}' + J_{t-1}\big(P_{t,t-1}^n - \Phi P_{t-1}^{t-1}\big)J_{t-2}'. \tag{6.56}$$

Proof. Because we are computing covariances, we may assume $u_t \equiv 0$ without loss of generality. To derive the initial term (6.55), we first define
$$\tilde{x}_t^s = x_t - x_t^s.$$
Then, using (6.21) and (6.47), we write
$$
P_{t,t-1}^t = E\big(\tilde{x}_t^t\, \tilde{x}_{t-1}^{t\prime}\big)
= E\big\{\big[\tilde{x}_t^{t-1} - K_t(y_t - A_t x_t^{t-1})\big]\big[\tilde{x}_{t-1}^{t-1} - J_{t-1}K_t(y_t - A_t x_t^{t-1})\big]'\big\}
= E\big\{\big[\tilde{x}_t^{t-1} - K_t(A_t\tilde{x}_t^{t-1} + v_t)\big]\big[\tilde{x}_{t-1}^{t-1} - J_{t-1}K_t(A_t\tilde{x}_t^{t-1} + v_t)\big]'\big\}.
$$
Expanding terms and taking expectation, we arrive at
$$P_{t,t-1}^t = P_{t,t-1}^{t-1} - P_t^{t-1}A_t'K_t'J_{t-1}' - K_t A_t P_{t,t-1}^{t-1} + K_t\big(A_t P_t^{t-1}A_t' + R\big)K_t'J_{t-1}',$$
noting $E\big(\tilde{x}_t^{t-1}v_t'\big) = 0$. The final simplification occurs by realizing that $K_t\big(A_t P_t^{t-1}A_t' + R\big) = P_t^{t-1}A_t'$, and $P_{t,t-1}^{t-1} = \Phi P_{t-1}^{t-1}$. These relationships hold for any $t = 1, \ldots, n$, and (6.55) is the case $t = n$.
We give the basic steps in the derivation of (6.56). The first step is to use (6.47) to write
$$\tilde{x}_{t-1}^n + J_{t-1}x_t^n = \tilde{x}_{t-1}^{t-1} + J_{t-1}\Phi x_{t-1}^{t-1} \tag{6.57}$$
and
$$\tilde{x}_{t-2}^n + J_{t-2}x_{t-1}^n = \tilde{x}_{t-2}^{t-2} + J_{t-2}\Phi x_{t-2}^{t-2}. \tag{6.58}$$
Next, multiply the left-hand side of (6.57) by the transpose of the left-hand side of (6.58), and equate that to the corresponding result of the right-hand sides of (6.57) and (6.58). Then, taking expectation of both sides, the left-hand side result reduces to
$$P_{t-1,t-2}^n + J_{t-1}E\big(x_t^n x_{t-1}^{n\prime}\big)J_{t-2}' \tag{6.59}$$
and the right-hand side result reduces to


$$
P_{t-1,t-2}^{t-2} - K_{t-1}A_{t-1}P_{t-1,t-2}^{t-2} + J_{t-1}\Phi K_{t-1}A_{t-1}P_{t-1,t-2}^{t-2}
+ J_{t-1}\Phi E\big(x_{t-1}^{t-1}x_{t-2}^{t-2\prime}\big)\Phi' J_{t-2}'. \tag{6.60}
$$
In (6.59), write
$$E\big(x_t^n x_{t-1}^{n\prime}\big) = E\big(x_t x_{t-1}'\big) - P_{t,t-1}^n = \Phi E\big(x_{t-1}x_{t-2}'\big)\Phi' + \Phi Q - P_{t,t-1}^n,$$
and in (6.60), write
$$E\big(x_{t-1}^{t-1}x_{t-2}^{t-2\prime}\big) = E\big(x_{t-1}^{t-2}x_{t-2}^{t-2\prime}\big) = E\big(x_{t-1}x_{t-2}'\big) - P_{t-1,t-2}^{t-2}.$$
Equating (6.59) to (6.60) using these relationships and simplifying the result leads to (6.56). $\Box$

6.3 Maximum Likelihood Estimation

The estimation of the parameters that specify the state-space model, (6.3) and (6.4), is quite involved. We use $\Theta = \{\mu_0, \Sigma_0, \Phi, Q, R, \Upsilon, \Gamma\}$ to represent the vector of parameters containing the elements of the initial mean and covariance $\mu_0$ and $\Sigma_0$, the transition matrix $\Phi$, the state and observation covariance matrices $Q$ and $R$, and the input coefficient matrices, $\Upsilon$ and $\Gamma$. We use maximum likelihood under the assumption that the initial state is normal, $x_0 \sim N(\mu_0, \Sigma_0)$, and the errors $w_1, \ldots, w_n$ and $v_1, \ldots, v_n$ are jointly normal and uncorrelated vector variables. We continue to assume, for simplicity, $\{w_t\}$ and $\{v_t\}$ are uncorrelated.

The likelihood is computed using the innovations $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$, defined by (6.24),
$$\varepsilon_t = y_t - A_t x_t^{t-1} - \Gamma u_t.$$
The innovations form of the likelihood function, which was first given by Schweppe (1965), is obtained using an argument similar to the one leading to (3.116) and proceeds by noting the innovations are independent Gaussian random vectors with zero means and, as shown in (6.25), covariance matrices
$$\Sigma_t = A_t P_t^{t-1}A_t' + R. \tag{6.61}$$
Hence, ignoring a constant, we may write the likelihood, $L_Y(\Theta)$, as
$$-\ln L_Y(\Theta) = \frac{1}{2}\sum_{t=1}^n \log\big|\Sigma_t(\Theta)\big| + \frac{1}{2}\sum_{t=1}^n \varepsilon_t(\Theta)'\Sigma_t(\Theta)^{-1}\varepsilon_t(\Theta), \tag{6.62}$$

where we have emphasized the dependence of the innovations on the parameters $\Theta$. Of course, (6.62) is a highly nonlinear and complicated function of the unknown parameters. The usual procedure is to fix $x_0$ and then develop a set of recursions for the log likelihood function and its first two derivatives (for


example, Gupta and Mehra, 1974). Then, a Newton–Raphson algorithm (seeExample 3.29) can be used successively to update the parameter values untilthe negative of the log likelihood is minimized. This approach is advocated, forexample, by Jones (1980), who developed ARMA estimation by putting theARMA model in state-space form. For the univariate case, (6.62) is identical,in form, to the likelihood for the ARMA model given in (3.116).
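Before turning to the Newton–Raphson steps, it may help to see how the criterion (6.62) is accumulated once the filter has produced the innovations. The sketch below is ours, not part of the book's scripts: it assumes hypothetical lists eps (the innovations $\varepsilon_t$) and Sig (their covariance matrices $\Sigma_t$), as would be produced by one pass of the Kalman filter at a fixed parameter value.

# accumulate -ln L_Y(Theta) from innovations and their covariances, per (6.62)
neg_loglik <- function(eps, Sig) {
  like <- 0
  for (t in seq_along(eps)) {
    e    <- as.matrix(eps[[t]])     # q x 1 innovation
    S    <- as.matrix(Sig[[t]])     # q x q innovation covariance
    like <- like + 0.5 * (log(det(S)) + t(e) %*% solve(S) %*% e)
  }
  drop(like)
}

In the book's scripts (Kfilter0, Kfilter1, Kfilter2) this quantity is computed inside the filter itself and returned as the component like, which is why the Linn functions in the examples below simply return kf$like.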

The steps involved in performing a Newton–Raphson estimation procedure are as follows.
(i) Select initial values for the parameters, say, $\Theta^{(0)}$.
(ii) Run the Kalman filter, Property 6.1, using the initial parameter values, $\Theta^{(0)}$, to obtain a set of innovations and error covariances, say, $\{\varepsilon_t^{(0)};\, t = 1, \ldots, n\}$ and $\{\Sigma_t^{(0)};\, t = 1, \ldots, n\}$.
(iii) Run one iteration of a Newton–Raphson procedure with $-\ln L_Y(\Theta)$ as the criterion function (refer to Example 3.29 for details), to obtain a new set of estimates, say $\Theta^{(1)}$.
(iv) At iteration $j$, $(j = 1, 2, \ldots)$, repeat step (ii) using $\Theta^{(j)}$ in place of $\Theta^{(j-1)}$ to obtain a new set of innovation values $\{\varepsilon_t^{(j)};\, t = 1, \ldots, n\}$ and $\{\Sigma_t^{(j)};\, t = 1, \ldots, n\}$. Then repeat step (iii) to obtain a new estimate $\Theta^{(j+1)}$. Stop when the estimates or the likelihood stabilize; for example, stop when the values of $\Theta^{(j+1)}$ differ from $\Theta^{(j)}$, or when $L_Y(\Theta^{(j+1)})$ differs from $L_Y(\Theta^{(j)})$, by some predetermined, but small amount.

Example 6.6 Newton–Raphson for Example 6.3
In this example, we generated $n = 100$ observations, $y_1, \ldots, y_{100}$, using the model in Example 6.3, to perform a Newton–Raphson estimation of the parameters $\phi$, $\sigma_w^2$, and $\sigma_v^2$. In the notation of §6.2, we would have $\Phi = \phi$, $Q = \sigma_w^2$ and $R = \sigma_v^2$. The actual values of the parameters are $\phi = .8$, $\sigma_w^2 = \sigma_v^2 = 1$.
In the simple case of an AR(1) with observational noise, initial estimation can be accomplished using the results of Example 6.3. For example, using (6.16), we set
$$\phi^{(0)} = \rho_y(2)/\rho_y(1).$$
Similarly, from (6.15), $\gamma_x(1) = \gamma_y(1) = \phi\sigma_w^2/(1-\phi^2)$, so that, initially, we set
$$\sigma_w^{2(0)} = \big(1 - \phi^{(0)2}\big)\gamma_y(1)/\phi^{(0)}.$$
Finally, using (6.14) we obtain an initial estimate of $\sigma_v^2$, namely,
$$\sigma_v^{2(0)} = \gamma_y(0) - \big[\sigma_w^{2(0)}\big/\big(1 - \phi^{(0)2}\big)\big].$$

Newton–Raphson estimation was accomplished using the R program optim. The code used for this example is given below. In that program, we must provide an evaluation of the function to be minimized, namely,


$-\ln L_Y(\Theta)$. In this case, the function call combines steps (ii) and (iii), using the current values of the parameters, $\Theta^{(j-1)}$, to obtain first the filtered values, then the innovation values, and then calculating the criterion function, $-\ln L_Y(\Theta^{(j-1)})$, to be minimized. We can also provide analytic forms of the gradient or score vector, $-\partial \ln L_Y(\Theta)/\partial\Theta$, and the Hessian matrix, $-\partial^2 \ln L_Y(\Theta)/\partial\Theta\,\partial\Theta'$, in the optimization routine, or allow the program to calculate these values numerically. In this example, we let the program proceed numerically, and we note the need to be cautious when calculating gradients numerically. For better stability, we can also provide an iterative solution for obtaining analytic gradients and Hessians of the log likelihood function; for details, see Problems 6.11 and 6.12 and Gupta and Mehra (1974).

# Generate Data
set.seed(999); num = 100; N = num+1
x = arima.sim(n=N, list(ar = .8, sd=1))
y = ts(x[-1] + rnorm(num,0,1))
# Initial Estimates
u = ts.intersect(y, lag(y,-1), lag(y,-2))
varu = var(u); coru = cor(u)
phi = coru[1,3]/coru[1,2]
q = (1-phi^2)*varu[1,2]/phi; r = varu[1,1] - q/(1-phi^2)
(init.par = c(phi, sqrt(q), sqrt(r)))   # = .91, .51, 1.03
# Function to evaluate the likelihood
Linn=function(para){
  phi = para[1]; sigw = para[2]; sigv = para[3]
  Sigma0 = (sigw^2)/(1-phi^2); Sigma0[Sigma0<0]=0
  kf = Kfilter0(num, y, 1, mu0=0, Sigma0, phi, sigw, sigv)
  return(kf$like) }
# Estimation (partial output shown)
(est = optim(init.par, Linn, gr=NULL, method="BFGS", hessian=TRUE))
SE = sqrt(diag(solve(est$hessian)))
cbind(estimate=c(phi=est$par[1],sigw=est$par[2],sigv=est$par[3]),SE)
         estimate         SE
  phi   0.8137623 0.08060636
  sigw  0.8507863 0.17528895
  sigv  0.8743968 0.14293192

As seen from the output, the final estimates, along with their standard errors (in parentheses), are $\hat\phi = .81\ (.08)$, $\hat\sigma_w = .85\ (.18)$, $\hat\sigma_v = .87\ (.14)$. Adding control=list(trace=1, REPORT=1) to the optim call yielded the following results of the estimation procedure:

initial value 79.601468

iter 2 value 79.060391

iter 3 value 79.034121

iter 4 value 79.032615

iter 5 value 79.014817

iter 6 value 79.014453

final value 79.014452


[Figure 6.4 about here: Temperature Deviations plotted against Time, 1880–2000; see the caption below.]

Fig. 6.4. Plot for Example 6.7. The thin solid and dashed lines are the two average global temperature deviations shown in Figure 6.2. The thick solid line is the estimated smoother $x_t^n$.

Note that the algorithm converged in six or seven steps, with the final value of the negative of the log likelihood being 79.014452. The standard errors are a byproduct of the estimation procedure, and we will discuss their evaluation later in this section, after Property 6.4.

Example 6.7 Newton–Raphson for Example 6.2
In Example 6.2, we considered two different global temperature series of $n = 130$ observations each, and they are plotted in Figure 6.2. In that example, we argued that both series should be measuring the same underlying climatic signal, $x_t$, which we model as a random walk with drift,
$$x_t = \delta + x_{t-1} + w_t.$$
Recall that the observation equation was written as
$$\begin{pmatrix} y_{t1} \\ y_{t2} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} x_t + \begin{pmatrix} v_{t1} \\ v_{t2} \end{pmatrix},$$
and the model covariance matrices are given by $Q = q_{11}$ and
$$R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}.$$
Hence, there are five parameters to estimate: $\delta$, the drift, and the variance components, $q_{11}$, $r_{11}$, $r_{12}$, $r_{22}$, noting that $r_{21} = r_{12}$. We hold the initial state parameters fixed in this example at $\mu_0 = -.26$ and $\Sigma_0 = .01$.


The final estimates are $\hat\delta = .006$, $\hat q_{11} = .033$, $\hat r_{11} = .007$, $\hat r_{12} = .01$, $\hat r_{22} = .02$, with all values being significant. The observations and the smoothed estimate of the signal, $x_t^n$, are displayed in Figure 6.4. The R code, which uses Kfilter1 and Ksmooth1, is as follows.

# Setup
y = cbind(gtemp, gtemp2); num = nrow(y); input = rep(1,num)
A = array(rep(1,2), dim=c(2,1,num))
mu0 = -.26; Sigma0 = .01; Phi = 1
# Function to Calculate Likelihood
Linn=function(para){
  cQ = para[1]      # sigma_w
  cR1 = para[2]     # 11 element of chol(R)
  cR2 = para[3]     # 22 element of chol(R)
  cR12 = para[4]    # 12 element of chol(R)
  cR = matrix(c(cR1,0,cR12,cR2),2)   # put the matrix together
  drift = para[5]
  kf = Kfilter1(num,y,A,mu0,Sigma0,Phi,drift,0,cQ,cR,input)
  return(kf$like) }
# Estimation
init.par = c(.1,.1,.1,0,.05)   # initial values of parameters
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
     control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
# display estimates
u = cbind(estimate=est$par, SE)
rownames(u)=c("sigw","cR11", "cR22", "cR12", "drift"); u
            estimate          SE
  sigw   0.032730315 0.007473594
  cR11   0.084752492 0.007815219
  cR22   0.070864957 0.005732578
  cR12   0.122458872 0.014867006
  drift  0.005852047 0.002919058
# Smooth (first set parameters to their final estimates)
cQ = est$par[1]
cR1 = est$par[2]
cR2 = est$par[3]
cR12 = est$par[4]
cR = matrix(c(cR1,0,cR12,cR2), 2)
(R = t(cR)%*%cR)   # to view the estimated R matrix
drift = est$par[5]
ks = Ksmooth1(num,y,A,mu0,Sigma0,Phi,drift,0,cQ,cR,input)
# Plot
xsmooth = ts(as.vector(ks$xs), start=1880)
plot(xsmooth, lwd=2, ylim=c(-.5,.8), ylab="Temperature Deviations")
lines(gtemp, col="blue", lty=1)    # color helps here
lines(gtemp2, col="red", lty=2)


In addition to Newton–Raphson, Shumway and Stoffer (1982) presented a conceptually simpler estimation procedure based on the EM (expectation-maximization) algorithm (Dempster et al., 1977). For the sake of brevity, we ignore the inputs and consider the model in the form of (6.1) and (6.2). The basic idea is that if we could observe the states, $X_n = \{x_0, x_1, \ldots, x_n\}$, in addition to the observations $Y_n = \{y_1, \ldots, y_n\}$, then we would consider $\{X_n, Y_n\}$ as the complete data, with the joint density

$$f_\Theta(X_n, Y_n) = f_{\mu_0,\Sigma_0}(x_0)\prod_{t=1}^n f_{\Phi,Q}(x_t \mid x_{t-1})\prod_{t=1}^n f_R(y_t \mid x_t). \tag{6.63}$$
Under the Gaussian assumption and ignoring constants, the complete data likelihood, (6.63), can be written as
$$
-2\ln L_{X,Y}(\Theta) = \ln|\Sigma_0| + (x_0 - \mu_0)'\Sigma_0^{-1}(x_0 - \mu_0)
+ n\ln|Q| + \sum_{t=1}^n (x_t - \Phi x_{t-1})'Q^{-1}(x_t - \Phi x_{t-1})
+ n\ln|R| + \sum_{t=1}^n (y_t - A_t x_t)'R^{-1}(y_t - A_t x_t). \tag{6.64}
$$

Thus, in view of (6.64), if we did have the complete data, we could then use the results from multivariate normal theory to easily obtain the MLEs of $\Theta$. We do not have the complete data; however, the EM algorithm gives us an iterative method for finding the MLEs of $\Theta$ based on the incomplete data, $Y_n$, by successively maximizing the conditional expectation of the complete data likelihood. To implement the EM algorithm, we write, at iteration $j$, $(j = 1, 2, \ldots)$,
$$Q\big(\Theta \mid \Theta^{(j-1)}\big) = E\big\{-2\ln L_{X,Y}(\Theta) \mid Y_n, \Theta^{(j-1)}\big\}. \tag{6.65}$$

Calculation of (6.65) is the expectation step. Of course, given the current value of the parameters, $\Theta^{(j-1)}$, we can use Property 6.2 to obtain the desired conditional expectations as smoothers. This property yields
$$
Q\big(\Theta \mid \Theta^{(j-1)}\big) = \ln|\Sigma_0| + \mathrm{tr}\big\{\Sigma_0^{-1}\big[P_0^n + (x_0^n - \mu_0)(x_0^n - \mu_0)'\big]\big\}
+ n\ln|Q| + \mathrm{tr}\big\{Q^{-1}\big[S_{11} - S_{10}\Phi' - \Phi S_{10}' + \Phi S_{00}\Phi'\big]\big\}
+ n\ln|R| + \mathrm{tr}\Big\{R^{-1}\sum_{t=1}^n\big[(y_t - A_t x_t^n)(y_t - A_t x_t^n)' + A_t P_t^n A_t'\big]\Big\}, \tag{6.66}
$$
where
$$S_{11} = \sum_{t=1}^n\big(x_t^n x_t^{n\prime} + P_t^n\big), \tag{6.67}$$


$$S_{10} = \sum_{t=1}^n\big(x_t^n x_{t-1}^{n\prime} + P_{t,t-1}^n\big), \tag{6.68}$$
and
$$S_{00} = \sum_{t=1}^n\big(x_{t-1}^n x_{t-1}^{n\prime} + P_{t-1}^n\big). \tag{6.69}$$

In (6.66)–(6.69), the smoothers are calculated under the current value of the parameters $\Theta^{(j-1)}$; for simplicity, we have not explicitly displayed this fact.

Minimizing (6.66) with respect to the parameters, at iteration $j$, constitutes the maximization step, and is analogous to the usual multivariate regression approach, which yields the updated estimates
$$\Phi^{(j)} = S_{10}S_{00}^{-1}, \tag{6.70}$$
$$Q^{(j)} = n^{-1}\big(S_{11} - S_{10}S_{00}^{-1}S_{10}'\big), \tag{6.71}$$
and
$$R^{(j)} = n^{-1}\sum_{t=1}^n\big[(y_t - A_t x_t^n)(y_t - A_t x_t^n)' + A_t P_t^n A_t'\big]. \tag{6.72}$$
The updates for the initial mean and variance–covariance matrix are
$$\mu_0^{(j)} = x_0^n \quad \text{and} \quad \Sigma_0^{(j)} = P_0^n, \tag{6.73}$$

obtained from minimizing (6.66).
The overall procedure can be regarded as simply alternating between the Kalman filtering and smoothing recursions and the multivariate normal maximum likelihood estimators, as given by (6.70)–(6.73). Convergence results for the EM algorithm under general conditions can be found in Wu (1983). We summarize the iterative procedure as follows; a schematic R sketch of one iteration is given after the list.

(i) Initialize the procedure by selecting starting values for the parameters $\Theta^{(0)} = \{\mu_0, \Sigma_0, \Phi, Q, R\}$.
On iteration $j$, $(j = 1, 2, \ldots)$:
(ii) Compute the incomplete-data likelihood, $-\ln L_Y(\Theta^{(j-1)})$; see equation (6.62).
(iii) Perform the E-Step. Use Properties 6.1, 6.2, and 6.3 to obtain the smoothed values $x_t^n$, $P_t^n$ and $P_{t,t-1}^n$, for $t = 1, \ldots, n$, using the parameters $\Theta^{(j-1)}$. Use the smoothed values to calculate $S_{11}$, $S_{10}$, $S_{00}$ given in (6.67)–(6.69).
(iv) Perform the M-Step. Update the estimates, $\mu_0$, $\Sigma_0$, $\Phi$, $Q$, and $R$ using (6.70)–(6.73), to obtain $\Theta^{(j)}$.
(v) Repeat Steps (ii)–(iv) to convergence.
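The sketch below shows what one M-step looks like for a univariate state; it is ours, not the book's EM0 script. It assumes the smoothers and the lag-one covariance smoother have already been computed (Properties 6.2 and 6.3) and are available in hypothetical vectors xs ($x_t^n$ for $t = 0, \ldots, n$, so xs[1] holds $x_0^n$), Ps ($P_t^n$), and Pcs ($P_{t,t-1}^n$ for $t = 1, \ldots, n$), and that the observation matrix is $A_t \equiv 1$.

# one M-step, (6.70)-(6.73), for a univariate state with A_t = 1
em_update <- function(y, xs, Ps, Pcs) {
  n   <- length(y)
  x0n <- xs[1];  xn <- xs[-1]            # x_0^n and x_1^n,...,x_n^n
  P0n <- Ps[1];  Pn <- Ps[-1]
  S11 <- sum(xn^2 + Pn)                                   # (6.67)
  S10 <- sum(xn * xs[1:n] + Pcs)                          # (6.68)
  S00 <- sum(xs[1:n]^2 + Ps[1:n])                         # (6.69)
  Phi <- S10 / S00                                        # (6.70)
  Q   <- (S11 - S10^2 / S00) / n                          # (6.71)
  R   <- sum((y - xn)^2 + Pn) / n                         # (6.72) with A_t = 1
  list(Phi = Phi, Q = Q, R = R, mu0 = x0n, Sigma0 = P0n)  # (6.73)
}

Alternating this update with a run of the filter and smoother at the new parameter values, and monitoring $-\ln L_Y(\Theta^{(j)})$ from (6.62), reproduces steps (ii)–(v); the script EM0 used in Example 6.8 packages these steps, for the general model, in a single call.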


Example 6.8 EM Algorithm for Example 6.3
Using the same data generated in Example 6.6, we performed an EM algorithm estimation of the parameters $\phi$, $\sigma_w^2$ and $\sigma_v^2$, as well as the initial parameters $\mu_0$ and $\Sigma_0$, using the script EM0. The convergence rate of the EM algorithm compared with the Newton–Raphson procedure is slow. In this example, convergence was claimed when the relative change in the log likelihood was less than .00001; convergence was attained after 41 iterations. The final estimates, along with their standard errors (in parentheses), were
$$\hat\phi = .80\ (.08), \quad \hat\sigma_w = .87\ (.17), \quad \hat\sigma_v = .84\ (.14),$$
with $\hat\mu_0 = -1.98$ and $\hat\Sigma_0 = .03$. Evaluation of the standard errors used a call to fdHess in the nlme R package to evaluate the Hessian at the final estimates. The nlme package must be loaded prior to the call to fdHess.

library(nlme)   # loads package nlme
# Generate data (same as Example 6.6)
set.seed(999); num = 100; N = num+1
x = arima.sim(n=N, list(ar = .8, sd=1))
y = ts(x[-1] + rnorm(num,0,1))
# Initial Estimates
u = ts.intersect(y, lag(y,-1), lag(y,-2))
varu = var(u); coru = cor(u)
phi = coru[1,3]/coru[1,2]
q = (1-phi^2)*varu[1,2]/phi
r = varu[1,1] - q/(1-phi^2)
cr = sqrt(r); cq = sqrt(q); mu0 = 0; Sigma0 = 2.8
# EM procedure - output not shown
(em = EM0(num, y, 1, mu0, Sigma0, phi, cq, cr, 75, .00001))
# Standard Errors (this uses nlme)
phi = em$Phi; cq = chol(em$Q); cr = chol(em$R)
mu0 = em$mu0; Sigma0 = em$Sigma0
para = c(phi, cq, cr)
Linn = function(para){   # to evaluate likelihood at estimates
  kf = Kfilter0(num, y, 1, mu0, Sigma0, para[1], para[2], para[3])
  return(kf$like) }
emhess = fdHess(para, function(para) Linn(para))
SE = sqrt(diag(solve(emhess$Hessian)))
# Display Summary of Estimation
estimate = c(para, em$mu0, em$Sigma0); SE = c(SE, NA, NA)
u = cbind(estimate, SE)
rownames(u) = c("phi","sigw","sigv","mu0","Sigma0"); u
             estimate         SE
  phi      0.80639903 0.07986272
  sigw     0.86442634 0.16719703
  sigv     0.84276381 0.13805072
  mu0     -1.96010956         NA
  Sigma0   0.03638596         NA


Asymptotic Distribution of the MLEs

The asymptotic distribution of estimators of the model parameters, say, $\hat\Theta_n$, is studied extensively in Caines (1988, Chapters 7 and 8), and in Hannan and Deistler (1988, Chapter 4). In both of these references, the consistency and asymptotic normality of the estimators is established under general conditions. Although we will only state the basic result, some crucial elements are needed to establish large sample properties of the estimators. An essential condition is the stability of the filter. Stability of the filter assures that, for large $t$, the innovations $\varepsilon_t$ are basically copies of each other (that is, independent and identically distributed) with a stable covariance matrix $\Sigma$ that does not depend on $t$ and that, asymptotically, the innovations contain all of the information about the unknown parameters. Although it is not necessary, for simplicity, we shall assume here that $A_t \equiv A$ for all $t$. Details on departures from this assumption can be found in Jazwinski (1970, Sections 7.6 and 7.8). We also drop the inputs and use the model in the form of (6.1) and (6.2).

For stability of the filter, we assume the eigenvalues of $\Phi$ are less than one in absolute value; this assumption can be weakened (for example, see Harvey, 1991, Section 4.3), but we retain it for simplicity. This assumption is enough to ensure the stability of the filter in that, as $t \to \infty$, the filter error covariance matrix $P_t^t$ converges to $P$, the steady-state error covariance matrix, and the gain matrix $K_t$ converges to $K$, the steady-state gain matrix, from which it follows that the innovation variance–covariance matrix $\Sigma_t$ converges to $\Sigma$, the steady-state variance–covariance matrix of the stable innovations; details can be found in Jazwinski (1970, Sections 7.6 and 7.8) and Anderson and Moore (1979, Section 4.4). In particular, the steady-state filter error covariance matrix, $P$, satisfies the Riccati equation
$$P = \Phi\big[P - PA'(APA' + R)^{-1}AP\big]\Phi' + Q;$$
the steady-state gain matrix satisfies $K = PA'[APA' + R]^{-1}$. In Example 6.5, for all practical purposes, stability was reached by the fourth observation.
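One way to see the steady state numerically, in a sketch of our own rather than anything from the book's scripts, is simply to iterate the covariance recursion until it stops changing; the default values below correspond to the local level model of Example 6.5 ($\Phi = A = Q = R = 1$), where the recursion settles almost immediately.

# iterate the error covariance recursion to its steady state (univariate)
steady_state_P <- function(Phi = 1, A = 1, Q = 1, R = 1, P0 = 1, tol = 1e-10) {
  Pp <- Phi * P0 * Phi + Q              # prediction covariance P_1^0
  repeat {
    K    <- Pp * A / (A * Pp * A + R)   # gain
    Pf   <- (1 - K * A) * Pp            # filtered covariance P_t^t
    Pnew <- Phi * Pf * Phi + Q          # next prediction covariance
    if (abs(Pnew - Pp) < tol) break
    Pp <- Pnew
  }
  c(P = Pp, K = K)
}
steady_state_P()   # steady-state prediction covariance and gain

The limit, approximately 1.62, matches the value at which $P_t^{t-1}$ stabilizes in Table 6.1.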

When the process is in steady-state, we may consider $x_{t+1}^t$ as the steady-state predictor and interpret it as $x_{t+1}^t = E(x_{t+1} \mid y_t, y_{t-1}, \ldots)$. As can be seen from (6.19) and (6.21), the steady-state predictor can be written as
$$x_{t+1}^t = \Phi[I - KA]x_t^{t-1} + \Phi K y_t = \Phi x_t^{t-1} + \Phi K\varepsilon_t, \tag{6.74}$$
where $\varepsilon_t$ is the steady-state innovation process given by
$$\varepsilon_t = y_t - E\big(y_t \mid y_{t-1}, y_{t-2}, \ldots\big).$$
In the Gaussian case, $\varepsilon_t \sim \text{iid } N(0, \Sigma)$, where $\Sigma = APA' + R$. In steady-state, the observations can be written as
$$y_t = Ax_t^{t-1} + \varepsilon_t. \tag{6.75}$$


Together, (6.74) and (6.75) make up the steady-state innovations form of the dynamic linear model.

In the following property, we assume the Gaussian state-space model (6.1) and (6.2) is time invariant, i.e., $A_t \equiv A$, the eigenvalues of $\Phi$ are within the unit circle, and the model has the smallest possible dimension (see Hannan and Deistler, 1988, Section 2.3 for details). We denote the true parameters by $\Theta_0$, and we assume the dimension of $\Theta_0$ is the dimension of the parameter space. Although it is not necessary to assume $w_t$ and $v_t$ are Gaussian, certain additional conditions would have to apply and adjustments to the asymptotic covariance matrix would have to be made (see Caines, 1988, Chapter 8).

Property 6.4 Asymptotic Distribution of the Estimators
Under general conditions, let $\hat\Theta_n$ be the estimator of $\Theta_0$ obtained by maximizing the innovations likelihood, $L_Y(\Theta)$, as given in (6.62). Then, as $n \to \infty$,
$$\sqrt{n}\big(\hat\Theta_n - \Theta_0\big) \stackrel{d}{\to} N\big[0,\, \mathcal{I}(\Theta_0)^{-1}\big],$$
where $\mathcal{I}(\Theta)$ is the asymptotic information matrix given by
$$\mathcal{I}(\Theta) = \lim_{n\to\infty} n^{-1}E\big[-\partial^2\ln L_Y(\Theta)\big/\partial\Theta\,\partial\Theta'\big].$$

Precise details and the proof of Property 6.4 are given in Caines (1988, Chapter 7) and in Hannan and Deistler (1988, Chapter 4). For a Newton procedure, the Hessian matrix (as described in Example 6.6) at the time of convergence can be used as an estimate of $n\mathcal{I}(\Theta_0)$ to obtain estimates of the standard errors. In the case of the EM algorithm, no derivatives are calculated, but we may include a numerical evaluation of the Hessian matrix at the time of convergence to obtain estimated standard errors. Also, extensions of the EM algorithm exist, such as the SEM algorithm (Meng and Rubin, 1991), that include a procedure for the estimation of standard errors. In the examples of this section, the estimated standard errors were obtained from the numerical Hessian matrix of $-\ln L_Y(\hat\Theta)$, where $\hat\Theta$ is the vector of parameter estimates at the time of convergence.

6.4 Missing Data Modifications

An attractive feature available within the state-space framework is its ability to treat time series that have been observed irregularly over time. For example, Jones (1980) used the state-space representation to fit ARMA models to series with missing observations, and Palma and Chan (1997) used the model for estimation and forecasting of ARFIMA series with missing observations. Shumway and Stoffer (1982) described the modifications necessary to fit multivariate state-space models via the EM algorithm when data are missing. We will discuss the procedure in detail in this section. Throughout this section, for notational simplicity, we assume the model is of the form (6.1) and (6.2).


Suppose, at a given time $t$, we define the partition of the $q \times 1$ observation vector $y_t = \big(y_t^{(1)\prime},\, y_t^{(2)\prime}\big)'$, where the first $q_{1t} \times 1$ component is observed and the second $q_{2t} \times 1$ component is unobserved, $q_{1t} + q_{2t} = q$. Then, write the partitioned observation equation
$$\begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix} = \begin{bmatrix} A_t^{(1)} \\ A_t^{(2)} \end{bmatrix} x_t + \begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix}, \tag{6.76}$$
where $A_t^{(1)}$ and $A_t^{(2)}$ are, respectively, the $q_{1t} \times p$ and $q_{2t} \times p$ partitioned observation matrices, and
$$\mathrm{cov}\begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix} = \begin{bmatrix} R_{11t} & R_{12t} \\ R_{21t} & R_{22t} \end{bmatrix} \tag{6.77}$$
denotes the covariance matrix of the measurement errors between the observed and unobserved parts.

In the missing data case where $y_t^{(2)}$ is not observed, we may modify the observation equation in the DLM, (6.1)–(6.2), so that the model is
$$x_t = \Phi x_{t-1} + w_t \quad \text{and} \quad y_t^{(1)} = A_t^{(1)}x_t + v_t^{(1)}, \tag{6.78}$$
where now, the observation equation is $q_{1t}$-dimensional at time $t$. In this case, it follows directly from Corollary 6.1 that the filter equations hold with the appropriate notational substitutions. If there are no observations at time $t$, then set the gain matrix, $K_t$, to the $p \times q$ zero matrix in Property 6.1, in which case $x_t^t = x_t^{t-1}$ and $P_t^t = P_t^{t-1}$.

Rather than deal with varying observational dimensions, it is computationally easier to modify the model by zeroing out certain components and retaining a $q$-dimensional observation equation throughout. In particular, Corollary 6.1 holds for the missing data case if, at update $t$, we substitute
$$y_{(t)} = \begin{pmatrix} y_t^{(1)} \\ 0 \end{pmatrix}, \quad A_{(t)} = \begin{bmatrix} A_t^{(1)} \\ 0 \end{bmatrix}, \quad R_{(t)} = \begin{bmatrix} R_{11t} & 0 \\ 0 & I_{22t} \end{bmatrix}, \tag{6.79}$$
for $y_t$, $A_t$, and $R$, respectively, in (6.21)–(6.23), where $I_{22t}$ is the $q_{2t} \times q_{2t}$ identity matrix. With the substitutions (6.79), the innovation values (6.24) and (6.25) will now be of the form
$$\varepsilon_{(t)} = \begin{pmatrix} \varepsilon_t^{(1)} \\ 0 \end{pmatrix}, \quad \Sigma_{(t)} = \begin{bmatrix} A_t^{(1)}P_t^{t-1}A_t^{(1)\prime} + R_{11t} & 0 \\ 0 & I_{22t} \end{bmatrix}, \tag{6.80}$$
so that the innovations form of the likelihood given in (6.62) is correct for this case. Hence, with the substitutions in (6.79), maximum likelihood estimation via the innovations likelihood can proceed as in the complete data case.
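As a small illustration of the substitution (6.79), the sketch below (our own, with made-up values) builds $y_{(t)}$, $A_{(t)}$, and $R_{(t)}$ for a single time point with $q = 3$ observations of which only the first two are available, so $q_{1t} = 2$ and $q_{2t} = 1$.

# zero-out construction (6.79) for one time point: q = 3, third component missing
A_t  <- matrix(1, 3, 1)                # full 3 x 1 observation matrix (p = 1 here)
R    <- diag(c(.25, .25, .25))         # full measurement error covariance
y_t  <- c(1.2, 0.8, NA)                # third component unobserved
obs  <- !is.na(y_t)                    # indicator of observed components

y_sub <- ifelse(obs, y_t, 0)           # y_(t): missing entries set to zero
A_sub <- A_t * obs                     # A_(t): rows for missing entries zeroed
R_sub <- R * outer(obs, obs)           # keep the R_11t block...
diag(R_sub)[!obs] <- 1                 # ...and put I_22t in the missing block
y_sub; A_sub; R_sub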

Once the missing data filtered values have been obtained, Stoffer (1982) also established that the smoother values can be processed using Properties 6.2


and 6.3 with the values obtained from the missing data-filtered values. In the missing data case, the state estimators are denoted
$$x_t^{(s)} = E\big(x_t \mid y_1^{(1)}, \ldots, y_s^{(1)}\big), \tag{6.81}$$
with error variance–covariance matrix
$$P_t^{(s)} = E\big\{\big(x_t - x_t^{(s)}\big)\big(x_t - x_t^{(s)}\big)'\big\}. \tag{6.82}$$
The missing data lag-one smoother covariances will be denoted by $P_{t,t-1}^{(n)}$.

The maximum likelihood estimators in the EM procedure require further modifications for the case of missing data. Now, we consider
$$Y_n^{(1)} = \big\{y_1^{(1)}, \ldots, y_n^{(1)}\big\} \tag{6.83}$$
as the incomplete data, and $X_n$, $Y_n$, as defined in (6.63), as the complete data. In this case, the complete data likelihood, (6.63), or equivalently (6.64), is the same, but to implement the E-step, at iteration $j$, we must calculate
$$
Q\big(\Theta \mid \Theta^{(j-1)}\big) = E\big\{-2\ln L_{X,Y}(\Theta) \mid Y_n^{(1)}, \Theta^{(j-1)}\big\}
$$
$$
= E_*\big\{\ln|\Sigma_0| + \mathrm{tr}\,\Sigma_0^{-1}(x_0 - \mu_0)(x_0 - \mu_0)' \mid Y_n^{(1)}\big\}
+ E_*\Big\{n\ln|Q| + \sum_{t=1}^n \mathrm{tr}\big[Q^{-1}(x_t - \Phi x_{t-1})(x_t - \Phi x_{t-1})'\big] \;\Big|\; Y_n^{(1)}\Big\}
+ E_*\Big\{n\ln|R| + \sum_{t=1}^n \mathrm{tr}\big[R^{-1}(y_t - A_t x_t)(y_t - A_t x_t)'\big] \;\Big|\; Y_n^{(1)}\Big\}, \tag{6.84}
$$

where $E_*$ denotes the conditional expectation under $\Theta^{(j-1)}$ and tr denotes trace. The first two terms in (6.84) will be like the first two terms of (6.66) with the smoothers $x_t^n$, $P_t^n$, and $P_{t,t-1}^n$ replaced by their missing data counterparts, $x_t^{(n)}$, $P_t^{(n)}$, and $P_{t,t-1}^{(n)}$. What changes in the missing data case is the third term of (6.84), where we must evaluate $E_*\big(y_t^{(2)} \mid Y_n^{(1)}\big)$ and $E_*\big(y_t^{(2)}y_t^{(2)\prime} \mid Y_n^{(1)}\big)$. In Stoffer (1982), it is shown that

$$
E_*\big\{(y_t - A_t x_t)(y_t - A_t x_t)' \mid Y_n^{(1)}\big\}
= \begin{pmatrix} y_t^{(1)} - A_t^{(1)}x_t^{(n)} \\[2pt] R_{*21t}R_{*11t}^{-1}\big(y_t^{(1)} - A_t^{(1)}x_t^{(n)}\big) \end{pmatrix}
\begin{pmatrix} y_t^{(1)} - A_t^{(1)}x_t^{(n)} \\[2pt] R_{*21t}R_{*11t}^{-1}\big(y_t^{(1)} - A_t^{(1)}x_t^{(n)}\big) \end{pmatrix}'
$$
$$
+ \begin{pmatrix} A_t^{(1)} \\ R_{*21t}R_{*11t}^{-1}A_t^{(1)} \end{pmatrix} P_t^{(n)} \begin{pmatrix} A_t^{(1)} \\ R_{*21t}R_{*11t}^{-1}A_t^{(1)} \end{pmatrix}'
+ \begin{pmatrix} 0 & 0 \\ 0 & R_{*22t} - R_{*21t}R_{*11t}^{-1}R_{*12t} \end{pmatrix}. \tag{6.85}
$$


In (6.85), the values of $R_{*ikt}$, for $i, k = 1, 2$, are the current values specified by $\Theta^{(j-1)}$. In addition, $x_t^{(n)}$ and $P_t^{(n)}$ are the values obtained by running the smoother under the current parameter estimates specified by $\Theta^{(j-1)}$.
In the case in which observed and unobserved components have uncorrelated errors, that is, $R_{*12t}$ is the zero matrix, (6.85) can be simplified to
$$
E_*\big\{(y_t - A_t x_t)(y_t - A_t x_t)' \mid Y_n^{(1)}\big\}
= \big(y_{(t)} - A_{(t)}x_t^{(n)}\big)\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)' + A_{(t)}P_t^{(n)}A_{(t)}' + \begin{pmatrix} 0 & 0 \\ 0 & R_{*22t} \end{pmatrix}, \tag{6.86}
$$

where $y_{(t)}$ and $A_{(t)}$ are defined in (6.79).
In this simplified case, the missing data M-step looks like the M-step given in (6.67)–(6.73). That is, with
$$S_{(11)} = \sum_{t=1}^n\big(x_t^{(n)}x_t^{(n)\prime} + P_t^{(n)}\big), \tag{6.87}$$
$$S_{(10)} = \sum_{t=1}^n\big(x_t^{(n)}x_{t-1}^{(n)\prime} + P_{t,t-1}^{(n)}\big), \tag{6.88}$$
and
$$S_{(00)} = \sum_{t=1}^n\big(x_{t-1}^{(n)}x_{t-1}^{(n)\prime} + P_{t-1}^{(n)}\big), \tag{6.89}$$

where the smoothers are calculated under the present value of the parameters $\Theta^{(j-1)}$ using the missing data modifications, at iteration $j$, the maximization step is
$$\Phi^{(j)} = S_{(10)}S_{(00)}^{-1}, \tag{6.90}$$
$$Q^{(j)} = n^{-1}\big(S_{(11)} - S_{(10)}S_{(00)}^{-1}S_{(10)}'\big), \tag{6.91}$$
and
$$R^{(j)} = n^{-1}\sum_{t=1}^n D_t\Big\{\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)' + A_{(t)}P_t^{(n)}A_{(t)}' + \begin{pmatrix} 0 & 0 \\ 0 & R_{22t}^{(j-1)} \end{pmatrix}\Big\}D_t, \tag{6.92}$$

where $D_t$ is a permutation matrix that reorders the variables at time $t$ in their original order and $y_{(t)}$ and $A_{(t)}$ are defined in (6.79). For example, suppose $q = 3$ and at time $t$, $y_{t2}$ is missing. Then,
$$y_{(t)} = \begin{pmatrix} y_{t1} \\ y_{t3} \\ 0 \end{pmatrix}, \quad A_{(t)} = \begin{bmatrix} A_{t1} \\ A_{t3} \\ 0' \end{bmatrix}, \quad \text{and} \quad D_t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix},$$
where $A_{ti}$ is the $i$th row of $A_t$ and $0'$ is a $1 \times p$ vector of zeros.
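The following fragment (our own illustration, with made-up numbers) constructs this $D_t$ in R and checks that it moves the zeroed-out slot back to the position of the missing second component.

# permutation matrix D_t for q = 3 with y_t2 missing, as in the display above
Dt <- matrix(c(1,0,0,
               0,0,1,
               0,1,0), 3, 3, byrow = TRUE)
y_sub <- c(1.5, -0.3, 0)     # (y_t1, y_t3, 0)', the reordered vector y_(t)
Dt %*% y_sub                 # restores the original ordering (y_t1, 0, y_t3)'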


[Figure 6.5 about here: three panels, WBC (top), PLT (middle), and HCT (bottom), each plotted against day; see the caption below.]

Fig. 6.5. Smoothed values for various components in the blood parameter tracking problem. The actual data are shown as points, the smoothed values are shown as solid lines, and ±3 standard error bounds are shown as dashed lines.

In (6.92), only $R_{11t}$ gets updated, and $R_{22t}$ at iteration $j$ is simply set to its value from the previous iteration, $j-1$. Of course, if we cannot assume $R_{12t} = 0$, (6.92) must be changed accordingly using (6.85), but (6.90) and (6.91) remain the same. As before, the parameter estimates for the initial state are updated as
$$\mu_0^{(j)} = x_0^{(n)} \quad \text{and} \quad \Sigma_0^{(j)} = P_0^{(n)}. \tag{6.93}$$

Example 6.9 Longitudinal Biomedical Data

We consider the biomedical data in Example 6.1, which have portions of the three-dimensional vector missing after the 40th day. The maximum likelihood procedure yielded the estimators


$$\hat\Phi = \begin{bmatrix} .970 & -.022 & .007 \\ .057 & .927 & .006 \\ -1.342 & 2.190 & .792 \end{bmatrix}, \quad
\hat Q = \begin{bmatrix} .018 & -.002 & .018 \\ -.002 & .003 & .028 \\ .018 & .028 & 4.10 \end{bmatrix},$$
and $\hat R = \mathrm{diag}\{.003, .017, .342\}$ for the transition, state error covariance and observation error covariance matrices, respectively. The coupling between the first and second series is relatively weak, whereas the third series, HCT, is strongly related to the first two; that is,
$$\hat x_{t3} = -1.342\,x_{t-1,1} + 2.190\,x_{t-1,2} + .792\,x_{t-1,3}.$$

Hence, the HCT is negatively correlated with white blood count (WBC) and positively correlated with platelet count (PLT). Byproducts of the procedure are estimated trajectories for all three longitudinal series and their respective prediction intervals. In particular, Figure 6.5 shows the data as points, the estimated smoothed values $x_t^{(n)}$ as solid lines, and error bounds, $x_t^{(n)} \pm 2\sqrt{P_t^{(n)}}$, as dotted lines, for critical post-transplant platelet count.

In the following R code we use the script EM1. In this case the observation matrices $A_t$ are either the identity or the zero matrix because either all the series are observed or none of them are observed.

y = cbind(WBC, PLT, HCT)
num = nrow(y)
A = array(0, dim=c(3,3,num))   # make array of obs matrices
for(k in 1:num) if (y[k,1] > 0) A[,,k] = diag(1,3)
# Initial values
mu0 = matrix(0, 3, 1)
Sigma0 = diag(c(.1, .1, 1), 3)
Phi = diag(1, 3)
cQ = diag(c(.1, .1, 1), 3)
cR = diag(c(.1, .1, 1), 3)
# EM procedure - output not shown
(em = EM1(num, y, A, mu0, Sigma0, Phi, 0, 0, cQ, cR, 0, 100, .001))
# Graph smoother
ks = Ksmooth1(num, y, A, em$mu0, em$Sigma0, em$Phi, 0, 0,
     chol(em$Q), chol(em$R), 0)
y1s = ks$xs[1,,]; y2s = ks$xs[2,,]; y3s = ks$xs[3,,]
p1 = 2*sqrt(ks$Ps[1,1,])
p2 = 2*sqrt(ks$Ps[2,2,])
p3 = 2*sqrt(ks$Ps[3,3,])
par(mfrow=c(3,1), mar=c(4,4,1,1)+.2)
plot(WBC, type="p", pch=19, ylim=c(1,5), xlab="day")
lines(y1s); lines(y1s+p1, lty=2); lines(y1s-p1, lty=2)
plot(PLT, type="p", ylim=c(3,6), pch=19, xlab="day")
lines(y2s); lines(y2s+p2, lty=2); lines(y2s-p2, lty=2)
plot(HCT, type="p", pch=19, ylim=c(20,40), xlab="day")
lines(y3s); lines(y3s+p3, lty=2); lines(y3s-p3, lty=2)


6.5 Structural Models: Signal Extraction and Forecasting

In order to develop computing techniques for handling a versatile cross section of possible models, it is necessary to restrict the state-space model somewhat, and we consider one possible class of specializations in this section. The components of the model are taken as linear processes that can be adapted to represent fixed and disturbed trends and periodicities as well as classical autoregressions. The observed series is regarded as being a sum of component signal series. To illustrate the possibilities, we consider an example that shows how to fit a sum of trend, seasonal, and irregular components to the quarterly earnings data that we have considered before.

Example 6.10 Johnson & Johnson Quarterly Earnings

Consider the quarterly earnings series from the U.S. company Johnson & Johnson as given in Figure 1.1. The series is highly nonstationary, and there is both a trend signal that is gradually increasing over time and a seasonal component that cycles every four quarters or once per year. The seasonal component is getting larger over time as well. Transforming into logarithms or even taking the nth root does not seem to make the series stationary, as there is a slight bend to the transformed curve. Suppose, however, we consider the series to be the sum of a trend component, a seasonal component, and a white noise. That is, let the observed series be expressed as

$$y_t = T_t + S_t + v_t, \tag{6.94}$$
where $T_t$ is trend and $S_t$ is the seasonal component. Suppose we allow the trend to increase exponentially; that is,
$$T_t = \phi T_{t-1} + w_{t1}, \tag{6.95}$$
where the coefficient $\phi > 1$ characterizes the increase. Let the seasonal component be modeled as
$$S_t + S_{t-1} + S_{t-2} + S_{t-3} = w_{t2}, \tag{6.96}$$
which corresponds to assuming the seasonal component is expected to sum to zero over a complete period or four quarters. To express this model in state-space form, let $x_t = (T_t, S_t, S_{t-1}, S_{t-2})'$ be the state vector so the observation equation (6.2) can be written as
$$y_t = \begin{pmatrix} 1 & 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} T_t \\ S_t \\ S_{t-1} \\ S_{t-2} \end{pmatrix} + v_t,$$
with the state equation written as


[Figure 6.6 about here: three panels over 1960–1980, Trend Component (top), Seasonal Component (middle), and Data (points) and Trend+Season (line) (bottom); see the caption below.]

Fig. 6.6. Estimated trend component, $T_t^n$ (top), estimated seasonal component, $S_t^n$ (middle), and the Johnson and Johnson quarterly earnings series with $T_t^n + S_t^n$ superimposed (bottom).

$$\begin{pmatrix} T_t \\ S_t \\ S_{t-1} \\ S_{t-2} \end{pmatrix} = \begin{bmatrix} \phi & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{pmatrix} T_{t-1} \\ S_{t-1} \\ S_{t-2} \\ S_{t-3} \end{pmatrix} + \begin{pmatrix} w_{t1} \\ w_{t2} \\ 0 \\ 0 \end{pmatrix},$$
where $R = r_{11}$ and
$$Q = \begin{bmatrix} q_{11} & 0 & 0 & 0 \\ 0 & q_{22} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

The model reduces to state-space form, (6.1) and (6.2), with $p = 4$ and $q = 1$. The parameters to be estimated are $r_{11}$, the noise variance in the measurement equation, $q_{11}$ and $q_{22}$, the model variances corresponding to the trend and seasonal components, and $\phi$, the transition parameter that models the growth rate.


Fig. 6.7. A 12-quarter forecast for the Johnson & Johnson quarterly earnings series. The forecasts are shown as a continuation of the data (points connected by a solid line). The dashed lines indicate the upper and lower 95% prediction intervals.

Growth is about 3% per year, and we began with $\phi = 1.03$. The initial mean was fixed at $\mu_0 = (.7, 0, 0, 0)'$, with uncertainty modeled by the diagonal covariance matrix with $\Sigma_{0ii} = .04$, for $i = 1, \ldots, 4$. Initial state covariance values were taken as $q_{11} = .01$ and $q_{22} = .01$. The measurement error covariance was started at $r_{11} = .25$.
After about 20 iterations of Newton–Raphson, the transition parameter estimate was $\hat\phi = 1.035$, corresponding to exponential growth with inflation at about 3.5% per year. The measurement uncertainty was small at $\sqrt{\hat r_{11}} = .0005$, compared with the model uncertainties $\sqrt{\hat q_{11}} = .1397$ and $\sqrt{\hat q_{22}} = .2209$. Figure 6.6 shows the smoothed trend estimate and the exponentially increasing seasonal components. We may also consider forecasting the Johnson & Johnson series, and the result of a 12-quarter forecast is shown in Figure 6.7 as basically an extension of the latter part of the observed data.

This example uses the Kfilter0 and Ksmooth0 scripts as follows.
num = length(jj); A = cbind(1, 1, 0, 0)
# Function to Calculate Likelihood
Linn=function(para){
  Phi = diag(0,4); Phi[1,1] = para[1]
  Phi[2,]=c(0,-1,-1,-1); Phi[3,]=c(0, 1, 0, 0); Phi[4,]=c(0, 0, 1, 0)
  cQ1 = para[2]; cQ2 = para[3]; cR = para[4]   # sqrt of q11, q22, r11
  cQ = diag(0,4); cQ[1,1]=cQ1; cQ[2,2]=cQ2
  kf = Kfilter0(num, jj, A, mu0, Sigma0, Phi, cQ, cR)
  return(kf$like) }


# Initial Parameters
mu0 = c(.7, 0, 0, 0); Sigma0 = diag(.04, 4)
init.par = c(1.03, .1, .1, .5)   # Phi[1,1], the 2 Qs and R
# Estimation
est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
     control=list(trace=1,REPORT=1))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimate=est$par, SE)
rownames(u)=c("Phi11","sigw1","sigw2","sigv"); u
# Smooth
Phi = diag(0,4); Phi[1,1] = est$par[1]
Phi[2,]=c(0,-1,-1,-1); Phi[3,]=c(0,1,0,0); Phi[4,]=c(0,0,1,0)
cQ1 = est$par[2]; cQ2 = est$par[3]; cR = est$par[4]
cQ = diag(1,4); cQ[1,1]=cQ1; cQ[2,2]=cQ2
ks = Ksmooth0(num, jj, A, mu0, Sigma0, Phi, cQ, cR)
# Plot
Tsm = ts(ks$xs[1,,], start=1960, freq=4)
Ssm = ts(ks$xs[2,,], start=1960, freq=4)
p1 = 2*sqrt(ks$Ps[1,1,]); p2 = 2*sqrt(ks$Ps[2,2,])
par(mfrow=c(3,1))
plot(Tsm, main="Trend Component", ylab="Trend")
lines(Tsm+p1, lty=2, col=4); lines(Tsm-p1, lty=2, col=4)
plot(Ssm, main="Seasonal Component", ylim=c(-5,4), ylab="Season")
lines(Ssm+p2, lty=2, col=4); lines(Ssm-p2, lty=2, col=4)
plot(jj, type="p", main="Data (points) and Trend+Season (line)")
lines(Tsm+Ssm)

For forecasting, we use the first part of the filter recursions directly and store the predictions in y and the root mean square prediction errors in rmspe.

n.ahead=12; y = ts(append(jj, rep(0,n.ahead)), start=1960, freq=4)
rmspe = rep(0,n.ahead); x00 = ks$xf[,,num]; P00 = ks$Pf[,,num]
Q = t(cQ)%*%cQ; R = t(cR)%*%(cR)   # see footnote and discussion below
for (m in 1:n.ahead){
  xp = Phi%*%x00; Pp = Phi%*%P00%*%t(Phi)+Q
  sig = A%*%Pp%*%t(A)+R; K = Pp%*%t(A)%*%(1/sig)
  x00 = xp; P00 = Pp-K%*%A%*%Pp
  y[num+m] = A%*%xp; rmspe[m] = sqrt(sig) }
plot(y, type="o", main="", ylab="", ylim=c(5,30), xlim=c(1975,1984))
upp = ts(y[(num+1):(num+n.ahead)]+2*rmspe, start=1981, freq=4)
low = ts(y[(num+1):(num+n.ahead)]-2*rmspe, start=1981, freq=4)
lines(upp, lty=2); lines(low, lty=2); abline(v=1980.75, lty=3)

Note that the Cholesky decomposition of Q does not exist here; however, the diagonal form allows us to use standard deviations for the first two diagonal elements of cQ. Also, when we perform the smoothing part of the example, we set the lower 2×2 diagonal block of the Q matrix equal to the identity matrix; this is done for the inversions in the script and it is only a device, as those values are not used. These technicalities can be avoided using a form of the model that we present in the next section.


6.6 State-Space Models with Correlated Errors

Sometimes it is advantageous to write the state-space model in a slightly different way, as is done by numerous authors; for example, Anderson and Moore (1979) and Hannan and Deistler (1988). Here, we write the state-space model as
$$x_{t+1} = \Phi x_t + \Upsilon u_{t+1} + \Theta w_t, \qquad t = 0, 1, \ldots, n, \tag{6.97}$$
$$y_t = A_t x_t + \Gamma u_t + v_t, \qquad t = 1, \ldots, n, \tag{6.98}$$
where, in the state equation, $x_0 \sim N_p(\mu_0, \Sigma_0)$, $\Phi$ is $p \times p$, $\Upsilon$ is $p \times r$, $\Theta$ is $p \times m$, and $w_t \sim \text{iid } N_m(0, Q)$. In the observation equation, $A_t$ is $q \times p$, $\Gamma$ is $q \times r$, and $v_t \sim \text{iid } N_q(0, R)$. In this model, while $w_t$ and $v_t$ are still white noise series (both independent of $x_0$), we also allow the state noise and observation noise to be correlated at time $t$; that is,
$$\mathrm{cov}(w_s, v_t) = S\,\delta_s^t, \tag{6.99}$$
where $\delta_s^t$ is Kronecker's delta; note that $S$ is an $m \times q$ matrix. The major difference between this form of the model and the one specified by (6.3)–(6.4) is that this model starts the state noise process at $t = 0$ in order to ease the notation related to the concurrent covariance between $w_t$ and $v_t$. Also, the inclusion of the matrix $\Theta$ allows us to avoid using a singular state noise process as was done in Example 6.10.

To obtain the innovations, $\varepsilon_t = y_t - A_t x_t^{t-1} - \Gamma u_t$, and the innovation variance $\Sigma_t = A_t P_t^{t-1}A_t' + R$, in this case, we need the one-step-ahead state predictions. Of course, the filtered estimates will also be of interest, and they will be needed for smoothing. Property 6.2 (the smoother) as displayed in §6.2 still holds. The following property generates the predictor $x_{t+1}^t$ from the past predictor $x_t^{t-1}$ when the noise terms are correlated and exhibits the filter update.

Property 6.5 The Kalman Filter with Correlated Noise
For the state-space model specified in (6.97) and (6.98), with initial conditions $x_1^0$ and $P_1^0$, for $t = 1, \ldots, n$,
$$x_{t+1}^t = \Phi x_t^{t-1} + \Upsilon u_{t+1} + K_t\varepsilon_t, \tag{6.100}$$
$$P_{t+1}^t = \Phi P_t^{t-1}\Phi' + \Theta Q\Theta' - K_t\Sigma_t K_t', \tag{6.101}$$
where $\varepsilon_t = y_t - A_t x_t^{t-1} - \Gamma u_t$ and the gain matrix is given by
$$K_t = \big[\Phi P_t^{t-1}A_t' + \Theta S\big]\big[A_t P_t^{t-1}A_t' + R\big]^{-1}. \tag{6.102}$$
The filter values are given by
$$x_t^t = x_t^{t-1} + P_t^{t-1}A_t'\big[A_t P_t^{t-1}A_t' + R\big]^{-1}\varepsilon_t, \tag{6.103}$$
$$P_t^t = P_t^{t-1} - P_t^{t-1}A_t'\big[A_t P_t^{t-1}A_t' + R\big]^{-1}A_t P_t^{t-1}. \tag{6.104}$$


The derivation of Property 6.5 is similar to the derivation of the Kalman filter in Property 6.1 (Problem 6.18); we note that the gain matrix $K_t$ differs in the two properties. The filter values, (6.103)–(6.104), are symbolically identical to (6.19) and (6.20). To initialize the filter, we note that
$$x_1^0 = E(x_1) = \Phi\mu_0 + \Upsilon u_1, \quad \text{and} \quad P_1^0 = \mathrm{var}(x_1) = \Phi\Sigma_0\Phi' + \Theta Q\Theta'.$$

′ +ΘQΘ′.

In the next two subsections, we show how to use the model (6.97)-(6.98) forfitting ARMAX models and for fitting (multivariate) regression models withautocorrelated errors. To put it succinctly, for ARMAX models, the inputsenter in the state equation and for regression with autocorrelated errors, theinputs enter in the observation equation. It is, of course, possible to combinethe two models and we give an example of this at the end of the section.

6.6.1 ARMAX Models

Consider a $k$-dimensional ARMAX model given by
$$y_t = \Upsilon u_t + \sum_{j=1}^p \Phi_j y_{t-j} + \sum_{k=1}^q \Theta_k v_{t-k} + v_t. \tag{6.105}$$

The observations $y_t$ are a $k$-dimensional vector process, the $\Phi$s and $\Theta$s are $k \times k$ matrices, $\Upsilon$ is $k \times r$, $u_t$ is the $r \times 1$ input, and $v_t$ is a $k \times 1$ white noise process; in fact, (6.105) and (5.98) are identical models, but here, we have written the observations as $y_t$. We now have the following property.

Property 6.6 A State-Space Form of ARMAX
For $p \ge q$, let
$$F = \begin{bmatrix} \Phi_1 & I & 0 & \cdots & 0 \\ \Phi_2 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Phi_{p-1} & 0 & 0 & \cdots & I \\ \Phi_p & 0 & 0 & \cdots & 0 \end{bmatrix}, \quad
G = \begin{bmatrix} \Theta_1 + \Phi_1 \\ \vdots \\ \Theta_q + \Phi_q \\ \Phi_{q+1} \\ \vdots \\ \Phi_p \end{bmatrix}, \quad
H = \begin{bmatrix} \Upsilon \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{6.106}$$
where $F$ is $kp \times kp$, $G$ is $kp \times k$, and $H$ is $kp \times r$. Then, the state-space model given by
$$x_{t+1} = Fx_t + Hu_{t+1} + Gv_t, \tag{6.107}$$
$$y_t = Ax_t + v_t, \tag{6.108}$$
where $A = [\,I, 0, \cdots, 0\,]$ is $k \times pk$ and $I$ is the $k \times k$ identity matrix, implies the ARMAX model (6.105). If $p < q$, set $\Phi_{p+1} = \cdots = \Phi_q = 0$, in which case $p = q$ and (6.107)–(6.108) still apply. Note that the state process is $kp$-dimensional, whereas the observations are $k$-dimensional.


This form of the model is somewhat different than the form suggested in §6.1, equations (6.6)–(6.8). For example, in (6.8), setting $A_t$ equal to the $p \times p$ identity matrix (for all $t$) and setting $R = 0$ implies that the data $y_t$ in (6.8) follow a VAR($m$) process. In doing so, however, we do not make use of the ability to allow for correlated state and observation error, so a singularity is introduced into the system in the form of $R = 0$. The method in Property 6.6 avoids that problem, and points out the fact that the same model can take many forms. We do not prove Property 6.6 directly, but the following example should suggest how to establish the general result.

Example 6.11 Univariate ARMAX(1, 1) in State-Space Form
Consider the univariate ARMAX(1, 1) model
$$y_t = \alpha_t + \phi y_{t-1} + \theta v_{t-1} + v_t,$$
where $\alpha_t = \Upsilon u_t$ to ease the notation. For a simple example, if $\Upsilon = (\beta_0, \beta_1)$ and $u_t = (1, t)'$, the model for $y_t$ would be ARMA(1,1) with linear trend, $y_t = \beta_0 + \beta_1 t + \phi y_{t-1} + \theta v_{t-1} + v_t$. Using Property 6.6, we can write the model as
$$x_{t+1} = \phi x_t + \alpha_{t+1} + (\theta + \phi)v_t, \tag{6.109}$$
and
$$y_t = x_t + v_t. \tag{6.110}$$
In this case, (6.109) is the state equation with $w_t \equiv v_t$ and (6.110) is the observation equation. Consequently, $\mathrm{cov}(w_t, v_t) = \mathrm{var}(v_t) = R$, and $\mathrm{cov}(w_t, v_s) = 0$ when $s \ne t$, so Property 6.5 would apply. To verify that (6.109) and (6.110) specify an ARMAX(1, 1) model, we have
$$
\begin{aligned}
y_t &= x_t + v_t && \text{from (6.110)}\\
    &= \phi x_{t-1} + \alpha_t + (\theta + \phi)v_{t-1} + v_t && \text{from (6.109)}\\
    &= \alpha_t + \phi(x_{t-1} + v_{t-1}) + \theta v_{t-1} + v_t && \text{rearrange terms}\\
    &= \alpha_t + \phi y_{t-1} + \theta v_{t-1} + v_t && \text{from (6.110)}.
\end{aligned}
$$

Together, Properties 6.5 and 6.6 can be used to accomplish maximum likelihood estimation as described in §6.3 for ARMAX models. The ARMAX model is only a special case of the model (6.97)–(6.98), which is quite rich, as will be discovered in the next subsection.
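To illustrate Property 6.6 numerically, the following sketch (ours, not part of the book's scripts) builds $F$, $G$, and $A$ of (6.106)–(6.108) for a univariate ($k = 1$) ARMA(2, 1) model with $\phi_1 = .9$, $\phi_2 = -.2$, and $\theta_1 = .5$; there is no exogenous input, so $H$ is zero and the state is $kp = 2$ dimensional. The coefficient values are arbitrary choices for the illustration.

# state-space matrices of Property 6.6 for a univariate ARMA(2,1)
phi1 = .9; phi2 = -.2; theta1 = .5
F = matrix(c(phi1, 1,
             phi2, 0), 2, 2, byrow = TRUE)   # kp x kp
G = matrix(c(theta1 + phi1,
             phi2), 2, 1)                    # kp x k (Theta_2 = 0 since q < p)
A = matrix(c(1, 0), 1, 2)                    # k x kp
# simulate x_{t+1} = F x_t + G v_t, y_t = A x_t + v_t, and check against arima()
set.seed(90210); n = 500
v = rnorm(n+1); x = matrix(0, 2, n+1); y = numeric(n)
for (t in 1:n) { y[t]     = A %*% x[,t] + v[t]
                 x[,t+1]  = F %*% x[,t] + G * v[t] }
arima(y, order = c(2,0,1), include.mean = FALSE)   # estimates should be near (.9, -.2, .5)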

6.6.2 Multivariate Regression with Autocorrelated Errors

In regression with autocorrelated errors, we are interested in fitting the regression model
$$y_t = \Gamma u_t + \varepsilon_t \tag{6.111}$$
to a $k \times 1$ vector process, $y_t$, with $r$ regressors $u_t = (u_{t1}, \ldots, u_{tr})'$, where $\varepsilon_t$ is vector ARMA($p, q$) and $\Gamma$ is a $k \times r$ matrix of regression parameters. We


note that the regressors do not have to vary with time (e.g., $u_{t1} \equiv 1$ includes a constant in the regression) and that the case $k = 1$ was treated in §5.6.

To put the model in state-space form, we simply notice that $\varepsilon_t = y_t - \Gamma u_t$ is a $k$-dimensional ARMA($p, q$) process. Thus, if we set $H = 0$ in (6.107), and include $\Gamma u_t$ in (6.108), we obtain
$$x_{t+1} = Fx_t + Gv_t, \tag{6.112}$$
$$y_t = \Gamma u_t + Ax_t + v_t, \tag{6.113}$$
where the model matrices $A$, $F$, and $G$ are defined in Property 6.6. The fact that (6.112)–(6.113) is multivariate regression with autocorrelated errors follows directly from Property 6.6 by noticing that together, $x_{t+1} = Fx_t + Gv_t$ and $\varepsilon_t = Ax_t + v_t$ imply $\varepsilon_t = y_t - \Gamma u_t$ is vector ARMA($p, q$).

As in the case of ARMAX models, regression with autocorrelated errors is a special case of the state-space model, and the results of Property 6.5 can be used to obtain the innovations form of the likelihood for parameter estimation.

Example 6.12 Mortality, Temperature and Pollution
In this example, we fit an ARMAX model to the detrended mortality series cmort. As in Examples 5.10 and 5.11, we let $M_t$ denote the weekly cardiovascular mortality series, $T_t$ the corresponding temperature series tempr, and $P_t$ the corresponding particulate series. A preliminary analysis suggests the following considerations (no output is shown):
• An AR(2) model fits well to detrended $M_t$:
  fit = arima(cmort, order=c(2,0,0), xreg=time(cmort))
• The CCF between the mortality residuals, the temperature series and the particulates series, shows a strong correlation with temperature lagged one week ($T_{t-1}$), concurrent particulate level ($P_t$) and the particulate level about one month prior ($P_{t-4}$).
  acf(cbind(dmort <- resid(fit), tempr, part))
  lag.plot2(tempr, dmort, 8)
  lag.plot2(part, dmort, 8)

From these results, we decided to fit the ARMAX model
$$\dot M_t = \phi_1\dot M_{t-1} + \phi_2\dot M_{t-2} + \beta_1 T_{t-1} + \beta_2 P_t + \beta_3 P_{t-4} + v_t \tag{6.114}$$
to the detrended mortality series, $\dot M_t = M_t - (\alpha + \beta_4 t)$, where $v_t \sim \text{iid } N(0, \sigma_v^2)$. To write the model in state-space form using Property 6.6, let
$$x_{t+1} = \Phi x_t + \Upsilon u_{t+1} + \Theta v_t, \qquad t = 0, 1, \ldots, n,$$
$$y_t = \alpha + Ax_t + \Gamma u_t + v_t, \qquad t = 1, \ldots, n,$$
with
$$\Phi = \begin{bmatrix} \phi_1 & 1 \\ \phi_2 & 0 \end{bmatrix}, \quad
\Upsilon = \begin{bmatrix} \beta_1 & \beta_2 & \beta_3 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad
\Theta = \begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix},$$


A = [ 1  0 ], Γ = [ 0  0  0  β_4 ], u_t = (T_t, P_t, P_{t−4}, t)′, and y_t = M_t. Note that the state process is bivariate and the observation process is univariate. We could have included α in Γ; however, to reduce the number of parameters to be estimated numerically, we centered M_t by its sample mean and removed the constant α from the model. In addition, t was centered by its mean. Initial values of the parameters were taken from the preliminary investigation. We note that P_t and P_{t−4} are highly correlated, so orthogonalizing these two inputs would be advantageous (although we did not do it here).

The estimates and standard errors are displayed along with the following R code; investigation of the residuals shows that the model fits well.

dm = cmort - mean(cmort)                  # center mortality
trend = time(cmort) - mean(time(cmort))   # center time
ded = ts.intersect(dm, u1=lag(tempr,-1), u2=part, u3=lag(part,-4),
      u4=trend, dframe=TRUE)
y = ded$dm; input = cbind(ded$u1, ded$u2, ded$u3, ded$u4)
num = length(y); A = array(c(1,0), dim = c(1,2,num))
# Function to Calculate Likelihood
Linn = function(para){
  phi1=para[1]; phi2=para[2]; cR=para[3]
  b1=para[4]; b2=para[5]; b3=para[6]; b4=para[7]
  mu0 = matrix(c(0,0), 2, 1); Sigma0 = diag(100, 2)
  Phi = matrix(c(phi1, phi2, 1, 0), 2)
  Theta = matrix(c(phi1, phi2), 2)
  Ups = matrix(c(b1, 0, b2, 0, b3, 0, 0, 0), 2, 4)
  Gam = matrix(c(0, 0, 0, b4), 1, 4); cQ = cR; S = cR^2
  kf = Kfilter2(num, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ,
       cR, S, input)
  return(kf$like) }
# Estimation
phi1=.4; phi2=.4; cR=5; b1=-.1; b2=.1; b3=.1; b4=-1.5
init.par = c(phi1, phi2, cR, b1, b2, b3, b4)   # initial parameters
est = optim(init.par, Linn, NULL, method="L-BFGS-B", hessian=TRUE,
      control=list(trace=1,REPORT=1))
SE = sqrt(diag(solve(est$hessian)))
# Results
u = cbind(estimate=est$par, SE)
rownames(u)=c("phi1","phi2","sigv","TL1","P","PL4","trnd"); u
            estimate         SE
  phi1   0.31437053 0.03712001
  phi2   0.31777254 0.03825371
  sigv   5.05662192 0.15920440
  TL1   -0.11929669 0.01106674   (beta 1)
  P      0.11935144 0.01746386   (beta 2)
  PL4    0.06715402 0.01844125   (beta 3)
  trnd  -1.34871992 0.21921715   (beta 4)

The residuals can be obtained by running Kfilter2 again at the final estimates; they are returned as innov.
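A minimal sketch of that residual check follows; it assumes the objects created in the code above (y, A, input, num, and est) are still in the workspace, and simply rebuilds the model matrices at the final estimates as Linn does.

p = est$par                                   # final estimates, in Linn's order
Phi = matrix(c(p[1], p[2], 1, 0), 2); Theta = matrix(c(p[1], p[2]), 2)
Ups = matrix(c(p[4], 0, p[5], 0, p[6], 0, 0, 0), 2, 4)
Gam = matrix(c(0, 0, 0, p[7]), 1, 4)
cR = p[3]; cQ = cR; S = cR^2
mu0 = matrix(c(0,0), 2, 1); Sigma0 = diag(100, 2)
kf = Kfilter2(num, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ, cR, S, input)
innov = unlist(kf$innov)                      # the residuals (innovations)
acf(innov)                                    # quick diagnostic: residual ACF
qqnorm(innov)                                 # quick diagnostic: normality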


6.7 Bootstrapping State-Space Models

Although in §6.3 we discussed the fact that under general conditions (which we assume to hold in this section) the MLEs of the parameters of a DLM are consistent and asymptotically normal, time series data are often of short or moderate length. Several researchers have found evidence that samples must be fairly large before asymptotic results are applicable (Dent and Min, 1978; Ansley and Newbold, 1980). Moreover, as we discussed in Example 3.35, problems occur if the parameters are near the boundary of the parameter space. In this section, we discuss an algorithm for bootstrapping state-space models; this algorithm and its justification, including the non-Gaussian case, along with numerous examples, can be found in Stoffer and Wall (1991) and in Stoffer and Wall (2004). In view of §6.6, anything we do or say here about DLMs applies equally to ARMAX models.

Using the DLM given by (6.97)–(6.99) and Property 6.5, we write the innovations form of the filter as

    ε_t = y_t − A_t x_t^{t−1} − Γ u_t,                       (6.115)
    Σ_t = A_t P_t^{t−1} A_t′ + R,                            (6.116)
    K_t = [Φ P_t^{t−1} A_t′ + Θ S] Σ_t^{−1},                 (6.117)
    x_{t+1}^t = Φ x_t^{t−1} + Υ u_{t+1} + K_t ε_t,           (6.118)
    P_{t+1}^t = Φ P_t^{t−1} Φ′ + Θ Q Θ′ − K_t Σ_t K_t′.      (6.119)

This form of the filter is just a rearrangement of the filter given in Property 6.5. In addition, we can rewrite the model to obtain its innovations form,

    x_{t+1}^t = Φ x_t^{t−1} + Υ u_{t+1} + K_t ε_t,           (6.120)
    y_t = A_t x_t^{t−1} + Γ u_t + ε_t.                       (6.121)

This form of the model is a rewriting of (6.115) and (6.118), and it accommodates the bootstrapping algorithm.

As discussed in Example 6.5, although the innovations ε_t are uncorrelated, initially, Σ_t can be vastly different for different time points t. Thus, in a resampling procedure, we can either ignore the first few values of ε_t until Σ_t stabilizes or we can work with the standardized innovations

    e_t = Σ_t^{−1/2} ε_t,                                    (6.122)

so we are guaranteed these innovations have, at least, the same first two moments. In (6.122), Σ_t^{1/2} denotes the unique square root matrix of Σ_t defined by Σ_t^{1/2} Σ_t^{1/2} = Σ_t. In what follows, we base the bootstrap procedure on the standardized innovations, but we stress the fact that, even in this case, ignoring startup values might be necessary, as noted by Stoffer and Wall (1991).
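In the univariate examples below, (6.122) reduces to dividing each innovation by the square root of its variance (e = innov/sqrt(sig), as in the code of Example 6.13). A minimal sketch for the general multivariate case, using a symmetric square root computed from an eigendecomposition, is:

# Standardize a q x 1 innovation eps with covariance Sig, as in (6.122)
std.innov = function(eps, Sig){
  es = eigen(Sig, symmetric=TRUE)
  Sig.inv.sqrt = es$vectors %*% diag(1/sqrt(es$values), nrow=length(es$values)) %*% t(es$vectors)
  drop(Sig.inv.sqrt %*% eps) }
std.innov(c(1, 2), matrix(c(2, .5, .5, 1), 2))   # small numerical example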


The model coefficients and the correlation structure of the model are uniquely parameterized by a k × 1 parameter vector Θ_0; that is, Φ = Φ(Θ_0), Υ = Υ(Θ_0), Q = Q(Θ_0), A_t = A_t(Θ_0), Γ = Γ(Θ_0), and R = R(Θ_0). Recall the innovations form of the Gaussian likelihood (ignoring a constant) is

    −2 ln L_Y(Θ) = Σ_{t=1}^n [ ln |Σ_t(Θ)| + ε_t(Θ)′ Σ_t(Θ)^{−1} ε_t(Θ) ]
                 = Σ_{t=1}^n [ ln |Σ_t(Θ)| + e_t(Θ)′ e_t(Θ) ].                     (6.123)

We stress the fact that it is not necessary for the model to be Gaussian to consider (6.123) as the criterion function to be used for parameter estimation.

Let Θ̂ denote the MLE of Θ_0, that is, Θ̂ = argmax_Θ L_Y(Θ), obtained by the methods discussed in §6.3. Let ε_t(Θ̂) and Σ_t(Θ̂) be the innovation values obtained by running the filter, (6.115)–(6.119), under Θ̂. Once this has been done, the bootstrap procedure is accomplished by the following steps.

(i) Construct the standardized innovations

    e_t(Θ̂) = Σ_t^{−1/2}(Θ̂) ε_t(Θ̂).

(ii) Sample, with replacement, n times from the set {e_1(Θ̂), . . . , e_n(Θ̂)} to obtain {e*_1(Θ̂), . . . , e*_n(Θ̂)}, a bootstrap sample of standardized innovations.

(iii) Construct a bootstrap data set {y*_1, . . . , y*_n} as follows. Define the (p + q) × 1 vector ξ_t = (x_{t+1}^{t}′, y_t′)′. Stacking (6.120) and (6.121) results in a vector first-order equation for ξ_t given by

    ξ_t = F_t ξ_{t−1} + G u_t + H_t e_t,                     (6.124)

where

    F_t = | Φ    0 |,    G = | Υ |,    H_t = | K_t Σ_t^{1/2} |.
          | A_t  0 |         | Γ |           |   Σ_t^{1/2}   |

Thus, to construct the bootstrap data set, solve (6.124) using e*_t(Θ̂) in place of e_t. The exogenous variables u_t and the initial conditions of the Kalman filter remain fixed at their given values, and the parameter vector is held fixed at Θ̂.

(iv) Using the bootstrap data set {y*_t; t = 1, . . . , n}, construct a likelihood, L_{Y*}(Θ), and obtain the MLE of Θ, say, Θ̂*.

(v) Repeat steps 2 through 4, a large number, B, of times, obtaining a bootstrapped set of parameter estimates {Θ̂*_b; b = 1, . . . , B}. The finite sample distribution of Θ̂ − Θ_0 may be approximated by the distribution of Θ̂*_b − Θ̂, b = 1, . . . , B.


Fig. 6.8. Quarterly interest rate for Treasury bills (dashed line) and quarterly inflation rate (solid line) in the Consumer Price Index.

In the next example, we discuss the case of a linear regression model, but where the regression coefficients are stochastic and allowed to vary with time. The state-space model provides a convenient setting for the analysis of such models.

Example 6.13 Stochastic Regression

Figure 6.8 shows the quarterly inflation rate (solid line), y_t, in the Consumer Price Index and the quarterly interest rate recorded for Treasury bills (dashed line), z_t, from the first quarter of 1953 through the second quarter of 1980, n = 110 observations. These data are taken from Newbold and Bos (1985).

In this example, we consider one analysis that was discussed in Newbold and Bos (1985, pp. 61-73), that focused on the first 50 observations and where quarterly inflation was modeled as being stochastically related to quarterly interest rate,

    y_t = α + β_t z_t + v_t,

where α is a fixed constant, β_t is a stochastic regression coefficient, and v_t is white noise with variance σ_v². The stochastic regression term, which comprises the state variable, is specified by a first-order autoregression,

    (β_t − b) = φ(β_{t−1} − b) + w_t,

where b is a constant, and w_t is white noise with variance σ_w². The noise processes, v_t and w_t, are assumed to be uncorrelated.

Using the notation of the state-space model (6.97) and (6.98), we have in the state equation, x_t = β_t, Φ = φ, u_t ≡ 1, Υ = (1 − φ)b, Q = σ_w², and


in the observation equation, A_t = z_t, Γ = α, R = σ_v², and S = 0. The parameter vector is Θ = (φ, α, b, σ_w, σ_v)′. The results of the Newton–Raphson estimation procedure are listed in Table 6.2. Also shown in Table 6.2 are the corresponding standard errors obtained from B = 500 runs of the bootstrap. These standard errors are simply the standard deviations of the bootstrapped estimates, that is, the square root of Σ_{b=1}^B (Θ*_{ib} − Θ̄*_i)² / (B − 1), where Θ_i represents the ith parameter, i = 1, . . . , 5, and Θ̄*_i = Σ_{b=1}^B Θ*_{ib} / B.

Table 6.2. Comparison of Standard Errors

                        Asymptotic       Bootstrap
  Parameter    MLE    Standard Error   Standard Error
    φ          .865        .223             .463
    α         −.686        .487             .557
    b          .788        .226             .821
    σw         .115        .107             .216
    σv        1.135        .147             .340

The asymptotic standard errors listed in Table 6.2 are typically much smaller than those obtained from the bootstrap. For most of the cases, the bootstrapped standard errors are at least 50% larger than the corresponding asymptotic value. Also, asymptotic theory prescribes the use of normal theory when dealing with the parameter estimates. The bootstrap, however, allows us to investigate the small sample distribution of the estimators and, hence, provides more insight into the data analysis.

For example, Figure 6.9 shows the bootstrap distribution of the estimator of φ in the upper left-hand corner. This distribution is highly skewed with values concentrated around .8, but with a long tail to the left. Some quantiles are −.09 (5%), .11 (10%), .34 (25%), .73 (50%), .86 (75%), .96 (90%), .98 (95%), and they can be used to obtain confidence intervals. For example, a 90% confidence interval for φ would be approximated by (−.09, .96). This interval is ridiculously wide and includes 0 as a plausible value of φ; we will interpret this after we discuss the results of the estimation of σ_w.

Figure 6.9 shows the bootstrap distribution of σ_w in the lower right-hand corner. The distribution is concentrated at two locations, one at approximately σ_w = .25 (which is the median of the distribution of values away from 0) and the other at σ_w = 0. The cases in which σ_w ≈ 0 correspond to deterministic state dynamics. When σ_w = 0 and |φ| < 1, then β_t ≈ b for large t, so the approximately 25% of the cases in which σ_w ≈ 0 suggest a fixed state, or constant coefficient model. The cases in which σ_w is away from zero would suggest a truly stochastic regression parameter. To investigate this matter further, the off-diagonals of Figure 6.9 show the joint bootstrapped estimates, (φ, σ_w), for positive values of φ*. The joint distribution suggests σ_w > 0 corresponds to φ ≈ 0. When φ = 0, the state dynamics are given by β_t = b + w_t. If, in addition, σ_w is small relative to b, the system is nearly deterministic; that is, β_t ≈ b. Considering these results, the bootstrap analysis leads us to conclude the dynamics of the data are best described in terms of a fixed regression effect.

Fig. 6.9. Bootstrap distribution, B = 500, of (i) the estimator of φ (upper left), (ii) the estimator of σ_w (lower right), and (iii) the estimators jointly (off-diagonals). Only the values corresponding to φ* ≥ 0 are shown.

The following R code was used for this example. We note that the first line of the code sets the relative tolerance for determining convergence for the numerical optimization. Using the current default setting may result in a long run time of the algorithm and we suggest the value be decreased on slower machines or for demonstration purposes. Also, decreasing the number of bootstrap replicates will improve computation time; for example, setting tol=.001 and nboot=200 yields reasonable results. In this example, we fix the first three values of the data for the resampling scheme.

tol = sqrt(.Machine$double.eps)   # convergence tolerance
nboot = 500                       # number of bootstrap replicates
y = window(qinfl, c(1953,1), c(1965,2))   # inflation
z = window(qintr, c(1953,1), c(1965,2))   # interest
num = length(y); input = matrix(1, num, 1)
A = array(z, dim=c(1,1,num))
# Function to Calculate Likelihood
Linn = function(para){
  phi=para[1]; alpha=para[2]; b=para[3]; Ups=(1-phi)*b
  cQ=para[4]; cR=para[5]
  kf = Kfilter2(num,y,A,mu0,Sigma0,phi,Ups,alpha,1,cQ,cR,0,input)
  return(kf$like) }
# Parameter Estimation
mu0=1; Sigma0=.01; phi=.84; alpha=-.77; b=.85; cQ=.12; cR=1.1
init.par = c(phi, alpha, b, cQ, cR)   # initial parameters
est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
      control=list(trace=1,REPORT=1,reltol=tol))
SE = sqrt(diag(solve(est$hessian)))
phi = est$par[1]; alpha=est$par[2]; b=est$par[3]; Ups=(1-phi)*b
cQ=est$par[4]; cR=est$par[5]
rbind(estimate=est$par, SE)
                 phi      alpha         b   sigma_w   sigma_v
  estimate 0.8653348 -0.6855891 0.7879308 0.1145682 1.1353139
  SE       0.2231382  0.4865775 0.2255649 0.1071838 0.1472067
# BEGIN BOOTSTRAP
Linn2 = function(para){   # likelihood for bootstrapped data
  phi=para[1]; alpha=para[2]; b=para[3]; Ups=(1-phi)*b
  cQ=para[4]; cR=para[5]
  kf = Kfilter2(num,y.star,A,mu0,Sigma0,phi,Ups,alpha,1,cQ,cR,0,input)
  return(kf$like) }
# Run the filter at the estimates
kf = Kfilter2(num,y,A,mu0,Sigma0,phi,Ups,alpha,1,cQ,cR,0,input)
# Pull out necessary values from the filter and initialize
xp=kf$xp; innov=kf$innov; sig=kf$sig; K=kf$K; e=innov/sqrt(sig)
e.star=e; y.star=y; xp.star=xp; k=4:50
para.star = matrix(0, nboot, 5)   # to store estimates
init.par = c(.84,-.77,.85,.12,1.1)
for (i in 1:nboot){cat("iteration:", i, "\n")
  e.star[k] = sample(e[k], replace=TRUE)
  for (j in k){
    xp.star[j] = phi*xp.star[j-1]+Ups+K[j]*sqrt(sig[j])*e.star[j] }
  y.star[k] = z[k]*xp.star[k]+alpha+sqrt(sig[k])*e.star[k]
  est.star = optim(init.par, Linn2, NULL, method="BFGS",
        control=list(reltol=tol))
  para.star[i,] = cbind(est.star$par[1], est.star$par[2],
        est.star$par[3], abs(est.star$par[4]), abs(est.star$par[5]))}
rmse = rep(NA,5)   # compute bootstrapped SEs (Table 6.2)
for(i in 1:5){rmse[i] = sqrt(sum((para.star[,i]-est$par[i])^2)/nboot)
  cat(i, rmse[i],"\n") }
  1 0.46294 | 2 0.55698 | 3 0.82148 | 4 0.21595 | 5 0.34011
# Plot for phi vs sigma_w
phi = para.star[,1]; sigw = abs(para.star[,4])
phi = ifelse(phi<0, NA, phi)   # any phi < 0 not plotted
panel.hist <- function(x, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, ...)}
u = cbind(phi, sigw); colnames(u) = c("f","s")
pairs(u, cex=1.5, pch=1, diag.panel=panel.hist, cex.labels=1.5,
      font.labels=5)
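As a hedged sketch using the para.star matrix created above, the bootstrap quantiles and the approximate 90% interval for φ quoted earlier, as well as standard errors computed as standard deviations of the bootstrap estimates (the description used for Table 6.2), can be obtained along the following lines:

quantile(para.star[,1], probs=c(.05,.10,.25,.50,.75,.90,.95))  # phi quantiles
quantile(para.star[,1], probs=c(.05,.95))   # approximate 90% interval for phi
apply(para.star, 2, sd)                     # SEs as SDs of the bootstrap estimates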

6.8 Dynamic Linear Models with Switching

The problem of modeling changes in regimes for vector-valued time series has been of interest in many different fields. In §5.5, we explored the idea that the dynamics of the system of interest might change over the course of time. In Example 5.6, we saw that pneumonia and influenza mortality rates behave differently when a flu epidemic occurs than when no epidemic occurs. As another example, some authors (for example, Hamilton, 1989, or McCulloch and Tsay, 1993) have explored the possibility the dynamics of the quarterly U.S. GNP series (say, y_t) analyzed in Example 3.35 might be different during expansion (∇ log y_t > 0) than during contraction (∇ log y_t < 0). In this section, we will concentrate on the method presented in Shumway and Stoffer (1991). One way of modeling change in an evolving time series is by assuming the dynamics of some underlying model changes discontinuously at certain undetermined points in time. Our starting point is the DLM given by (6.1) and (6.2), namely,

    x_t = Φ x_{t−1} + w_t,                                   (6.125)

to describe the p × 1 state dynamics, and

    y_t = A_t x_t + v_t                                      (6.126)

to describe the q × 1 observation dynamics. Recall w_t and v_t are Gaussian white noise sequences with var(w_t) = Q, var(v_t) = R, and cov(w_t, v_s) = 0 for all s and t.

Generalizations of (6.125) and (6.126) to include the possibility of changes occurring over time have been approached by allowing changes in the error covariances (Harrison and Stevens, 1976; Gordon and Smith, 1988, 1990) or by assigning mixture distributions to the observation errors v_t (Peña and Guttman, 1988). Approximations to filtering were derived in all of the aforementioned articles. An application to monitoring renal transplants was described in Smith and West (1983) and in Gordon and Smith (1990). Changes can also be modeled in the classical regression case by allowing switches in the design matrix, as in Quandt (1972).

Switching via a stationary Markov chain with independent observations has been developed by Lindgren (1978) and Goldfeld and Quandt (1973). In the Markov chain approach, we declare the dynamics of the system at time t are generated by one of m possible regimes evolving according to a Markov chain over time.

As a simple example, suppose the dynamics of a univariate time series, y_t, is generated by either the model (1) y_t = β_1 y_{t−1} + w_t or the model (2) y_t = β_2 y_{t−1} + w_t. We will write the model as y_t = φ_t y_{t−1} + w_t such that Pr(φ_t = β_j) = π_j, j = 1, 2, π_1 + π_2 = 1, and with the Markov property

    Pr(φ_t = β_j | φ_{t−1} = β_i, φ_{t−2} = β_{i_2}, . . .) = Pr(φ_t = β_j | φ_{t−1} = β_i) = π_{ij},

for i, j = 1, 2 (and i_2, . . . = 1, 2). As previously mentioned, Markov switching for dependent data has been applied by Hamilton (1989) to detect changes between positive and negative growth periods in the economy. Applications to speech recognition have been considered by Juang and Rabiner (1985). The case in which the particular regime is unknown to the observer comes under the heading of hidden Markov models, and the techniques related to analyzing these models are summarized in Rabiner and Juang (1986). An application of the idea of switching to the tracking of multiple targets has been considered in Bar-Shalom (1978), who obtained approximations to Kalman filtering in terms of weighted averages of the innovations.
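The simple two-regime switching AR(1) described above is easy to simulate; the following is a minimal sketch in which the values of β_1, β_2, the transition probabilities, and the noise variance are illustrative only.

# Simulate y_t = phi_t y_{t-1} + w_t, where phi_t switches between beta[1]
# and beta[2] according to a two-state Markov chain with transition matrix P.
set.seed(1)
n = 500; beta = c(.9, -.6)                          # illustrative coefficients
P = matrix(c(.95, .05, .05, .95), 2, byrow=TRUE)    # P[i,j] = Pr(state j | state i)
S = numeric(n); S[1] = 1                            # regime indicator (1 or 2)
y = numeric(n)
for (t in 2:n){
  S[t] = sample(1:2, 1, prob = P[S[t-1], ])         # Markov regime evolution
  y[t] = beta[S[t]]*y[t-1] + rnorm(1)               # switching AR(1)
}
plot.ts(cbind(y, regime = S))                       # series and its regime path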

Example 6.14 Tracking Multiple Targets

The approach of Shumway and Stoffer (1991) was motivated primarily by the problem of tracking a large number of moving targets using a vector y_t of sensors. In this problem, we do not know at any given point in time which target any given sensor has detected. Hence, it is the structure of the measurement matrix A_t in (6.126) that is changing, and not the dynamics of the signal x_t or the noises, w_t or v_t. As an example, consider a 3 × 1 vector of satellite measurements y_t = (y_{t1}, y_{t2}, y_{t3})′ that are observations on some combination of a 3 × 1 vector of targets or signals, x_t = (x_{t1}, x_{t2}, x_{t3})′. For the measurement matrix

    A_t = | 0  1  0 |
          | 1  0  0 |
          | 0  0  1 |

for example, the first sensor, y_{t1}, observes the second target, x_{t2}; the second sensor, y_{t2}, observes the first target, x_{t1}; and the third sensor, y_{t3}, observes the third target, x_{t3}. All possible detection configurations will define a set of possible values for A_t, say, {M_1, M_2, . . . , M_m}, as a collection of plausible measurement matrices.

Example 6.15 Modeling Economic Change

As another example of the switching model presented in this section, consider the case in which the dynamics of the linear model changes suddenly over the history of a given realization. For example, Lam (1990) has given the following generalization of Hamilton's (1989) model for detecting positive and negative growth periods in the economy. Suppose the data are generated by

    y_t = z_t + n_t,                                         (6.127)


where z_t is an autoregressive series and n_t is a random walk with a drift that switches between two values α_0 and α_0 + α_1. That is,

    n_t = n_{t−1} + α_0 + α_1 S_t,                           (6.128)

with S_t = 0 or 1, depending on whether the system is in state 1 or state 2. For the purpose of illustration, suppose

    z_t = φ_1 z_{t−1} + φ_2 z_{t−2} + w_t                    (6.129)

is an AR(2) series with var(w_t) = σ_w². Lam (1990) wrote (6.127) in a differenced form

    ∇y_t = z_t − z_{t−1} + α_0 + α_1 S_t,                    (6.130)

which we may take as the observation equation (6.126) with state vector

    x_t = (z_t, z_{t−1}, α_0, α_1)′                          (6.131)

and

    M_1 = [1, −1, 1, 0]  and  M_2 = [1, −1, 1, 1]            (6.132)

determining the two possible economic conditions. The state equation, (6.125), is of the form

    | z_t     |   | φ_1  φ_2  0  0 | | z_{t−1} |   | w_t |
    | z_{t−1} | = |  1    0   0  0 | | z_{t−2} | + |  0  |
    | α_0     |   |  0    0   1  0 | | α_0     |   |  0  |
    | α_1     |   |  0    0   0  1 | | α_1     |   |  0  |     (6.133)

The observation equation, (6.130), can be written as

    ∇y_t = A_t x_t + v_t,                                    (6.134)

where we have included the possibility of observational noise, and where Pr(A_t = M_1) = 1 − Pr(A_t = M_2), with M_1 and M_2 given in (6.132).

To incorporate a reasonable switching structure for the measurement matrix into the DLM that is compatible with both practical situations previously described, we assume that the m possible configurations are states in a nonstationary, independent process defined by the time-varying probabilities

    π_j(t) = Pr(A_t = M_j),                                  (6.135)

for j = 1, . . . , m and t = 1, 2, . . . , n. Important information about the current state of the measurement process is given by the filtered probabilities of being in state j, defined as the conditional probabilities

    π_j(t|t) = Pr(A_t = M_j | Y_t),                          (6.136)


which also vary as a function of time. In (6.136), we have used the notation Y_s = {y_1, . . . , y_s}. The filtered probabilities (6.136) give the time-varying estimates of the probability of being in state j given the data to time t.

It will be important for us to obtain estimators of the configuration probabilities, π_j(t|t), the predicted and filtered state estimators, x_t^{t−1} and x_t^t, and the corresponding error covariance matrices P_t^{t−1} and P_t^t. Of course, the predictor and filter estimators will depend on the parameters, Θ, of the DLM. In many situations, the parameters will be unknown and we will have to estimate them. Our focus will be on maximum likelihood estimation, but other authors have taken a Bayesian approach that assigns priors to the parameters, and then seeks posterior distributions of the model parameters; see, for example, Gordon and Smith (1990), Peña and Guttman (1988), or McCulloch and Tsay (1993).

We now establish the recursions for the filters associated with the state x_t and the switching process, A_t. As discussed in §6.3, the filters are also an essential part of the maximum likelihood procedure. The predictors, x_t^{t−1} = E(x_t|Y_{t−1}), and filters, x_t^t = E(x_t|Y_t), and their associated error variance–covariance matrices, P_t^{t−1} and P_t^t, are given by

    x_t^{t−1} = Φ x_{t−1}^{t−1},                                         (6.137)
    P_t^{t−1} = Φ P_{t−1}^{t−1} Φ′ + Q,                                  (6.138)
    x_t^t = x_t^{t−1} + Σ_{j=1}^m π_j(t|t) K_{tj} ε_{tj},                (6.139)
    P_t^t = Σ_{j=1}^m π_j(t|t) (I − K_{tj} M_j) P_t^{t−1},               (6.140)
    K_{tj} = P_t^{t−1} M_j′ Σ_{tj}^{−1},                                 (6.141)

where the innovation values in (6.139) and (6.141) are

    ε_{tj} = y_t − M_j x_t^{t−1},                                        (6.142)
    Σ_{tj} = M_j P_t^{t−1} M_j′ + R,                                     (6.143)

for j = 1, . . . , m.

Equations (6.137)–(6.141) exhibit the filter values as weighted linear combinations of the m innovation values, (6.142)–(6.143), corresponding to each of the possible measurement matrices. The equations are similar to the approximations introduced by Bar-Shalom and Tse (1975), by Gordon and Smith (1990), and Peña and Guttman (1988).

To verify (6.139), let the indicator I(A_t = M_j) = 1 when A_t = M_j, and zero otherwise. Then, using (6.21),


    x_t^t = E(x_t|Y_t) = E[ E(x_t|Y_t, A_t) | Y_t ]
          = E{ Σ_{j=1}^m E(x_t|Y_t, A_t = M_j) I(A_t = M_j) | Y_t }
          = E{ Σ_{j=1}^m [ x_t^{t−1} + K_{tj}(y_t − M_j x_t^{t−1}) ] I(A_t = M_j) | Y_t }
          = Σ_{j=1}^m π_j(t|t) [ x_t^{t−1} + K_{tj}(y_t − M_j x_t^{t−1}) ],

where K_{tj} is given by (6.141). Equation (6.140) is derived in a similar fashion; the other relationships, (6.137), (6.138), and (6.141), follow from straightforward applications of the Kalman filter results given in Property 6.1.
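A minimal sketch of one step of the recursions (6.137)–(6.143) follows. It is only schematic: the objects xf, Pf, Phi, Q, R, the list of measurement matrices M, and the filtered probabilities pi.f are assumed to be carried over from the previous step, and the weights π_j(t|t) are treated as given here, since their update is derived next.

# One step of the switching filter, (6.137)-(6.143), for m regimes.
# xf, Pf: x_{t-1}^{t-1} and P_{t-1}^{t-1}; M: list of q x p matrices M_j;
# pi.f: the weights pi_j(t|t), computed as in (6.144)-(6.146).
switch.filter.step = function(y, xf, Pf, Phi, Q, R, M, pi.f){
  xp = Phi %*% xf                            # (6.137)
  Pp = Phi %*% Pf %*% t(Phi) + Q             # (6.138)
  p = nrow(Pp); xf.new = xp; Pf.new = matrix(0, p, p)
  for (j in seq_along(M)){
    Sig = M[[j]] %*% Pp %*% t(M[[j]]) + R    # (6.143)
    K   = Pp %*% t(M[[j]]) %*% solve(Sig)    # (6.141)
    eps = y - M[[j]] %*% xp                  # (6.142)
    xf.new = xf.new + pi.f[j] * K %*% eps                        # (6.139)
    Pf.new = Pf.new + pi.f[j] * (diag(p) - K %*% M[[j]]) %*% Pp  # (6.140)
  }
  list(xp=xp, Pp=Pp, xf=xf.new, Pf=Pf.new) }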

Next, we derive the filters π_j(t|t). Let f_j(t|t − 1) denote the conditional density of y_t given the past y_1, . . . , y_{t−1}, and A_t = M_j, for j = 1, . . . , m. Then,

    π_j(t|t) = π_j(t) f_j(t|t − 1) / Σ_{k=1}^m π_k(t) f_k(t|t − 1),      (6.144)

where we assume the distribution π_j(t), for j = 1, . . . , m, has been specified before observing y_1, . . . , y_t (details follow as in Example 6.16 below). If the investigator has no reason to prefer one state over another at time t, the choice of uniform priors, π_j(t) = m^{−1}, for j = 1, . . . , m, will suffice. Smoothness can be introduced by letting

    π_j(t) = Σ_{i=1}^m π_i(t − 1|t − 1) π_{ij},                          (6.145)

where the non-negative weights π_{ij} are chosen so Σ_{i=1}^m π_{ij} = 1. If the A_t process was Markov with transition probabilities π_{ij}, then (6.145) would be the update for the filter probability, as shown in the next example.

Example 6.16 Hidden Markov Chain Model

If {A_t} is a hidden Markov chain with stationary transition probabilities π_{ij} = Pr(A_t = M_j | A_{t−1} = M_i), for i, j = 1, . . . , m, letting p(·) denote a generic probability function, we have

    π_j(t|t) = p(A_t = M_j, y_t, Y_{t−1}) / p(y_t, Y_{t−1})
             = [ p(Y_{t−1}) p(A_t = M_j | Y_{t−1}) p(y_t | A_t = M_j, Y_{t−1}) ] / [ p(Y_{t−1}) p(y_t | Y_{t−1}) ]
             = π_j(t|t − 1) f_j(t|t − 1) / Σ_{k=1}^m π_k(t|t − 1) f_k(t|t − 1).    (6.146)

In the Markov case, the conditional probabilities

    π_j(t|t − 1) = Pr(A_t = M_j | Y_{t−1})

in (6.146) replace the unconditional probabilities, π_j(t) = Pr(A_t = M_j), in (6.144).

To evaluate (6.146), we must be able to calculate π_j(t|t − 1) and f_j(t|t − 1). We will discuss the calculation of f_j(t|t − 1) after this example. To derive π_j(t|t − 1), note,

    π_j(t|t − 1) = Pr(A_t = M_j | Y_{t−1})
                 = Σ_{i=1}^m Pr(A_t = M_j, A_{t−1} = M_i | Y_{t−1})
                 = Σ_{i=1}^m Pr(A_t = M_j | A_{t−1} = M_i) Pr(A_{t−1} = M_i | Y_{t−1})
                 = Σ_{i=1}^m π_{ij} π_i(t − 1|t − 1).                    (6.147)

Expression (6.145) comes from equation (6.147), where, as previously noted, we replace π_j(t|t − 1) by π_j(t).

The difficulty in extending the approach here to the Markov case is the dependence among the y_t, which makes it necessary to enumerate over all possible histories to derive the filtering equations. This problem will be evident when we derive the conditional density f_j(t|t − 1). Equation (6.145) has π_j(t) as a function of the past observations, Y_{t−1}, which is inconsistent with our model assumption. Nevertheless, this seems to be a reasonable compromise that allows the data to modify the probabilities π_j(t), without having to develop a highly computer-intensive technique.

As previously suggested, the computation of f_j(t|t − 1), without some approximations, is highly computer-intensive. To evaluate f_j(t|t − 1), consider the event

    A_1 = M_{j_1}, . . . , A_{t−1} = M_{j_{t−1}},                        (6.148)

for j_i = 1, . . . , m, and i = 1, . . . , t − 1, which specifies a specific set of measurement matrices through the past; we will write this event as A_{(t−1)} = M_{(ℓ)}. Because m^{t−1} possible outcomes exist for A_1, . . . , A_{t−1}, the index ℓ runs through ℓ = 1, . . . , m^{t−1}. Using this notation, we may write

    f_j(t|t − 1) = Σ_{ℓ=1}^{m^{t−1}} Pr{A_{(t−1)} = M_{(ℓ)} | Y_{t−1}} f(y_t | Y_{t−1}, A_t = M_j, A_{(t−1)} = M_{(ℓ)})
                 ≡ Σ_{ℓ=1}^{m^{t−1}} α(ℓ) N(y_t | µ_{tj}(ℓ), Σ_{tj}(ℓ)),    j = 1, . . . , m,    (6.149)


where the notation N(· | b, B) represents the normal density with mean vector b and variance–covariance matrix B. That is, f_j(t|t − 1) is a mixture of normals with non-negative weights α(ℓ) = Pr{A_{(t−1)} = M_{(ℓ)} | Y_{t−1}} such that Σ_ℓ α(ℓ) = 1, and with each normal distribution having mean vector

    µ_{tj}(ℓ) = M_j x_t^{t−1}(ℓ) = M_j E[x_t | Y_{t−1}, A_{(t−1)} = M_{(ℓ)}]    (6.150)

and covariance matrix

    Σ_{tj}(ℓ) = M_j P_t^{t−1}(ℓ) M_j′ + R.                               (6.151)

This result follows because the conditional distribution of y_t in (6.149) is identical to the fixed measurement matrix case presented in §4.2. The values in (6.150) and (6.151), and hence the densities, f_j(t|t − 1), for j = 1, . . . , m, can be obtained directly from the Kalman filter, Property 6.1, with the measurement matrices A_{(t−1)} fixed at M_{(ℓ)}.

Although f_j(t|t − 1) is given explicitly in (6.149), its evaluation is highly computer intensive. For example, with m = 2 states and n = 20 observations, we have to filter over 2 + 2² + · · · + 2^20 possible sample paths; note, 2^20 = 1,048,576. One remedy is to trim (remove), at each t, highly improbable sample paths; that is, remove events in (6.148) with extremely small probability of occurring, and then evaluate f_j(t|t − 1) as if the trimmed sample paths could not have occurred. Another alternative, as suggested by Gordon and Smith (1990) and Shumway and Stoffer (1991), is to approximate f_j(t|t − 1) using the closest (in the sense of Kullback–Leibler distance) normal distribution. In this case, the approximation leads to choosing the normal distribution with the same mean and variance associated with f_j(t|t − 1); that is, we approximate f_j(t|t − 1) by a normal with mean M_j x_t^{t−1} and variance Σ_{tj} given in (6.143).

To develop a procedure for maximum likelihood estimation, the joint density of the data is

    f(y_1, . . . , y_n) = Π_{t=1}^n f(y_t|Y_{t−1})
                        = Π_{t=1}^n Σ_{j=1}^m Pr(A_t = M_j|Y_{t−1}) f(y_t|A_t = M_j, Y_{t−1}),

and hence, the likelihood can be written as

    ln L_Y(Θ) = Σ_{t=1}^n ln [ Σ_{j=1}^m π_j(t) f_j(t|t − 1) ].          (6.152)

For the hidden Markov model, π_j(t) would be replaced by π_j(t|t − 1). In (6.152), we will use the normal approximation to f_j(t|t − 1). That is, henceforth, we will consider f_j(t|t − 1) as the normal, N(M_j x_t^{t−1}, Σ_{tj}), density, where x_t^{t−1} is given in (6.137) and Σ_{tj} is given in (6.143). We may consider maximizing (6.152) directly as a function of the parameters Θ = {µ_0, Φ, Q, R} using a Newton method, or we may consider applying the EM algorithm to the complete data likelihood.

To apply the EM algorithm as in §6.3, we call x_0, x_1, . . . , x_n, A_1, . . . , A_n, and y_1, . . . , y_n, the complete data, with likelihood given by

    −2 ln L_{X,A,Y}(Θ) = ln |Σ_0| + (x_0 − µ_0)′ Σ_0^{−1} (x_0 − µ_0)
                         + n ln |Q| + Σ_{t=1}^n (x_t − Φ x_{t−1})′ Q^{−1} (x_t − Φ x_{t−1})
                         − 2 Σ_{t=1}^n Σ_{j=1}^m I(A_t = M_j) ln π_j(t) + n ln |R|
                         + Σ_{t=1}^n Σ_{j=1}^m I(A_t = M_j) (y_t − A_t x_t)′ R^{−1} (y_t − A_t x_t).    (6.153)

As discussed in §6.3, we require the minimization of the conditional expectation

    Q(Θ | Θ^{(k−1)}) = E{ −2 ln L_{X,A,Y}(Θ) | Y_n, Θ^{(k−1)} },         (6.154)

with respect to Θ at each iteration, k = 1, 2, . . . . The calculation and maximization of (6.154) is similar to the case of (6.65). In particular, with

    π_j(t|n) = E[ I(A_t = M_j) | Y_n ],                                  (6.155)

we obtain on iteration k,

    π_j^{(k)}(t) = π_j(t|n),                                             (6.156)
    µ_0^{(k)} = x_0^n,                                                   (6.157)
    Φ^{(k)} = S_{10} S_{00}^{−1},                                        (6.158)
    Q^{(k)} = n^{−1} ( S_{11} − S_{10} S_{00}^{−1} S_{10}′ ),            (6.159)

and

    R^{(k)} = n^{−1} Σ_{t=1}^n Σ_{j=1}^m π_j(t|n) [ (y_t − M_j x_t^n)(y_t − M_j x_t^n)′ + M_j P_t^n M_j′ ],    (6.160)

where S_{11}, S_{10}, S_{00} are given in (6.67)-(6.69). As before, at iteration k, the filters and the smoothers are calculated using the current values of the parameters, Θ^{(k−1)}, and Σ_0 is held fixed. Filtering is accomplished by using (6.137)-(6.141). Smoothing is derived in a similar manner to the derivation of the filter, and one is led to the smoother given in Properties 6.2 and 6.3, with one exception, the initial smoother covariance, (6.55), is now


    P_{n,n−1}^n = Σ_{j=1}^m π_j(n|n) (I − K_{tj} M_j) Φ P_{n−1}^{n−1}.   (6.161)

Unfortunately, the computation of π_j(t|n) is excessively complicated, and requires integrating over mixtures of normal distributions. Shumway and Stoffer (1991) suggest approximating the smoother π_j(t|n) by the filter π_j(t|t), and find the approximation works well.

Example 6.17 Analysis of the Influenza Data

We use the results of this section to analyze the U.S. monthly pneumonia and influenza mortality data presented in §5.4, Figure 5.7. Letting y_t denote the mortality caused by pneumonia and influenza at month t, we model y_t in terms of a structural component model coupled with a hidden Markov process that determines whether a flu epidemic exists.

The model consists of three structural components. The first component, x_{t1}, is an AR(2) process chosen to represent the periodic (seasonal) component of the data,

    x_{t1} = α_1 x_{t−1,1} + α_2 x_{t−2,1} + w_{t1},         (6.162)

where w_{t1} is white noise, with var(w_{t1}) = σ_1². The second component, x_{t2}, is an AR(1) process with a nonzero constant term, which is chosen to represent the sharp rise in the data during an epidemic,

    x_{t2} = β_0 + β_1 x_{t−1,2} + w_{t2},                   (6.163)

where w_{t2} is white noise, with var(w_{t2}) = σ_2². The third component, x_{t3}, is a fixed trend component given by

    x_{t3} = x_{t−1,3} + w_{t3},                             (6.164)

where var(w_{t3}) = 0. The case in which var(w_{t3}) > 0, which corresponds to a stochastic trend (random walk), was tried here, but the estimation became unstable and led to us fitting a fixed, rather than stochastic, trend. Thus, in the final model, the trend component satisfies ∇x_{t3} = 0; recall in Example 5.6 the data were also differenced once before fitting the model.

Throughout the years, periods of normal influenza mortality (state 1) are modeled as

    y_t = x_{t1} + x_{t3} + v_t,                             (6.165)

where the measurement error, v_t, is white noise with var(v_t) = σ_v². When an epidemic occurs (state 2), mortality is modeled as

    y_t = x_{t1} + x_{t2} + x_{t3} + v_t.                    (6.166)

The model specified in (6.162)–(6.166) can be written in the general state-space form. The state equation is


    | x_{t1}    |   | α_1  α_2  0    0 | | x_{t−1,1} |   |  0  |   | w_{t1} |
    | x_{t−1,1} | = |  1    0   0    0 | | x_{t−2,1} | + |  0  | + |   0    |
    | x_{t2}    |   |  0    0   β_1  0 | | x_{t−1,2} |   | β_0 |   | w_{t2} |
    | x_{t3}    |   |  0    0   0    1 | | x_{t−1,3} |   |  0  |   |   0    |     (6.167)

Table 6.3. Estimation Results for Influenza Data

               Initial Model     Final Model
  Parameter      Estimates        Estimates
    α1         1.422 (.100)     1.406 (.079)
    α2         −.634 (.089)     −.622 (.069)
    β0          .276 (.056)      .210 (.025)
    β1         −.312 (.218)         —
    σ1          .023 (.003)      .023 (.005)
    σ2          .108 (.017)      .112 (.017)
    σv          .002 (.009)         —

  Estimated standard errors in parentheses

Of course, (6.167) can be written in the standard state-equation form as

    x_t = Φ x_{t−1} + Υ u_t + w_t,                           (6.168)

where x_t = (x_{t1}, x_{t−1,1}, x_{t2}, x_{t3})′, Υ = (0, 0, β_0, 0)′, u_t ≡ 1, and Q is a 4 × 4 matrix with σ_1² as the (1,1)-element, σ_2² as the (3,3)-element, and the remaining elements set equal to zero. The observation equation is

    y_t = A_t x_t + v_t,                                     (6.169)

where A_t is 1 × 4, and v_t is white noise with var(v_t) = R = σ_v². We assume all components of variance, w_{t1}, w_{t2}, and v_t, are uncorrelated.

As discussed in (6.165) and (6.166), A_t can take one of two possible forms

    A_t = M_1 = [1, 0, 0, 1]    no epidemic,
    A_t = M_2 = [1, 0, 1, 1]    epidemic,

corresponding to the two possible states of (1) no flu epidemic and (2) flu epidemic, such that Pr(A_t = M_1) = 1 − Pr(A_t = M_2). In this example, we will assume A_t is a hidden Markov chain, and hence we use the updating equations given in Example 6.16, (6.146) and (6.147), with transition probabilities π_{11} = π_{22} = .75 (and, thus, π_{12} = π_{21} = .25).

Parameter estimation was accomplished using a quasi-Newton–Raphson procedure to maximize the approximate log likelihood given in (6.152), with initial values of π_1(1|0) = π_2(1|0) = .5. Table 6.3 shows the results of the estimation procedure. On the initial fit, two estimates are not significant, namely, β_1 and σ_v. When σ_v² = 0, there is no measurement error, and the variability in data is explained solely by the variance components of the state system, namely, σ_1² and σ_2². The case in which β_1 = 0 corresponds to a simple level shift during a flu epidemic. In the final model, with β_1 and σ_v² removed, the estimated level shift (β_0) corresponds to an increase in mortality by about .2 per 1000 during a flu epidemic. The estimates for the final model are also listed in Table 6.3.

Fig. 6.10. (a) Influenza data, y_t, (line–points) and a prediction indicator (0 or 1) that an epidemic occurs in month t given the data up to month t − 1 (dashed line). (b) The three filtered structural components of influenza mortality: x_{t1}^t (cyclic trace), x_{t2}^t (spiked trace), and x_{t3}^t (negative linear trace). (c) One-month-ahead predictions shown as upper and lower limits y_t^{t−1} ± 2√P_t^{t−1} (dashed lines), of the number of pneumonia and influenza deaths, and y_t (points).

Figure 6.10(a) shows a plot of the data, y_t, for the ten-year period of 1969-1978 as well as an indicator that takes the value of 1 if the estimated approximate conditional probability exceeds .5, i.e., π_2(t|t − 1) > .5. The estimated prediction probabilities do a reasonable job of predicting a flu epidemic, although the peak in 1972 is missed.


Figure 6.10(b) shows the estimated filtered values (that is, filtering is done using the parameter estimates) of the three components of the model, x_{t1}^t, x_{t2}^t, and x_{t3}^t. Except for initial instability (which is not shown), x_{t1}^t represents the seasonal (cyclic) aspect of the data, x_{t2}^t represents the spikes during a flu epidemic, and x_{t3}^t represents the slow decline in flu mortality over the ten-year period of 1969-1978.

One-month-ahead prediction, say, y_t^{t−1}, is obtained as

    y_t^{t−1} = M_1 x_t^{t−1}    if π_1(t|t − 1) > π_2(t|t − 1),
    y_t^{t−1} = M_2 x_t^{t−1}    if π_1(t|t − 1) ≤ π_2(t|t − 1).

Of course, x_t^{t−1} is the estimated state prediction, obtained via the filter presented in (6.137)-(6.141) (with the addition of the constant term in the model) using the estimated parameters. The results are shown in Figure 6.10(c). The precision of the forecasts can be measured by the innovation variances, Σ_{t1} when no epidemic is predicted, and Σ_{t2} when an epidemic is predicted. These values become stable quickly, and when no epidemic is predicted, the estimated standard error of the prediction is approximately .02 (this is the square root of Σ_{t1} for t large); when a flu epidemic is predicted, the estimated standard error of the prediction is approximately .11.

The results of this analysis are impressive given the small number of parameters and the degree of approximation that was made to obtain a computationally simple method for fitting a complex model. In particular, as seen in Figure 6.10(a), the model is never fooled as to when a flu epidemic will occur. This result is particularly impressive, given that, for example, in 1971, it appeared as though an epidemic was about to begin, but it never was realized, and the model predicted no flu epidemic that year. As seen in Figure 6.10(c), the predicted mortality tends to be underestimated during the peaks, but the true values are typically within one standard error of the predicted value. Further evidence of the strength of this technique can be found in the example given in Shumway and Stoffer (1991).

The R code for the final model estimation is as follows.

y = as.matrix(flu); num = length(y); nstate = 4;
M1 = as.matrix(cbind(1,0,0,1))   # obs matrix normal
M2 = as.matrix(cbind(1,0,1,1))   # obs matrix flu epi
prob = matrix(0,num,1); yp = y   # to store pi2(t|t-1) & y(t|t-1)
xfilter = array(0, dim=c(nstate,1,num))   # to store x(t|t)
# Function to Calculate Likelihood
Linn = function(para){
  alpha1=para[1]; alpha2=para[2]; beta0=para[3]
  sQ1=para[4]; sQ2=para[5]; like=0
  xf=matrix(0, nstate, 1)   # x filter
  xp=matrix(0, nstate, 1)   # x pred
  Pf=diag(.1, nstate)       # filter cov
  Pp=diag(.1, nstate)       # pred cov
  pi11 <- .75 -> pi22; pi12 <- .25 -> pi21; pif1 <- .5 -> pif2
  phi=matrix(0,nstate,nstate)
  phi[1,1]=alpha1; phi[1,2]=alpha2; phi[2,1]=1; phi[4,4]=1
  Ups = as.matrix(rbind(0,0,beta0,0))
  Q = matrix(0,nstate,nstate)
  Q[1,1]=sQ1^2; Q[3,3]=sQ2^2; R=0   # R=0 in final model
  # begin filtering #
  for(i in 1:num){
    xp = phi%*%xf + Ups; Pp = phi%*%Pf%*%t(phi) + Q
    sig1 = as.numeric(M1%*%Pp%*%t(M1) + R)
    sig2 = as.numeric(M2%*%Pp%*%t(M2) + R)
    k1 = Pp%*%t(M1)/sig1; k2 = Pp%*%t(M2)/sig2
    e1 = y[i]-M1%*%xp; e2 = y[i]-M2%*%xp
    pip1 = pif1*pi11 + pif2*pi21; pip2 = pif1*pi12 + pif2*pi22;
    den1 = (1/sqrt(sig1))*exp(-.5*e1^2/sig1);
    den2 = (1/sqrt(sig2))*exp(-.5*e2^2/sig2);
    denom = pip1*den1 + pip2*den2;
    pif1 = pip1*den1/denom; pif2 = pip2*den2/denom;
    pif1=as.numeric(pif1); pif2=as.numeric(pif2)
    e1=as.numeric(e1); e2=as.numeric(e2)
    xf = xp + pif1*k1*e1 + pif2*k2*e2
    eye = diag(1, nstate)
    Pf = pif1*(eye-k1%*%M1)%*%Pp + pif2*(eye-k2%*%M2)%*%Pp
    like = like - log(pip1*den1 + pip2*den2)
    prob[i]<<-pip2; xfilter[,,i]<<-xf; innov.sig<<-c(sig1,sig2)
    yp[i]<<-ifelse(pip1 > pip2, M1%*%xp, M2%*%xp) }
  return(like) }
# Estimation
alpha1=1.4; alpha2=-.5; beta0=.3; sQ1=.1; sQ2=.1
init.par = c(alpha1, alpha2, beta0, sQ1, sQ2)
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
     control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimate=est$par, SE)
rownames(u)=c("alpha1","alpha2","beta0","sQ1","sQ2"); u
             estimate          SE
  alpha1   1.40570967 0.078587727
  alpha2  -0.62198715 0.068733109
  beta0    0.21049042 0.024625302
  sQ1      0.02310306 0.001635291
  sQ2      0.11217287 0.016684663
# Graphics
predepi = ifelse(prob<.5,0,1); k = 6:length(y)
Time = time(flu)[k]
par(mfrow=c(3,1), mar=c(2,3,1,1)+.1, cex=.9)
plot(Time, y[k], type="o", ylim=c(0,1), ylab="")
lines(Time, predepi[k], lty="dashed", lwd=1.2)
text(1979,.95,"(a)")
plot(Time, xfilter[1,,k], type="l", ylim=c(-.1,.4), ylab="")
lines(Time, xfilter[3,,k]); lines(Time, xfilter[4,,k])
text(1979,.35,"(b)")
plot(Time, y[k], type="p", pch=1, ylim=c(.1,.9), ylab="")
prde1 = 2*sqrt(innov.sig[1]); prde2 = 2*sqrt(innov.sig[2])
prde = ifelse(predepi[k]<.5, prde1, prde2)
lines(Time, yp[k]+prde, lty=2, lwd=1.5)
lines(Time, yp[k]-prde, lty=2, lwd=1.5)
text(1979,.85,"(c)")

6.9 Stochastic Volatility

Recently, there has been considerable interest in stochastic volatility models. These models are similar to the ARCH models presented in Chapter 5, but they add a stochastic noise term to the equation for σ_t. Recall from §5.4 that a GARCH(1, 1) model for a return, which we denote here by r_t, is given by

    r_t = σ_t ε_t,                                           (6.170)
    σ_t² = α_0 + α_1 r_{t−1}² + β_1 σ_{t−1}²,                (6.171)

where ε_t is Gaussian white noise. If we define

    h_t = log σ_t²   and   y_t = log r_t²,

then (6.170) can be written as

    y_t = h_t + log ε_t².                                    (6.172)

Equation (6.172) is considered the observation equation, and the stochastic variance h_t is considered to be an unobserved state process. Instead of (6.171), however, the model assumes the volatility process follows, in its basic form, an autoregression,

    h_t = φ_0 + φ_1 h_{t−1} + w_t,                           (6.173)

where w_t is white Gaussian noise with variance σ_w².

Together, (6.172) and (6.173) make up the stochastic volatility model due to Taylor (1982). If ε_t² had a log-normal distribution, (6.172)–(6.173) would form a Gaussian state-space model, and we could then use standard DLM results to fit the model to data. Unfortunately, y_t = log r_t² is rarely normal, so we typically keep the ARCH normality assumption on ε_t; in which case, log ε_t² is distributed as the log of a chi-squared random variable with one degree of freedom. This density is given by

    f(x) = (1/√(2π)) exp{ −(1/2)(eˣ − x) },    −∞ < x < ∞,   (6.174)

and its mean and variance are −1.27 and π²/2, respectively; the density (6.174) is highly skewed with a long tail on the left (see Figure 6.12).
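These facts are easy to check by simulation; the following is a minimal sketch (the sample size is arbitrary).

# Simulate log(chi-square_1) variates, i.e., log(eps^2) with eps ~ N(0,1),
# and compare with the stated mean (-1.27) and variance (pi^2/2, about 4.93).
set.seed(1)
x = log(rnorm(1e6)^2)
c(mean(x), var(x), pi^2/2)                           # approximately -1.27, 4.93, 4.93
hist(x, breaks=100, prob=TRUE, main="log chi-square(1)")   # long left tail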


Various approaches to the fitting of stochastic volatility models have been examined; these methods include a wide range of assumptions on the observational noise process. A good summary of the proposed techniques, both Bayesian (via MCMC) and non-Bayesian approaches (such as quasi-maximum likelihood estimation and the EM algorithm), can be found in Jacquier et al. (1994), and Shephard (1996). Simulation methods for classical inference applied to stochastic volatility models are discussed in Danielson (1994) and Sandmann and Koopman (1998).

Kim, Shephard and Chib (1998) proposed modeling the log of a chi-squared random variable by a mixture of seven normals to approximate the first four moments of the observational error distribution; the mixture is fixed and no additional model parameters are added by using this technique. The basic model assumption that ε_t is Gaussian is unrealistic for most applications. In an effort to keep matters simple but more general (in that we allow the observational error dynamics to depend on parameters that will be fitted), our method of fitting stochastic volatility models is to retain the Gaussian state equation (6.173), but to write the observation equation, with y_t = log r_t², as

    y_t = α + h_t + η_t,                                     (6.175)

where η_t is white noise, whose distribution is a mixture of two normals, one centered at zero. In particular, we write

    η_t = I_t z_{t0} + (1 − I_t) z_{t1},                     (6.176)

where I_t is an iid Bernoulli process, Pr{I_t = 0} = π_0, Pr{I_t = 1} = π_1 (π_0 + π_1 = 1), z_{t0} ∼ iid N(0, σ_0²), and z_{t1} ∼ iid N(µ_1, σ_1²).
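As a minimal illustration of (6.176), the mixture noise can be simulated as follows; the parameter values used are arbitrary, chosen only to be roughly in the range of the fits reported later in this section.

# Draw eta_t from the two-component normal mixture (6.176);
# the values of pi1, sig0, mu1, and sig1 are illustrative only.
rmix = function(n, pi1=.5, sig0=1, mu1=-3, sig1=2){
  I = rbinom(n, 1, pi1)                       # Pr{I_t = 1} = pi1
  I*rnorm(n, 0, sig0) + (1-I)*rnorm(n, mu1, sig1) }
hist(rmix(10000), breaks=60, prob=TRUE, main="mixture noise eta_t")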

The advantage to this model is that it is easy to fit because it uses normality. In fact, the model equations (6.173) and (6.175)-(6.176) are similar to those presented in Peña and Guttman (1988), who used the idea to obtain a robust Kalman filter, and, as previously mentioned, in Kim et al. (1998). The material presented in §6.8 applies here, and in particular, the filtering equations for this model are

    h_{t+1}^t = φ_0 + φ_1 h_t^{t−1} + Σ_{j=0}^1 π_{tj} K_{tj} ε_{tj},        (6.177)
    P_{t+1}^t = φ_1² P_t^{t−1} + σ_w² − Σ_{j=0}^1 π_{tj} K_{tj}² Σ_{tj},      (6.178)
    ε_{t0} = y_t − α − h_t^{t−1},    ε_{t1} = y_t − α − h_t^{t−1} − µ_1,      (6.179)
    Σ_{t0} = P_t^{t−1} + σ_0²,    Σ_{t1} = P_t^{t−1} + σ_1²,                  (6.180)
    K_{t0} = φ_1 P_t^{t−1} / Σ_{t0},    K_{t1} = φ_1 P_t^{t−1} / Σ_{t1}.      (6.181)

To complete the filtering, we must be able to assess the probabilities π_{t1} = Pr(I_t = 1 | y_1, . . . , y_t), for t = 1, . . . , n; of course, π_{t0} = 1 − π_{t1}. Let f_j(t | t − 1) denote the conditional density of y_t given the past y_1, . . . , y_{t−1}, and I_t = j (j = 0, 1). Then,

    π_{t1} = π_1 f_1(t | t − 1) / [ π_0 f_0(t | t − 1) + π_1 f_1(t | t − 1) ],    (6.182)

where we assume the distribution π_j, for j = 0, 1, has been specified a priori. If the investigator has no reason to prefer one state over another, the choice of uniform priors, π_1 = 1/2, will suffice. Unfortunately, it is computationally difficult to obtain the exact values of f_j(t | t − 1); although we can give an explicit expression of f_j(t | t − 1), the actual computation of the conditional density is prohibitive. A viable approximation, however, is to choose f_j(t | t − 1) to be the normal density, N(h_t^{t−1} + µ_j, Σ_{tj}), for j = 0, 1 and µ_0 = 0; see §6.8 for details.

The innovations filter given in (6.177)–(6.182) can be derived from the Kalman filter by a simple conditioning argument; e.g., to derive (6.177), write

    E(h_{t+1} | y_1, . . . , y_t) = Σ_{j=0}^1 E(h_{t+1} | y_1, . . . , y_t, I_t = j) Pr(I_t = j | y_1, . . . , y_t)
                                  = Σ_{j=0}^1 (φ_0 + φ_1 h_t^{t−1} + K_{tj} ε_{tj}) π_{tj}
                                  = φ_0 + φ_1 h_t^{t−1} + Σ_{j=0}^1 π_{tj} K_{tj} ε_{tj}.

Estimation of the parameters, Θ = (φ_0, φ_1, σ_0², µ_1, σ_1², σ_w²)′, is accomplished via MLE based on the likelihood given by

    ln L_Y(Θ) = Σ_{t=1}^n ln [ Σ_{j=0}^1 π_j f_j(t | t − 1) ],           (6.183)

where the density f_j(t | t − 1) is approximated by the normal density, N(h_t^{t−1} + µ_j, σ_j²), previously mentioned. We may consider maximizing (6.183) directly as a function of the parameters Θ using a Newton method, or we may consider applying the EM algorithm to the complete data likelihood.

Example 6.18 Analysis of the New York Stock Exchange Returns

The top of Figure 6.11 shows the log of the squares of returns, y_t = log r_t², for 200 of the 2000 daily observations of the NYSE previously displayed in Figure 1.4. Model (6.173) and (6.175)–(6.176), with π_1 fixed at .5, was fit to the data using a quasi-Newton–Raphson method to maximize (6.183). The results are given in Table 6.4. Figure 6.12 compares the density of the log of a χ_1² with the fitted normal mixture; we note the data indicate a substantial amount of probability in the upper tail that the log-χ_1² distribution misses.

Fig. 6.11. Two hundred observations of y_t = log r_t², for 801 ≤ t ≤ 1000, where r_t is the daily return of the NYSE (top); the crash of October 19, 1987 occurs at t = 938. Corresponding one-step-ahead predicted log volatility, log σ_t², with ±2 standard prediction errors (bottom).

Finally, the bottom of Figure 6.11 shows y_t for 800 ≤ t ≤ 1000, which includes the crash of October 19, 1987, with y_t^{t−1} = α + h_t^{t−1} superimposed on the graph; compare with Figure 5.6. Also displayed are error bounds.

The R code when φ_0 is included in the model is as follows.

y = log(nyse^2)
num = length(y)
# Initial Parameters
phi0=0; phi1=.95; sQ=.2; alpha=mean(y); sR0=1; mu1=-3; sR1=2
init.par = c(phi0, phi1, sQ, alpha, sR0, mu1, sR1)
# Innovations Likelihood
Linn = function(para){
  phi0=para[1]; phi1=para[2]; sQ=para[3]; alpha=para[4]
  sR0=para[5]; mu1=para[6]; sR1=para[7]
  sv = SVfilter(num,y,phi0,phi1,sQ,alpha,sR0,mu1,sR1)
  return(sv$like) }
# Estimation
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
     control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = cbind(estimates=est$par, SE)
rownames(u)=c("phi0","phi1","sQ","alpha","sigv0","mu1","sigv1"); u


Table 6.4. Estimation Results for the NYSE Fit

                           Estimated
  Parameter   Estimate   Standard Error
    φ0         −.006          .016
    φ1          .988          .007
    σw          .091          .027
    α         −9.613         1.269
    σ0         1.220          .065
    µ1        −2.292          .205
    σ1         2.683          .105

# Graphics (need filters at the estimated parameters)
phi0=est$par[1]; phi1=est$par[2]; sQ=est$par[3]; alpha=est$par[4]
sR0=est$par[5]; mu1=est$par[6]; sR1=est$par[7]
sv = SVfilter(num,y,phi0,phi1,sQ,alpha,sR0,mu1,sR1)
# densities plot (f is chi-sq, fm is fitted mixture)
x = seq(-15,6,by=.01)
f = exp(-.5*(exp(x)-x))/(sqrt(2*pi))
f0 = exp(-.5*(x^2)/sR0^2)/(sR0*sqrt(2*pi))
f1 = exp(-.5*(x-mu1)^2/sR1^2)/(sR1*sqrt(2*pi))
fm = (f0+f1)/2
plot(x, f, type="l"); lines(x, fm, lty=2, lwd=2)
dev.new(); par(mfrow=c(2,1)); Time=801:1000
plot(Time, y[Time], type="l", main="log(Squared NYSE Returns)")
plot(Time, sv$xp[Time], type="l", main="Predicted log-Volatility",
     ylim=c(-1.5,1.8), ylab="", xlab="")
lines(Time, sv$xp[Time]+2*sqrt(sv$Pp[Time]), lty="dashed")
lines(Time, sv$xp[Time]-2*sqrt(sv$Pp[Time]), lty="dashed")

It is possible to use the bootstrap procedure described in §6.7 for the stochastic volatility model, with some minor changes. The following procedure was described in Stoffer and Wall (2004). We develop a vector first-order equation, as was done in (6.124). First, using (6.179), and noting that y_t = π_{t0} y_t + π_{t1} y_t, we may write

    y_t = α + h_t^{t−1} + π_{t0} ε_{t0} + π_{t1}(ε_{t1} + µ_1).          (6.184)

Consider the standardized innovations

    e_{tj} = Σ_{tj}^{−1/2} ε_{tj},    j = 0, 1,                          (6.185)

and define the 2 × 1 vector

    e_t = (e_{t0}, e_{t1})′.

Also, define the 2 × 1 vector

    ξ_t = (h_{t+1}^t, y_t)′.

Fig. 6.12. Density of the log of a χ_1² as given by (6.174) (solid line) and the fitted normal mixture (dashed line) from the NYSE example.

Combining (6.177) and (6.184) results in a vector first-order equation for ξ_t given by

    ξ_t = F ξ_{t−1} + G_t + H_t e_t,                                     (6.186)

where

    F = | φ_1  0 |,    G_t = |      φ_0       |,    H_t = | π_{t0} K_{t0} Σ_{t0}^{1/2}   π_{t1} K_{t1} Σ_{t1}^{1/2} |.
        |  1   0 |           | α + π_{t1} µ_1 |           |    π_{t0} Σ_{t0}^{1/2}          π_{t1} Σ_{t1}^{1/2}    |

Hence, the steps in bootstrapping for this case are the same as steps 1 through 5 described in §5.8, but with (6.124) replaced by the following first-order equation:

    ξ*_t = F(Θ̂) ξ*_{t−1} + G_t(Θ̂; π̂_{t1}) + H_t(Θ̂; π̂_{t1}) e*_t,          (6.187)

where Θ̂ = (φ_0, φ_1, σ_0², α, µ_1, σ_1², σ_w²)′ is the MLE of Θ, and π̂_{t1} is estimated via (6.182), replacing f_1(t | t − 1) and f_0(t | t − 1) by their respective estimated normal densities (π̂_{t0} = 1 − π̂_{t1}).

Example 6.19 Analysis of the U.S. GNP Growth Rate

In Example 5.4, we fit an ARCH model to the U.S. GNP growth rate. In this example, we will fit a stochastic volatility model to the residuals from the MA(2) fit on the growth rate (see Example 3.38). Figure 6.13 shows the log of the squared residuals, say $y_t$, from the MA(2) fit on the U.S. GNP series. The stochastic volatility model (6.172)–(6.176) was then fit to $y_t$.

Fig. 6.13. Log of the squared residuals from an MA(2) fit on GNP growth rate.

Table 6.5 shows the MLEs of the model parameters along with their asymptotic SEs assuming the model is correct. Also displayed in Table 6.5 are the means and SEs of B = 500 bootstrapped samples. There is some amount of agreement between the asymptotic values and the bootstrapped values. The interest here, however, is not so much in the SEs, but in the actual sampling distribution of the estimates. For example, Figure 6.14 compares the bootstrap histogram and asymptotic normal distribution of φ1. In this case, the bootstrap distribution exhibits positive kurtosis and skewness which is missed by the assumption of asymptotic normality.

The R code for this example is as follows. We held φ0 at 0 for this analysis because it was not significantly different from 0 in an initial analysis.

n.boot = 500                       # number of bootstrap replicates
tol = sqrt(.Machine$double.eps)    # convergence tolerance
gnpgr = diff(log(gnp))
fit = arima(gnpgr, order=c(1,0,0))
y = as.matrix(log(resid(fit)^2))
num = length(y)
plot.ts(y, ylab="")
# Initial Parameters
phi1 = .9; sQ = .5; alpha = mean(y); sR0 = 1; mu1 = -3; sR1 = 2.5
init.par = c(phi1, sQ, alpha, sR0, mu1, sR1)
# Innovations Likelihood
Linn = function(para){
  phi1 = para[1]; sQ = para[2]; alpha = para[3]
  sR0 = para[4]; mu1 = para[5]; sR1 = para[6]
  sv = SVfilter(num, y, 0, phi1, sQ, alpha, sR0, mu1, sR1)
  return(sv$like) }

Fig. 6.14. Bootstrap histogram and asymptotic distribution of φ1 for the U.S. GNP example.

Table 6.5. Estimates and Standard Errors for GNP Example

Parameter    MLE    Asymptotic SE   Bootstrap SE†
φ1           .879        .106            .074
σw           .388        .217            .428
α          −9.662        .339           1.536
σ0           .833        .203            .389
µ1         −2.341        .495            .437
σ1          2.452        .293            .330

† Based on 500 bootstrapped samples.

# Estimation
(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE,
       control=list(trace=1,REPORT=1)))
SE = sqrt(diag(solve(est$hessian)))
u = rbind(estimates=est$par, SE)
colnames(u) = c("phi1","sQ","alpha","sig0","mu1","sig1"); u
             phi1     sQ    alpha   sig0     mu1   sig1
  estimates 0.8790 0.3878 -9.6624 0.8325 -2.3412 2.4516
  SE        0.1061 0.2172  0.3386 0.2034  0.4952 0.2927
# Bootstrap
para.star = matrix(0, n.boot, 6)   # to store parameter estimates

Linn2 = function(para){            # calculate likelihood from the bootstrapped data
  phi1 = para[1]; sQ = para[2]; alpha = para[3]
  sR0 = para[4]; mu1 = para[5]; sR1 = para[6]
  sv = SVfilter(num, y.star, 0, phi1, sQ, alpha, sR0, mu1, sR1)
  return(sv$like) }
for (jb in 1:n.boot){
  cat("iteration:", jb, "\n")
  phi1 = est$par[1]; sQ = est$par[2]; alpha = est$par[3]
  sR0 = est$par[4]; mu1 = est$par[5]; sR1 = est$par[6]
  Q = sQ^2; R0 = sR0^2; R1 = sR1^2
  sv = SVfilter(num, y, 0, phi1, sQ, alpha, sR0, mu1, sR1)
  sig0 = sv$Pp+R0; sig1 = sv$Pp+R1
  K0 = sv$Pp/sig0; K1 = sv$Pp/sig1
  inn0 = y-sv$xp-alpha; inn1 = y-sv$xp-mu1-alpha
  den1 = (1/sqrt(sig1))*exp(-.5*inn1^2/sig1)
  den0 = (1/sqrt(sig0))*exp(-.5*inn0^2/sig0)
  fpi1 = den1/(den0+den1)
  # start resampling at t=4
  e0 = inn0/sqrt(sig0); e1 = inn1/sqrt(sig1)
  indx = sample(4:num, replace=TRUE)
  sinn = cbind(c(e0[1:3], e0[indx]), c(e1[1:3], e1[indx]))
  eF = matrix(c(phi1, 1, 0, 0), 2, 2)
  xi = cbind(sv$xp, y)             # initialize
  for (i in 4:num){                # generate boot sample
    G = matrix(c(0, alpha+fpi1[i]*mu1), 2, 1)
    h21 = (1-fpi1[i])*sqrt(sig0[i]); h11 = h21*K0[i]
    h22 = fpi1[i]*sqrt(sig1[i]);     h12 = h22*K1[i]
    H = matrix(c(h11,h21,h12,h22), 2, 2)
    xi[i,] = t(eF%*%as.matrix(xi[i-1,],2) + G + H%*%as.matrix(sinn[i,],2))}
  # Estimates from boot data
  y.star = xi[,2]
  phi1=.9; sQ=.5; alpha=mean(y.star); sR0=1; mu1=-3; sR1=2.5
  init.par = c(phi1, sQ, alpha, sR0, mu1, sR1)      # same as for data
  est.star = optim(init.par, Linn2, NULL, method="BFGS",
                   control=list(reltol=tol))
  para.star[jb,] = cbind(est.star$par[1], abs(est.star$par[2]),
    est.star$par[3], abs(est.star$par[4]), est.star$par[5],
    abs(est.star$par[6])) }
# Some summary statistics and graphics
rmse = rep(NA, 6)                  # SEs from the bootstrap
for (i in 1:6){
  rmse[i] = sqrt(sum((para.star[,i]-est$par[i])^2)/n.boot)
  cat(i, rmse[i], "\n") }
dev.new(); phi = para.star[,1]
hist(phi, 15, prob=TRUE, main="", xlim=c(.4,1.2), xlab="")
u = seq(.4, 1.2, by=.01)
lines(u, dnorm(u, mean=.8790267, sd=.1061884), lty="dashed", lwd=2)

6.10 Nonlinear and Non-normal State-Space Models Using Monte Carlo Methods

Most of this chapter has focused on linear dynamic models assumed to be Gaussian processes. Historically, these models were convenient because analyzing the data was a relatively simple matter. These assumptions cannot cover every situation, and it is advantageous to explore departures from these assumptions. As seen in §6.8, the solution to the nonlinear and non-Gaussian case will require computer-intensive techniques currently in vogue because of the availability of cheap and fast computers. In this section, we take a Bayesian approach to forecasting as our main objective; see West and Harrison (1997) for a detailed account of Bayesian forecasting with dynamic models. Prior to the mid-1980s, a number of approximation methods were developed to filter non-normal or nonlinear processes in an attempt to circumvent the computational complexity of the analysis of such models. For example, the extended Kalman filter and the Gaussian sum filter (Alspach and Sorensen, 1972) are two such methods described in detail in Anderson and Moore (1979). As in the previous section, these techniques typically rely on approximating the non-normal distribution by one or several Gaussian distributions or by some other parametric function.

With the advent of cheap and fast computing, a number of authors developed computer-intensive methods based on numerical integration. For example, Kitagawa (1987) proposed a numerical method based on piecewise linear approximations to the density functions for prediction, filtering, and smoothing for non-Gaussian and nonstationary state-space models. Pole and West (1988) used Gaussian quadrature techniques in a Bayesian analysis of nonlinear dynamic models; West and Harrison (1997, Chapter 13) provide a detailed explanation of these and similar methods. Markov chain Monte Carlo (MCMC) methods refer to Monte Carlo integration methods that use a Markovian updating scheme. We will describe the method in more detail later. The most common MCMC method is the Gibbs sampler, which is essentially a modification of the Metropolis algorithm (Metropolis et al., 1953) developed by Hastings (1970) in the statistical setting and by Geman and Geman (1984) in the context of image restoration. Later, Tanner and Wong (1987) used the ideas in their substitution sampling approach, and Gelfand and Smith (1990) developed the Gibbs sampler for a wide class of parametric models. This technique was first used by Carlin et al. (1992) in the context of general nonlinear and non-Gaussian state-space models. Fruhwirth-Schnatter (1994) and Carter and Kohn (1994) built on these ideas to develop efficient Gibbs sampling schemes for more restrictive models.

If the model is linear, that is, (6.1) and (6.2) hold, but the distributions are not Gaussian, a non-Gaussian likelihood can be defined by (6.31) in §6.2, but where $f_0(\cdot)$, $f_w(\cdot)$ and $f_v(\cdot)$ are not normal densities. In this case, prediction and filtering can be accomplished using numerical integration techniques (e.g., Kitagawa, 1987; Pole and West, 1988) or Monte Carlo techniques (e.g., Fruhwirth-Schnatter, 1994; Carter and Kohn, 1994) to evaluate (6.32) and (6.33). Of course, the prediction and filter densities $p_\Theta(x_t \mid Y_{t-1})$ and $p_\Theta(x_t \mid Y_t)$ will no longer be Gaussian and will not generally be of the location-scale form as in the Gaussian case. A rich class of non-normal densities is given in (6.198).

In general, the state-space model can be given by the following equations:
$$x_t = F_t(x_{t-1}, w_t) \quad\text{and}\quad y_t = H_t(x_t, v_t), \qquad (6.188)$$
where $F_t$ and $H_t$ are known functions that may depend on parameters $\Theta$ and $w_t$ and $v_t$ are white noise processes. The main component of the model retained by (6.188) is that the states are Markov, and the observations are conditionally independent, but we do not necessarily assume $F_t$ and $H_t$ are linear, or $w_t$ and $v_t$ are Gaussian. Of course, if $F_t(x_{t-1}, w_t) = \Phi_t x_{t-1} + w_t$ and $H_t(x_t, v_t) = A_t x_t + v_t$ and $w_t$ and $v_t$ are Gaussian, we have the standard DLM (exogenous variables can be added to the model in the usual way). In the general model, (6.188), the complete data likelihood is given by
$$L_{X,Y}(\Theta) = p_\Theta(x_0)\prod_{t=1}^{n} p_\Theta(x_t \mid x_{t-1})\,p_\Theta(y_t \mid x_t), \qquad (6.189)$$
and the prediction and filter densities, as given by (6.32) and (6.33) in §6.2, still hold. Because our focus is on simulation using MCMC methods, we first describe the technique in a general context.

Example 6.20 MCMC Techniques and the Gibbs Sampler

The goal of a Monte Carlo technique is to simulate a pseudo-random sample of vectors from a desired density function $p_\Theta(z)$. In Markov chain Monte Carlo, we simulate an ordered sequence of pseudo-random vectors, $z_0 \mapsto z_1 \mapsto z_2 \mapsto \cdots$, by specifying a starting value, $z_0$, and then sampling successive values from a transition density $\pi(z_t \mid z_{t-1})$, for $t = 1, 2, \ldots$. In this way, conditional on $z_{t-1}$, the $t$-th pseudo-random vector, $z_t$, is simulated independent of its predecessors. This technique alone does not yield a pseudo-random sample because contiguous draws are dependent on each other (that is, we obtain a first-order dependent sequence of pseudo-random vectors). If done appropriately, the dependence between the pseudo-variates $z_t$ and $z_{t+m}$ decays exponentially in $m$, and we may regard the collection $\{z_{t+\ell m};\ \ell = 1, 2, \ldots\}$, for $t$ and $m$ suitably large, as a pseudo-random sample. Alternately, one may repeat the process in parallel, retaining the $m$-th value, on run $g = 1, 2, \ldots$, say, $z_m^{(g)}$, for large $m$. Under general conditions, the Markov chain converges in the sense that, eventually, the sequence of pseudo-variates appears stationary and the individual $z_t$ are marginally distributed according to the stationary "target" density $p_\Theta(z)$. Technical details may be found in Tierney (1994).

For Gibbs sampling, suppose we have a collection $\{z_1, \ldots, z_k\}$ of random vectors with complete conditional densities denoted generically by
$$p_\Theta(z_j \mid z_i,\ i \neq j) \equiv p_\Theta(z_j \mid z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_k),$$
for $j = 1, \ldots, k$, available for sampling. Here, available means pseudo-samples may be generated by some method given the values of the appropriate conditioning random vectors. Under mild conditions, these complete conditionals uniquely determine the full joint density $p_\Theta(z_1, \ldots, z_k)$ and, consequently, all marginals, $p_\Theta(z_j)$ for $j = 1, \ldots, k$; details may be found in Besag (1974). The Gibbs sampler generates pseudo-samples from the joint distribution as follows.

(i) Start with an arbitrary set of starting values, say, $\{z_{1[0]}, \ldots, z_{k[0]}\}$.
(ii) Draw $z_{1[1]}$ from $p_\Theta(z_1 \mid z_{2[0]}, \ldots, z_{k[0]})$;
(iii) Draw $z_{2[1]}$ from $p_\Theta(z_2 \mid z_{1[1]}, z_{3[0]}, \ldots, z_{k[0]})$;
(iv) Repeat until step $k$, which draws $z_{k[1]}$ from $p_\Theta(z_k \mid z_{1[1]}, \ldots, z_{k-1[1]})$.
(v) Repeat steps (i)–(iv) $\ell$ times, obtaining a collection $\{z_{1[\ell]}, \ldots, z_{k[\ell]}\}$.

Geman and Geman (1984) showed that under mild conditions, $\{z_{1[\ell]}, \ldots, z_{k[\ell]}\}$ converges in distribution to a random observation from $p_\Theta(z_1, \ldots, z_k)$ as $\ell \to \infty$. For this reason, we typically drop the subscript $[\ell]$ from the notation, assuming $\ell$ is sufficiently large for the generated sample to be thought of as a realization from the joint density; hence, we denote this first realization as $\{z_{1[\ell]}^{(1)}, \ldots, z_{k[\ell]}^{(1)}\} \equiv \{z_1^{(1)}, \ldots, z_k^{(1)}\}$. This entire process is replicated in parallel, a large number, $G$, of times providing pseudo-random iid collections $\{z_1^{(g)}, \ldots, z_k^{(g)}\}$, for $g = 1, \ldots, G$, from the joint distribution. These simulated values can then be used to estimate the marginal densities. In particular, if $p_\Theta(z_j \mid z_i, i \neq j)$ is available in closed form, then²
$$\hat p_\Theta(z_j) = G^{-1}\sum_{g=1}^{G} p_\Theta\bigl(z_j \mid z_i^{(g)},\ i \neq j\bigr). \qquad (6.190)$$
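To make steps (i)–(v) and the estimate (6.190) concrete, the following is a minimal R sketch of a two-component Gibbs sampler for a toy target, a bivariate normal with correlation ρ, for which both complete conditionals are univariate normal. The target, the value of ρ, and the run lengths are illustrative assumptions only and are not part of the state-space examples that follow.

# Toy Gibbs sampler: the target is a bivariate normal with correlation rho, so the
# complete conditionals are z1|z2 ~ N(rho*z2, 1-rho^2) and z2|z1 ~ N(rho*z1, 1-rho^2).
set.seed(1)
rho = .8; ell = 50; G = 500          # cycles per run and number of parallel runs
z.draws = matrix(NA, G, 2)
for (g in 1:G){
  z1 = 0; z2 = 0                     # arbitrary starting values, step (i)
  for (l in 1:ell){                  # steps (ii)-(v): cycle through the conditionals
    z1 = rnorm(1, rho*z2, sqrt(1-rho^2))
    z2 = rnorm(1, rho*z1, sqrt(1-rho^2))
  }
  z.draws[g,] = c(z1, z2)            # retain the final values of run g
}
colMeans(z.draws); cor(z.draws)      # should be near (0, 0) and rho

Retaining only the final cycle of each of the G parallel runs mirrors the parallel scheme described above; in practice, a single long chain is also commonly used.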

Because of the relatively recent appearance of Gibbs sampling methodology, several important theoretical and practical issues are under investigation. These issues include the diagnosis of convergence, modification of the sampling order, efficient estimation, and sequential sampling schemes (as opposed to the parallel processing described above), to mention a few. At this time, the best advice can be obtained from the texts by Gelman et al. (1995) and Gilks et al. (1996), and we are certain that many more will follow.

² Approximation (6.190) is based on the fact that, for random vectors $x$ and $y$ with joint density $p(x, y)$, the marginal density of $x$ is obtained by integrating $y$ out of the joint density, i.e., $p(x) = \int p(x, y)\,dy = \int p(x \mid y)p(y)\,dy$.

Finally, it may be necessary to nest rejection sampling within the Gibbs sampling procedure. The need for rejection sampling arises when we want to sample from a density, say, $f(z)$, but $f(z)$ is known only up to a proportionality constant, say, $p(z) \propto f(z)$. If a density $g(z)$ is available, and there is a constant $c$ for which $p(z) \le c\,g(z)$ for all $z$, the rejection algorithm generates pseudo-variates from $f(z)$ by generating a value, $z^*$, from $g(z)$ and accepting it as a value from $f(z)$ with probability $\pi(z^*) = p(z^*)/[c\,g(z^*)]$. This algorithm can be quite inefficient if $\pi(\cdot)$ is close to zero; in such cases, more sophisticated envelope functions may be needed. Further discussion of these matters in the case of nonlinear state-space models can be found in Carlin et al. (1992, Examples 1.2 and 3.2).
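As a concrete illustration of this accept–reject step, the sketch below draws from a density known only up to a constant, using a dominating density g and constant c. The particular target (an unnormalized half-normal) and the envelope are assumptions chosen only to keep the example self-contained.

# Rejection sampling sketch: p(z) = exp(-z^2/2) for z > 0 (an unnormalized half-normal),
# dominated by c*g(z), with g the standard normal density and c = sqrt(2*pi).
p = function(z) ifelse(z > 0, exp(-z^2/2), 0)
c.const = sqrt(2*pi)                       # ensures p(z) <= c.const*dnorm(z) for all z
draw.one = function(){
  repeat{
    z.star = rnorm(1)                      # generate a value from g
    if (runif(1) < p(z.star)/(c.const*dnorm(z.star))) return(z.star)  # accept w.p. pi(z*)
  }
}
z.samp = replicate(2000, draw.one())
mean(z.samp)                               # should be near sqrt(2/pi), about .80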

In Example 6.20, the generic random vectors $z_j$ can represent parameter values, such as components of $\Theta$, state values $x_t$, or future observations $y_{n+m}$, for $m \ge 1$. This will become evident in the following examples. Before discussing the general case of nonlinear and non-normal state-space models, we briefly introduce MCMC methods for the Gaussian DLM, as presented in Fruhwirth-Schnatter (1994) and Carter and Kohn (1994).

Example 6.21 Parameter Assessment for the Gaussian DLM

Consider a Gaussian DLM given by
$$x_t = \Phi_t x_{t-1} + w_t \quad\text{and}\quad y_t = a_t' x_t + v_t. \qquad (6.191)$$
The observations are univariate, and the state process is $p$-dimensional; this DLM includes the structural models presented in §6.5. The prior on the initial state is $x_0 \sim N(\mu_0, \Sigma_0)$, and we assume that $w_t \sim \text{iid } N(0, Q_t)$, independent of $v_t \sim \text{iid } N(0, r_t)$. The collection of unknown model parameters will be denoted by $\Theta$.

To explore how we would assess the values of $\Theta$ using an MCMC technique, we focus on the problem of obtaining the posterior distribution, $p(\Theta \mid Y_n)$, of the parameters given the data, $Y_n = \{y_1, \ldots, y_n\}$, and a prior $\pi(\Theta)$. Of course, these distributions depend on "hyperparameters" that are assumed to be known. (Some authors consider the states $x_t$ as the first level of parameters because they are unobserved. In this case, the values in $\Theta$ are regarded as the hyperparameters, and the parameters of their distributions are regarded as hyper-hyperparameters.) Denoting the entire set of state vectors as $X_n = \{x_0, x_1, \ldots, x_n\}$, the posterior can be written as
$$p(\Theta \mid Y_n) = \int p(\Theta \mid X_n, Y_n)\,p(X_n, \Theta^* \mid Y_n)\,dX_n\,d\Theta^*. \qquad (6.192)$$

Although the posterior, $p(\Theta \mid Y_n)$, may be intractable, conditioning on the states can make the problem manageable in that
$$p(\Theta \mid X_n, Y_n) \propto \pi(\Theta)\,p(x_0 \mid \Theta)\prod_{t=1}^{n} p(x_t \mid x_{t-1}, \Theta)\,p(y_t \mid x_t, \Theta) \qquad (6.193)$$
can be easier to work with (either as members of conjugate families or using some rejection scheme); we will discuss this in more detail when we present the nonlinear, non-Gaussian case, but we will assume for the present $p(\Theta \mid X_n, Y_n)$ is in closed form.

Suppose we can obtain $G$ pseudo-random draws, $X_n^{(g)} \equiv (X_n, \Theta^*)^{(g)}$, for $g = 1, \ldots, G$, from the joint posterior density $p(X_n, \Theta^* \mid Y_n)$. Then (6.192) can be approximated by
$$\hat p(\Theta \mid Y_n) = G^{-1}\sum_{g=1}^{G} p(\Theta \mid X_n^{(g)}, Y_n).$$

A sample from $p(X_n, \Theta^* \mid Y_n)$ is obtained using two different MCMC methods. First, the Gibbs sampler is used, for each $g$, as follows: sample $X_{n[\ell]}$ given $\Theta^*_{[\ell-1]}$ from $p(X_n \mid \Theta^*_{[\ell-1]}, Y_n)$, and then sample $\Theta^*_{[\ell]}$ from $p(\Theta \mid X_{n[\ell]}, Y_n)$ as given by (6.193), for $\ell = 1, 2, \ldots$. Stop when $\ell$ is sufficiently large, and retain the final values as $X_n^{(g)}$. This process is repeated $G$ times.

The first step of this method requires simultaneous generation of the state vectors. Because we are dealing with a Gaussian linear model, we can rely on the existing theory of the Kalman filter to accomplish this step. This step is conditional on $\Theta$, and we assume at this point that $\Theta$ is fixed and known. In other words, our goal is to sample the entire set of state vectors, $X_n = \{x_0, x_1, \ldots, x_n\}$, from the multivariate normal posterior density $p_\Theta(X_n \mid Y_n)$, where $Y_n = \{y_1, \ldots, y_n\}$ represents the observations. Because of the Markov structure, we can write
$$p_\Theta(X_n \mid Y_n) = p_\Theta(x_n \mid Y_n)\,p_\Theta(x_{n-1} \mid x_n, Y_{n-1})\cdots p_\Theta(x_0 \mid x_1). \qquad (6.194)$$

In view of (6.194), it is possible to sample the entire set of state vectors, $X_n$, by sequentially simulating the individual states backward. This process yields a simulation method that Fruhwirth-Schnatter (1994) called the forward-filtering, backward-sampling algorithm. In particular, because the processes are Gaussian, we need only obtain the conditional means and variances, say, $m_t = E_\Theta(x_t \mid Y_t, x_{t+1})$ and $V_t = \mathrm{var}_\Theta(x_t \mid Y_t, x_{t+1})$. This conditioning argument is akin to having $x_{t+1}$ as an additional observation on state $x_t$. In particular, using standard multivariate normal distribution theory,
$$m_t = x_t^t + J_t(x_{t+1} - x_{t+1}^t),$$
$$V_t = P_t^t - J_t P_{t+1}^t J_t', \qquad (6.195)$$
for $t = n-1, n-2, \ldots, 0$, where $J_t$ is defined in (6.49). To verify (6.195), the essential part of the Gaussian density (that is, the exponent) of $x_t \mid Y_t, x_{t+1}$ is
$$(x_{t+1} - \Phi_{t+1}x_t)'[Q_{t+1}]^{-1}(x_{t+1} - \Phi_{t+1}x_t) + (x_t - x_t^t)'[P_t^t]^{-1}(x_t - x_t^t),$$
and we simply complete the square; see Fruhwirth-Schnatter (1994) or West and Harrison (1997, §4.7). Hence, the algorithm is to first sample $x_n$ from a $N(x_n^n, P_n^n)$, where $x_n^n$ and $P_n^n$ are obtained from the Kalman filter, Property 6.1, and then sample $x_t$ from a $N(m_t, V_t)$, for $t = n-1, n-2, \ldots, 0$, where the conditioning value of $x_{t+1}$ is the value previously sampled; $m_t$ and $V_t$ are given in (6.195).
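The algorithm is simple to code in the univariate case. Below is a minimal sketch of forward-filtering, backward-sampling for the scalar DLM x_t = φ x_{t-1} + w_t, y_t = x_t + v_t, with Θ held fixed; the parameter values and simulated data are assumptions made only so the sketch runs on its own, and the filter quantities correspond to Property 6.1 with J_t as in (6.49).

# Forward-filtering, backward-sampling sketch for a univariate DLM with Theta known.
set.seed(90210)
n = 100; phi = .9; sQ = 1; sR = 1                 # assumed values for illustration
x = as.numeric(arima.sim(list(ar=phi), n=n, sd=sQ))
y = x + rnorm(n, 0, sR)
xp = Pp = xf = Pf = numeric(n)                    # predictors x_t^{t-1} and filters x_t^t
xprev = 0; Pprev = sQ^2/(1-phi^2)                 # x_0^0 and P_0^0
for (t in 1:n){                                   # forward pass (Kalman filter)
  xp[t] = phi*xprev;  Pp[t] = phi^2*Pprev + sQ^2
  K     = Pp[t]/(Pp[t] + sR^2)
  xf[t] = xp[t] + K*(y[t]-xp[t]);  Pf[t] = (1-K)*Pp[t]
  xprev = xf[t];  Pprev = Pf[t]
}
xs = numeric(n)                                   # backward pass: sample x_n, ..., x_1
xs[n] = rnorm(1, xf[n], sqrt(Pf[n]))
for (t in (n-1):1){
  J = Pf[t]*phi/Pp[t+1]                           # J_t, cf. (6.49)
  m = xf[t] + J*(xs[t+1] - xp[t+1])               # m_t in (6.195)
  V = Pf[t] - J^2*Pp[t+1]                         # V_t in (6.195)
  xs[t] = rnorm(1, m, sqrt(V))
}

One such pass produces a single draw of X_n from p_Θ(X_n | Y_n); repeating it within a Gibbs cycle, alternating with draws of Θ, gives the sampler described above.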

Next, we address an MCMC approach to nonlinear and non-Gaussian state-space modeling that was first presented in Carlin et al. (1992). We consider the general model given in (6.188), but with additive errors:
$$x_t = F_t(x_{t-1}) + w_t \quad\text{and}\quad y_t = H_t(x_t) + v_t, \qquad (6.196)$$
where $F_t$ and $H_t$ are given, but may also depend on unknown parameters, say, $\Phi_t$ and $A_t$, respectively, the collection of which will be denoted by $\Theta$. The errors are independent white noise sequences with $\mathrm{var}(w_t) = Q_t$ and $\mathrm{var}(v_t) = R_t$. Although time-varying variance–covariance matrices are easily incorporated in this framework, to ease the discussion we focus on the case $Q_t \equiv Q$ and $R_t \equiv R$. Also, although it is not necessary, we assume the initial state condition $x_0$ is fixed and known; this is merely for notational convenience, so we do not have to carry along the additional terms involving $x_0$ throughout the discussion.

In general, the likelihood specification for the model is given by
$$L_{X,Y}(\Theta, Q, R) = \prod_{t=1}^{n} f_1(x_t \mid x_{t-1}, \Theta, Q)\,f_2(y_t \mid x_t, \Theta, R), \qquad (6.197)$$
where it is assumed the densities $f_1(\cdot)$ and $f_2(\cdot)$ are scale mixtures of normals. Specifically, for $t = 1, \ldots, n$,
$$f_1(x_t \mid x_{t-1}, \Theta, Q) = \int f(x_t \mid x_{t-1}, \Theta, Q, \lambda_t)\,p_1(\lambda_t)\,d\lambda_t,$$
$$f_2(y_t \mid x_t, \Theta, R) = \int f(y_t \mid x_t, \Theta, R, \omega_t)\,p_2(\omega_t)\,d\omega_t, \qquad (6.198)$$
where, conditional on the independent sequences of nuisance parameters $\lambda = (\lambda_t;\ t = 1, \ldots, n)$ and $\omega = (\omega_t;\ t = 1, \ldots, n)$,
$$x_t \mid x_{t-1}, \Theta, Q, \lambda_t \sim N\bigl(F_t(x_{t-1}; \Theta),\ \lambda_t Q\bigr),$$
$$y_t \mid x_t, \Theta, R, \omega_t \sim N\bigl(H_t(x_t; \Theta),\ \omega_t R\bigr). \qquad (6.199)$$

By varying $p_1(\lambda_t)$ and $p_2(\omega_t)$, we can have a wide variety of non-Gaussian error densities. These densities include, for example, double exponential, logistic, and $t$ distributions in the univariate case and a rich class of multivariate distributions; this is discussed further in Carlin et al. (1992). The key to the approach is the introduction of the nuisance parameters $\lambda$ and $\omega$ and the structure (6.199), which lends itself naturally to the Gibbs sampler and allows for the analysis of this general nonlinear and non-Gaussian problem.
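For instance, the t-distribution arises by taking $p_1(\lambda_t)$ so that $\nu/\lambda_t \sim \chi_\nu^2$, as is done in Example 6.23 below. The following brief R sketch checks this fact by simulation; the values of ν and the scale are illustrative assumptions.

# Scale mixture of normals: if nu/lambda ~ chi^2_nu and w | lambda ~ N(0, lambda*sigw^2),
# then marginally w is a scaled t with nu degrees of freedom.
set.seed(1)
nu = 10; sigw = 2; n = 1e5
lambda = nu/rchisq(n, df=nu)            # draw the mixing parameters
w = rnorm(n, 0, sqrt(lambda)*sigw)      # conditionally Gaussian draws
w.t = sigw*rt(n, df=nu)                 # direct draws from the scaled t for comparison
qqplot(w, w.t); abline(0, 1)            # the points should fall near the line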

According to Example 6.20, to implement the Gibbs sampler, we must be able to sample from the following complete conditional distributions:

(i) $x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n$, for $t = 1, \ldots, n$,
(ii) $\lambda_t \mid \lambda_{s\neq t}, \omega, \Theta, Q, R, Y_n, X_n \sim \lambda_t \mid \Theta, Q, x_t, x_{t-1}$, for $t = 1, \ldots, n$,
(iii) $\omega_t \mid \omega_{s\neq t}, \lambda, \Theta, Q, R, Y_n, X_n \sim \omega_t \mid \Theta, R, y_t, x_t$, for $t = 1, \ldots, n$,
(iv) $Q \mid \lambda, \omega, \Theta, R, Y_n, X_n \sim Q \mid \lambda, Y_n, X_n$,
(v) $R \mid \lambda, \omega, \Theta, Q, Y_n, X_n \sim R \mid \omega, Y_n, X_n$,
(vi) $\Theta \mid \lambda, \omega, Q, R, Y_n, X_n \sim \Theta \mid Y_n, X_n$,

where $X_n = \{x_1, \ldots, x_n\}$ and $Y_n = \{y_1, \ldots, y_n\}$. The main difference between this method and the linear Gaussian case is that, because of the generality, we sample the states one-at-a-time rather than simultaneously generating all of them. As discussed in Carter and Kohn (1994), if possible, it is more efficient to generate the states simultaneously as in Example 6.21.

We will discuss items (i) and (ii) above. The third item follows in a similar manner to the second, and items (iv)–(vi) will follow from standard multivariate normal distribution theory and from Wishart distribution theory because of the conditioning on $\lambda$ and $\omega$. We will discuss this matter further in the next example. First, consider the linear model, $F_t(x_{t-1}) = \Phi_t x_{t-1}$ and $H_t(x_t) = A_t x_t$, in (6.196). In this case, for $t = 1, \ldots, n$, $x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n$ has a $p$-dimensional $N_p(B_t b_t, B_t)$ distribution, with
$$B_t^{-1} = \frac{Q^{-1}}{\lambda_t} + \frac{A_t'R^{-1}A_t}{\omega_t} + \frac{\Phi_{t+1}'Q^{-1}\Phi_{t+1}}{\lambda_{t+1}},$$
$$b_t = \frac{x_{t-1}\Phi_t'Q^{-1}}{\lambda_t} + \frac{y_t R^{-1}A_t}{\omega_t} + \frac{x_{t+1}Q^{-1}\Phi_{t+1}}{\lambda_{t+1}}, \qquad (6.200)$$
where, when $t = n$ in (6.200), terms in the sum with elements having a subscript of $n+1$ are dropped (this is assumed to be the case in what follows, although we do not explicitly state it). This result follows by noting the essential part of the multivariate normal distribution (that is, the exponent) of $x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n$ is
$$(x_t - \Phi_t x_{t-1})'(\lambda_t Q)^{-1}(x_t - \Phi_t x_{t-1}) + (y_t - A_t x_t)'(\omega_t R)^{-1}(y_t - A_t x_t) + (x_{t+1} - \Phi_{t+1}x_t)'(\lambda_{t+1}Q)^{-1}(x_{t+1} - \Phi_{t+1}x_t), \qquad (6.201)$$

which upon manipulation yields (6.200).

Example 6.22 Nonlinear Models

In the case of nonlinear models, we can use (6.200) with slight modifications. For example, consider the case in which $F_t$ is nonlinear, but $H_t$ is linear, so the observations are $y_t = A_t x_t + v_t$. Then,
$$x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_1(x_t)\,N_p(B_{1t}b_{1t}, B_{1t}), \qquad (6.202)$$
where
$$B_{1t}^{-1} = \frac{Q^{-1}}{\lambda_t} + \frac{A_t'R^{-1}A_t}{\omega_t}, \qquad b_{1t} = \frac{F_t'(x_{t-1})Q^{-1}}{\lambda_t} + \frac{y_t R^{-1}A_t}{\omega_t},$$
and
$$\eta_1(x_t) = \exp\Bigl\{-\frac{1}{2\lambda_{t+1}}\bigl(x_{t+1} - F_{t+1}(x_t)\bigr)'Q^{-1}\bigl(x_{t+1} - F_{t+1}(x_t)\bigr)\Bigr\}.$$
Because $0 \le \eta_1(x_t) \le 1$, for all $x_t$, the distribution we want to sample from is dominated by the $N_p(B_{1t}b_{1t}, B_{1t})$ density. Hence, we may use rejection sampling as discussed in Example 6.20 to obtain an observation from the required density. That is, we generate a pseudo-variate from the $N_p(B_{1t}b_{1t}, B_{1t})$ density and accept it with probability $\eta_1(x_t)$.

We proceed analogously in the case in which $F_t(x_{t-1}) = \Phi_t x_{t-1}$ is linear and $H_t(x_t)$ is nonlinear. In this case,
$$x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_2(x_t)\,N_p(B_{2t}b_{2t}, B_{2t}), \qquad (6.203)$$
where
$$B_{2t}^{-1} = \frac{Q^{-1}}{\lambda_t} + \frac{\Phi_{t+1}'Q^{-1}\Phi_{t+1}}{\lambda_{t+1}}, \qquad b_{2t} = \frac{x_{t-1}\Phi_t'Q^{-1}}{\lambda_t} + \frac{x_{t+1}Q^{-1}\Phi_{t+1}}{\lambda_{t+1}},$$
and
$$\eta_2(x_t) = \exp\Bigl\{-\frac{1}{2\omega_t}\bigl(y_t - H_t(x_t)\bigr)'R^{-1}\bigl(y_t - H_t(x_t)\bigr)\Bigr\}.$$
Here, we generate a pseudo-variate from the $N_p(B_{2t}b_{2t}, B_{2t})$ density and accept it with probability $\eta_2(x_t)$.

Finally, in the case in which both $F_t$ and $H_t$ are nonlinear, we have
$$x_t \mid x_{s\neq t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_1(x_t)\eta_2(x_t)\,N_p\bigl(F_t(x_{t-1}),\ \lambda_t Q\bigr), \qquad (6.204)$$
so we sample from a $N_p(F_t(x_{t-1}), \lambda_t Q)$ density and accept it with probability $\eta_1(x_t)\eta_2(x_t)$.
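In the univariate case, the draw in (6.204) amounts to only a few lines of R. The sketch below generates a single state value given its neighbors, the current observation, and the current nuisance parameters; the functions Ft and Ht and all numerical values are placeholders used only for illustration and are not tied to a particular example.

# Rejection draw of x_t from (6.204), univariate case: propose from
# N(Ft(x_{t-1}), lambda_t*Q) and accept with probability eta1(x_t)*eta2(x_t).
Ft = function(x) .5*x + 25*x/(1+x^2)      # placeholder state function
Ht = function(x) x^2/20                   # placeholder observation function
draw.state = function(x.prev, x.next, y.cur, lam.cur, lam.next, om.cur, Q, R){
  repeat{
    x.star = rnorm(1, Ft(x.prev), sqrt(lam.cur*Q))
    eta1 = exp(-(x.next - Ft(x.star))^2/(2*lam.next*Q))   # eta1(x*), cf. (6.202)
    eta2 = exp(-(y.cur - Ht(x.star))^2/(2*om.cur*R))      # eta2(x*), cf. (6.203)
    if (runif(1) < eta1*eta2) return(x.star)
  }
}
# one draw given placeholder conditioning values:
draw.state(x.prev=1, x.next=2, y.cur=.3, lam.cur=1, lam.next=1, om.cur=1, Q=10, R=1)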

Determination of (ii), $\lambda_t \mid \Theta, Q, x_t, x_{t-1}$, follows directly from Bayes theorem; that is, $p(\lambda_t \mid \Theta, Q, x_t, x_{t-1}) \propto p_1(\lambda_t)\,p(x_t \mid \lambda_t, x_{t-1}, \Theta, Q)$. By (6.198), however, we know the normalization constant is given by $f_1(x_t \mid x_{t-1}, \Theta, Q)$, and thus the complete conditional density for $\lambda_t$ is of a known functional form.

Many examples of these techniques are given in Carlin et al. (1992), including the problem of model choice. In the next example, we consider a univariate nonlinear model in which the state noise process has a t-distribution.

As noted in Meinhold and Singpurwalla (1989), using t-distributions for the error processes is a way of robustifying the Kalman filter against outliers. In this example we present a brief discussion of a detailed analysis presented in Carlin et al. (1992, Example 4.2); readers interested in more detail may find it in that article.

Example 6.23 A Nonlinear, Non-Gaussian State-Space Model

Kitagawa (1987) considered the analysis of data generated from the following univariate nonlinear model:
$$x_t = F_t(x_{t-1}) + w_t \quad\text{and}\quad y_t = H_t(x_t) + v_t, \quad t = 1, \ldots, 100, \qquad (6.205)$$
with
$$F_t(x_{t-1}) = \alpha x_{t-1} + \beta x_{t-1}/(1 + x_{t-1}^2) + \gamma\cos[1.2(t-1)], \qquad H_t(x_t) = x_t^2/20, \qquad (6.206)$$
where $x_0 = 0$, $w_t$ are independent random variables having a central t-distribution with $\nu = 10$ degrees of freedom and scaled so $\mathrm{var}(w_t) = \sigma_w^2 = 10$ [we denote this generically by $t(0, \sigma, \nu)$], and $v_t$ is white standard Gaussian noise, $\mathrm{var}(v_t) = \sigma_v^2 = 1$. The state noise and observation noise are mutually independent. Kitagawa (1987) discussed the analysis of data generated from this model with $\alpha = .5$, $\beta = 25$, and $\gamma = 8$ assumed known. We will use these values of the parameters in this example, but we will assume they are unknown. Figure 6.15 shows a typical data sequence $y_t$ and the corresponding state process $x_t$.

Our goal here will be to obtain an estimate of the prediction density $p(x_{101} \mid Y_{100})$. To accomplish this, we use $n = 101$ and consider $y_{101}$ as a latent variable (we will discuss this in more detail shortly). The priors on the variance components are chosen from a conjugate family, that is, $\sigma_w^2 \sim \mathrm{IG}(a_0, b_0)$ independent of $\sigma_v^2 \sim \mathrm{IG}(c_0, d_0)$, where IG denotes the inverse (reciprocal) gamma distribution [$z$ has an inverse gamma distribution if $1/z$ has a gamma distribution; general properties can be found, for example, in Box and Tiao (1973, Section 8.5)]. Then,
$$\sigma_w^2 \mid \lambda, Y_n, X_n \sim \mathrm{IG}\Bigl(a_0 + \frac{n}{2},\ \Bigl\{\frac{1}{b_0} + \frac{1}{2}\sum_{t=1}^{n}[x_t - F(x_{t-1})]^2/\lambda_t\Bigr\}^{-1}\Bigr),$$
$$\sigma_v^2 \mid \omega, Y_n, X_n \sim \mathrm{IG}\Bigl(c_0 + \frac{n}{2},\ \Bigl\{\frac{1}{d_0} + \frac{1}{2}\sum_{t=1}^{n}[y_t - H(x_t)]^2/\omega_t\Bigr\}^{-1}\Bigr). \qquad (6.207)$$

Next, letting $\nu/\lambda_t \sim \chi_\nu^2$, we get that, marginally, $w_t \mid \sigma_w \sim t(0, \sigma_w, \nu)$, as required, leading to the complete conditional $\lambda_t \mid \sigma_w, \alpha, \beta, \gamma, Y_n, X_n$, for $t = 1, \ldots, n$, being distributed as

$$\mathrm{IG}\Bigl(\frac{\nu+1}{2},\ 2\Bigl\{\frac{[x_t - F(x_{t-1})]^2}{\sigma_w^2} + \nu\Bigr\}^{-1}\Bigr). \qquad (6.208)$$

Fig. 6.15. The state process, $x_t$ (solid line), and the observations, $y_t$ (dashed line), for $t = 1, \ldots, 100$, generated from the model (6.205).
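In R, the inverse gamma draws in (6.207) and (6.208) can be obtained from rgamma by drawing the reciprocal, where IG(a, b) is used here in the sense that 1/z has a gamma distribution with shape a and scale b. The sketch below performs one update of the λ_t and of σ_w²; the current state values and the hyperparameters are placeholders chosen only for illustration.

# One Gibbs update of lambda_1,...,lambda_n and sigma_w^2 via (6.208) and (6.207).
Ft = function(x, t) .5*x + 25*x/(1+x^2) + 8*cos(1.2*(t-1))  # F_t of (6.206) at alpha=.5, beta=25, gamma=8
set.seed(1)
n = 100; nu = 10; a0 = 3; b0 = .05
x = rnorm(n+1)                                      # placeholder current draw of x_0,...,x_n
sigw2 = 10                                          # placeholder current value of sigma_w^2
res2 = (x[-1] - Ft(x[-(n+1)], 1:n))^2               # [x_t - F_t(x_{t-1})]^2, t = 1,...,n
lambda = 1/rgamma(n, shape=(nu+1)/2, scale=2/(res2/sigw2 + nu))            # (6.208)
sigw2 = 1/rgamma(1, shape=a0 + n/2, scale=1/(1/b0 + .5*sum(res2/lambda)))  # (6.207)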

We take $\omega_t \equiv 1$ for $t = 1, \ldots, n$, because the observation noise is Gaussian. For the states, $x_t$, we take a normal prior on the initial state, $x_0 \sim N(\mu_0, \sigma_0^2)$, and then we use rejection sampling to conditionally generate a state value $x_t$, for $t = 1, \ldots, n$, as described in Example 6.22, equation (6.204). In this case, $\eta_1(x_t)$ and $\eta_2(x_t)$ are given in (6.202) and (6.203), respectively, with $F_t$ and $H_t$ given by (6.206), $\Theta = (\alpha, \beta, \gamma)'$, $Q = \sigma_w^2$ and $R = \sigma_v^2$. Endpoints take some special consideration; we generate $x_0$ from a $N(\mu_0, \sigma_0^2)$ and accept it with probability $\eta_1(x_0)$, and we generate $x_{101}$ as usual and accept it with probability $\eta_2(x_{101})$. The last complete conditional depends on $y_{101}$, a latent data value not observed but instead generated according to its complete conditional, which is $N(x_{101}^2/20, \sigma_v^2)$, because $\omega_{101} = 1$.

The prior on $\Theta = (\alpha, \beta, \gamma)'$ is taken to be trivariate normal with mean $(\mu_\alpha, \mu_\beta, \mu_\gamma)'$ and diagonal variance–covariance matrix $\mathrm{diag}\{\sigma_\alpha^2, \sigma_\beta^2, \sigma_\gamma^2\}$. The necessary conditionals can be found using standard normal theory, as done in (6.200). For example, the complete conditional distribution of $\alpha$ is of the form $N(Bb, B)$, where
$$B^{-1} = \frac{1}{\sigma_\alpha^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{x_{t-1}^2}{\lambda_t},$$
$$b = \frac{\mu_\alpha}{\sigma_\alpha^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{x_{t-1}}{\lambda_t}\Bigl(x_t - \beta\frac{x_{t-1}}{1+x_{t-1}^2} - \gamma\cos[1.2(t-1)]\Bigr).$$

Fig. 6.16. Estimated one-step-ahead prediction posterior density $p(x_{101} \mid Y_{100})$ of the state process for the nonlinear and non-normal model given by (6.205) using Gibbs sampling, $G = 500$.

The complete conditional for $\beta$ has the same form, with
$$B^{-1} = \frac{1}{\sigma_\beta^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{x_{t-1}^2}{\lambda_t(1+x_{t-1}^2)^2},$$
$$b = \frac{\mu_\beta}{\sigma_\beta^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{x_{t-1}}{\lambda_t(1+x_{t-1}^2)}\bigl(x_t - \alpha x_{t-1} - \gamma\cos[1.2(t-1)]\bigr),$$
and for $\gamma$ the values are
$$B^{-1} = \frac{1}{\sigma_\gamma^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{\cos^2[1.2(t-1)]}{\lambda_t},$$
$$b = \frac{\mu_\gamma}{\sigma_\gamma^2} + \frac{1}{\sigma_w^2}\sum_{t=1}^{n}\frac{\cos[1.2(t-1)]}{\lambda_t}\Bigl(x_t - \alpha x_{t-1} - \beta\frac{x_{t-1}}{1+x_{t-1}^2}\Bigr).$$

In this example, we put $\mu_0 = 0$, $\sigma_0^2 = 10$, and $a_0 = 3$, $b_0 = .05$ (so the prior on $\sigma_w^2$ has mean and standard deviation equal to 10), and $c_0 = 3$, $d_0 = .5$ (so the prior on $\sigma_v^2$ has mean and standard deviation equal to one). The normal prior on $\Theta = (\alpha, \beta, \gamma)'$ had corresponding mean vector equal to $(\mu_\alpha = .5,\ \mu_\beta = 25,\ \mu_\gamma = 8)'$ and diagonal variance matrix equal to $\mathrm{diag}\{\sigma_\alpha^2 = .25,\ \sigma_\beta^2 = 10,\ \sigma_\gamma^2 = 4\}$. The Gibbs sampler ran for $\ell = 50$ iterations for $G = 500$ parallel replications per iteration. We estimate the marginal posterior density of $x_{101}$ as
$$\hat p(x_{101} \mid Y_{100}) = G^{-1}\sum_{g=1}^{G} N\Bigl(x_{101} \Bigm| [F_t(x_{t-1})]^{(g)},\ \lambda_{101}^{(g)}\sigma_w^{2(g)}\Bigr), \qquad (6.209)$$
where $N(\cdot \mid a, b)$ denotes the normal density with mean $a$ and variance $b$, and
$$[F_t(x_{t-1})]^{(g)} = \alpha^{(g)}x_{t-1}^{(g)} + \beta^{(g)}x_{t-1}^{(g)}/(1 + x_{t-1}^{2(g)}) + \gamma^{(g)}\cos[1.2(t-1)].$$
The estimate, (6.209), with $G = 500$, is shown in Figure 6.16. Other aspects of the analysis, for example, the marginal posteriors of the elements of $\Theta$, can be found in Carlin et al. (1992).

Problems

Section 6.1

6.1 Consider a system process given by

$$x_t = -.9x_{t-2} + w_t, \qquad t = 1, \ldots, n,$$
where $x_0 \sim N(0, \sigma_0^2)$, $x_{-1} \sim N(0, \sigma_1^2)$, and $w_t$ is Gaussian white noise with variance $\sigma_w^2$. The system process is observed with noise, say,
$$y_t = x_t + v_t,$$
where $v_t$ is Gaussian white noise with variance $\sigma_v^2$. Further, suppose $x_0$, $x_{-1}$, $\{w_t\}$ and $\{v_t\}$ are independent.

(a) Write the system and observation equations in the form of a state space model.
(b) Find the values of $\sigma_0^2$ and $\sigma_1^2$ that make the observations, $y_t$, stationary.
(c) Generate $n = 100$ observations with $\sigma_w = 1$, $\sigma_v = 1$ and using the values of $\sigma_0^2$ and $\sigma_1^2$ found in (b). Do a time plot of $x_t$ and of $y_t$ and compare the two processes. Also, compare the sample ACF and PACF of $x_t$ and of $y_t$.
(d) Repeat (c), but with $\sigma_v = 10$.

6.2 Consider the state-space model presented in Example 6.3. Let $x_t^{t-1} = E(x_t \mid y_{t-1}, \ldots, y_1)$ and let $P_t^{t-1} = E(x_t - x_t^{t-1})^2$. The innovation sequence or residuals are $\epsilon_t = y_t - y_t^{t-1}$, where $y_t^{t-1} = E(y_t \mid y_{t-1}, \ldots, y_1)$. Find $\mathrm{cov}(\epsilon_s, \epsilon_t)$ in terms of $x_t^{t-1}$ and $P_t^{t-1}$ for (i) $s \neq t$ and (ii) $s = t$.


Section 6.2

6.3 Simulate n = 100 observations from the following state-space model:

$$x_t = .8x_{t-1} + w_t \quad\text{and}\quad y_t = x_t + v_t,$$
where $x_0 \sim N(0, 2.78)$, $w_t \sim \text{iid } N(0, 1)$, and $v_t \sim \text{iid } N(0, 1)$ are all mutually independent. Compute and plot the data, $y_t$, the one-step-ahead predictors, $y_t^{t-1}$, along with the root mean square prediction errors, $E^{1/2}(y_t - y_t^{t-1})^2$, using Figure 6.3 as a guide.

6.4 Suppose the vector $z = (x', y')'$, where $x$ ($p\times 1$) and $y$ ($q\times 1$) are jointly distributed with mean vectors $\mu_x$ and $\mu_y$ and with covariance matrix
$$\mathrm{cov}(z) = \begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{pmatrix}.$$
Consider projecting $x$ on $\mathcal{M} = \mathrm{sp}\{1, y\}$, say, $\hat{x} = b + By$.

(a) Show the orthogonality conditions can be written as
$$E(x - b - By) = 0,$$
$$E[(x - b - By)y'] = 0,$$
leading to the solutions
$$b = \mu_x - B\mu_y \quad\text{and}\quad B = \Sigma_{xy}\Sigma_{yy}^{-1}.$$
(b) Prove the mean square error matrix is
$$\mathrm{MSE} = E[(x - b - By)x'] = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.$$
(c) How can these results be used to justify the claim that, in the absence of normality, Property 6.1 yields the best linear estimate of the state $x_t$ given the data $Y_t$, namely, $x_t^t$, and its corresponding MSE, namely, $P_t^t$?

6.5 Projection Theorem Derivation of Property 6.2. Throughout this problem, we use the notation of Property 6.2 and of the Projection Theorem given in Appendix B, where $\mathcal{H}$ is $L^2$. If $\mathcal{L}_{k+1} = \mathrm{sp}\{y_1, \ldots, y_{k+1}\}$ and $\mathcal{V}_{k+1} = \mathrm{sp}\{y_{k+1} - y_{k+1}^k\}$, for $k = 0, 1, \ldots, n-1$, where $y_{k+1}^k$ is the projection of $y_{k+1}$ on $\mathcal{L}_k$, then $\mathcal{L}_{k+1} = \mathcal{L}_k \oplus \mathcal{V}_{k+1}$. We assume $P_0^0 > 0$ and $R > 0$.

(a) Show the projection of $x_k$ on $\mathcal{L}_{k+1}$, that is, $x_k^{k+1}$, is given by
$$x_k^{k+1} = x_k^k + H_{k+1}(y_{k+1} - y_{k+1}^k),$$
where $H_{k+1}$ can be determined by the orthogonality property
$$E\bigl\{\bigl(x_k - H_{k+1}(y_{k+1} - y_{k+1}^k)\bigr)\bigl(y_{k+1} - y_{k+1}^k\bigr)'\bigr\} = 0.$$
Show
$$H_{k+1} = P_k^k\Phi'A_{k+1}'\bigl[A_{k+1}P_{k+1}^k A_{k+1}' + R\bigr]^{-1}.$$

(b) Define $J_k = P_k^k\Phi'[P_{k+1}^k]^{-1}$, and show
$$x_k^{k+1} = x_k^k + J_k(x_{k+1}^{k+1} - x_{k+1}^k).$$

(c) Repeating the process, show
$$x_k^{k+2} = x_k^k + J_k(x_{k+1}^{k+1} - x_{k+1}^k) + H_{k+2}(y_{k+2} - y_{k+2}^{k+1}),$$
solving for $H_{k+2}$. Simplify and show
$$x_k^{k+2} = x_k^k + J_k(x_{k+1}^{k+2} - x_{k+1}^k).$$

(d) Using induction, conclude
$$x_k^n = x_k^k + J_k(x_{k+1}^n - x_{k+1}^k),$$
which yields the smoother with $k = t - 1$.

Section 6.3

6.6 Consider the univariate state-space model given by state conditions $x_0 = w_0$, $x_t = x_{t-1} + w_t$ and observations $y_t = x_t + v_t$, $t = 1, 2, \ldots$, where $w_t$ and $v_t$ are independent, Gaussian, white noise processes with $\mathrm{var}(w_t) = \sigma_w^2$ and $\mathrm{var}(v_t) = \sigma_v^2$.

(a) Show that $y_t$ follows an IMA(1,1) model, that is, $\nabla y_t$ follows an MA(1) model.
(b) Fit the model specified in part (a) to the logarithm of the glacial varve series and compare the results to those presented in Example 3.32.

6.7 Let yt represent the global temperature series (gtemp) shown in Figure 1.2.

(a) Fit a smoothing spline using gcv (the default) to $y_t$ and plot the result superimposed on the data. Repeat the fit using spar=.7; the gcv method yields spar=.5 approximately. (Example 2.14 on page 75 may help. Also in R, see the help file ?smooth.spline.)
(b) Write the model $y_t = x_t + v_t$ with $\nabla^2 x_t = w_t$, in state-space form. [Hint: The state will be a $2\times 1$ vector, say, $x_t = (x_t, x_{t-1})'$.] Assume $w_t$ and $v_t$ are independent Gaussian white noise processes, both independent of $x_0$. Fit this state-space model to $y_t$, and exhibit a time plot of the estimated smoother, $x_t^n$, and the corresponding error limits, $x_t^n \pm 2\sqrt{P_t^n}$, superimposed on the data.
(c) Superimpose all the fits from parts (a) and (b) [include the error bounds] on the data and briefly compare and contrast the results.

6.8 Smoothing Splines and the Kalman Smoother. Consider the discrete time version of the smoothing spline argument given in (2.56); that is, suppose we observe $y_t = x_t + v_t$ and we wish to fit $x_t$, for $t = 1, \ldots, n$, constrained to be smooth, by minimizing
$$\sum_{t=1}^{n}[y_t - x_t]^2 + \lambda\sum_{t=1}^{n}\bigl(\nabla^2 x_t\bigr)^2. \qquad (6.210)$$
Show that this problem is identical to obtaining $x_t^n$ in Problem 6.7(b), with $\lambda = \sigma_v^2/\sigma_w^2$, assuming $x_0 = 0$. Hint: Using the notation surrounding equation (6.63), the goal is to find the MLE of $X_n$ given $Y_n$, i.e., maximize $\log f(X_n \mid Y_n)$. Because of the Gaussianity, the maximum (or mode) of the distribution is when the states are estimated by $x_t^n$, the conditional means. But $\log f(X_n \mid Y_n) = \log f(X_n, Y_n) - \log f(Y_n)$, so maximizing $\log f(X_n, Y_n)$ with respect to $X_n$ is an equivalent problem. Now, ignore the initial state and write $-2\log f(X_n, Y_n)$ based on the model, which should look like (6.210); use (6.64) as a guide.

6.9 Consider the model
$$y_t = x_t + v_t,$$
where $v_t$ is Gaussian white noise with variance $\sigma_v^2$, $x_t$ are independent Gaussian random variables with mean zero and $\mathrm{var}(x_t) = r_t\sigma_x^2$ with $x_t$ independent of $v_t$, and $r_1, \ldots, r_n$ are known constants. Show that applying the EM algorithm to the problem of estimating $\sigma_x^2$ and $\sigma_v^2$ leads to updates (represented by hats)
$$\hat\sigma_x^2 = \frac{1}{n}\sum_{t=1}^{n}\frac{\sigma_t^2 + \mu_t^2}{r_t} \quad\text{and}\quad \hat\sigma_v^2 = \frac{1}{n}\sum_{t=1}^{n}\bigl[(y_t - \mu_t)^2 + \sigma_t^2\bigr],$$
where, based on the current estimates (represented by tildes),
$$\mu_t = \frac{r_t\tilde\sigma_x^2}{r_t\tilde\sigma_x^2 + \tilde\sigma_v^2}\,y_t \quad\text{and}\quad \sigma_t^2 = \frac{r_t\tilde\sigma_x^2\tilde\sigma_v^2}{r_t\tilde\sigma_x^2 + \tilde\sigma_v^2}.$$

6.10 To explore the stability of the filter, consider a univariate state-space model. That is, for $t = 1, 2, \ldots$, the observations are $y_t = x_t + v_t$ and the state equation is $x_t = \phi x_{t-1} + w_t$, where $\sigma_w = \sigma_v = 1$ and $|\phi| < 1$. The initial state, $x_0$, has zero mean and variance one.

(a) Exhibit the recursion for $P_t^{t-1}$ in Property 6.1 in terms of $P_{t-1}^{t-2}$.
(b) Use the result of (a) to verify $P_t^{t-1}$ approaches a limit ($t\to\infty$) $P$ that is the positive solution of $P^2 - \phi^2 P - 1 = 0$.
(c) With $K = \lim_{t\to\infty} K_t$ as given in Property 6.1, show $|1 - K| < 1$.
(d) Show, in steady-state, the one-step-ahead predictor, $y_{n+1}^n = E(y_{n+1} \mid y_n, y_{n-1}, \ldots)$, of a future observation satisfies
$$y_{n+1}^n = \sum_{j=1}^{\infty}\phi^j K(1-K)^{j-1}y_{n+1-j}.$$


6.11 In §6.3, we discussed that it is possible to obtain a recursion for the gradient vector, $-\partial\ln L_Y(\Theta)/\partial\Theta$. Assume the model is given by (6.1) and (6.2) and $A_t$ is a known design matrix that does not depend on $\Theta$, in which case Property 6.1 applies. For the gradient vector, show
$$\partial\ln L_Y(\Theta)/\partial\Theta_i = \sum_{t=1}^{n}\Bigl\{\epsilon_t'\Sigma_t^{-1}\frac{\partial\epsilon_t}{\partial\Theta_i} - \frac{1}{2}\epsilon_t'\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\Sigma_t^{-1}\epsilon_t + \frac{1}{2}\mathrm{tr}\Bigl(\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\Bigr)\Bigr\},$$
where the dependence of the innovation values on $\Theta$ is understood. In addition, with the general definition $\partial_i g = \partial g(\Theta)/\partial\Theta_i$, show the following recursions, for $t = 2, \ldots, n$, apply:

(i) $\partial_i\epsilon_t = -A_t\,\partial_i x_t^{t-1}$,
(ii) $\partial_i x_t^{t-1} = \partial_i\Phi\,x_{t-1}^{t-2} + \Phi\,\partial_i x_{t-1}^{t-2} + \partial_i K_{t-1}\,\epsilon_{t-1} + K_{t-1}\,\partial_i\epsilon_{t-1}$,
(iii) $\partial_i\Sigma_t = A_t\,\partial_i P_t^{t-1}A_t' + \partial_i R$,
(iv) $\partial_i K_t = \bigl[\partial_i\Phi\,P_t^{t-1}A_t' + \Phi\,\partial_i P_t^{t-1}A_t' - K_t\,\partial_i\Sigma_t\bigr]\Sigma_t^{-1}$,
(v) $\partial_i P_t^{t-1} = \partial_i\Phi\,P_{t-1}^{t-2}\Phi' + \Phi\,\partial_i P_{t-1}^{t-2}\Phi' + \Phi P_{t-1}^{t-2}\,\partial_i\Phi' + \partial_i Q - \partial_i K_{t-1}\,\Sigma_t K_{t-1}' - K_{t-1}\,\partial_i\Sigma_t\,K_{t-1}' - K_{t-1}\Sigma_t\,\partial_i K_{t-1}'$,

using the fact that $P_t^{t-1} = \Phi P_{t-1}^{t-2}\Phi' + Q - K_{t-1}\Sigma_t K_{t-1}'$.

6.12 Continuing with the previous problem, consider the evaluation of the Hessian matrix and the numerical evaluation of the asymptotic variance–covariance matrix of the parameter estimates. The information matrix satisfies
$$E\Bigl\{-\frac{\partial^2\ln L_Y(\Theta)}{\partial\Theta\,\partial\Theta'}\Bigr\} = E\Bigl\{\Bigl(\frac{\partial\ln L_Y(\Theta)}{\partial\Theta}\Bigr)\Bigl(\frac{\partial\ln L_Y(\Theta)}{\partial\Theta}\Bigr)'\Bigr\};$$
see Anderson (1984, Section 4.4), for example. Show the $(i, j)$-th element of the information matrix, say, $\mathcal{I}_{ij}(\Theta) = E\{-\partial^2\ln L_Y(\Theta)/\partial\Theta_i\,\partial\Theta_j\}$, is
$$\mathcal{I}_{ij}(\Theta) = \sum_{t=1}^{n}E\Bigl\{\partial_i\epsilon_t'\,\Sigma_t^{-1}\,\partial_j\epsilon_t + \frac{1}{2}\mathrm{tr}\bigl(\Sigma_t^{-1}\,\partial_i\Sigma_t\,\Sigma_t^{-1}\,\partial_j\Sigma_t\bigr) + \frac{1}{4}\mathrm{tr}\bigl(\Sigma_t^{-1}\,\partial_i\Sigma_t\bigr)\mathrm{tr}\bigl(\Sigma_t^{-1}\,\partial_j\Sigma_t\bigr)\Bigr\}.$$
Consequently, an approximate Hessian matrix can be obtained from the sample by dropping the expectation, $E$, in the above result and using only the recursions needed to calculate the gradient vector.


Section 6.4

6.13 As an example of the way the state-space model handles the missing data problem, suppose the first-order autoregressive process
$$x_t = \phi x_{t-1} + w_t$$
has an observation missing at $t = m$, leading to the observations $y_t = A_t x_t$, where $A_t = 1$ for all $t$, except $t = m$ wherein $A_t = 0$. Assume $x_0$ has mean zero and variance $\sigma_w^2/(1-\phi^2)$, where the variance of $w_t$ is $\sigma_w^2$. Show the Kalman smoother estimators in this case are
$$x_t^n = \begin{cases}\phi y_1 & t = 0,\\ \dfrac{\phi}{1+\phi^2}(y_{m-1} + y_{m+1}) & t = m,\\ y_t & t \neq 0, m,\end{cases}$$
with mean square covariances determined by
$$P_t^n = \begin{cases}\sigma_w^2 & t = 0,\\ \sigma_w^2/(1+\phi^2) & t = m,\\ 0 & t \neq 0, m.\end{cases}$$

6.14 The data set ar1miss is $n = 100$ observations generated from an AR(1) process, $x_t = \phi x_{t-1} + w_t$, with $\phi = .9$ and $\sigma_w = 1$, where 10% of the data has been zeroed out at random. Considering the zeroed out data to be missing data, use the results of Problem 6.13 to estimate the parameters of the model, $\phi$ and $\sigma_w$, using the EM algorithm, and then estimate the missing values.

Section 6.5

6.15 Using Example 6.10 as a guide, fit a structural model to the Federal Reserve Board Production Index data and compare it with the model fit in Example 3.46.

Section 6.6

6.16 (a) Fit an AR(2) to the recruitment series, $R_t$ in rec, and consider a lag-plot of the residuals from the fit versus the SOI series, $S_t$ in soi, at various lags, $S_{t-h}$, for $h = 0, 1, \ldots$. Use the lag-plot to argue that $S_{t-5}$ is reasonable to include as an exogenous variable.
(b) Fit an ARX(2) to $R_t$ using $S_{t-5}$ as an exogenous variable and comment on the results; include an examination of the innovations.


6.17 Use Property 6.6 to complete the following exercises.

(a) Write a univariate AR(1) model, $y_t = \phi y_{t-1} + v_t$, in state-space form. Verify your answer is indeed an AR(1).
(b) Repeat (a) for an MA(1) model, $y_t = v_t + \theta v_{t-1}$.
(c) Write an IMA(1,1) model, $y_t = y_{t-1} + v_t + \theta v_{t-1}$, in state-space form.

6.18 Verify Property 6.5.

6.19 Verify Property 6.6.

Section 6.7

6.20 Repeat the bootstrap analysis of Example 6.13 on the entire three-month Treasury bills and rate of inflation data set of 110 observations. Do the conclusions of Example 6.13—that the dynamics of the data are best described in terms of a fixed, rather than stochastic, regression—still hold?

Section 6.8

6.21 Fit the switching model described in Example 6.15 to the growth rate of GNP. The data are in gnp and, in the notation of the example, $y_t$ is log-GNP and $\nabla y_t$ is the growth rate. Use the code in Example 6.17 as a guide.

Section 6.9

6.22 Use the material presented in Example 6.21 to perform a Bayesian analysis of the model for the Johnson & Johnson data presented in Example 6.10.

6.23 Verify (6.194) and (6.195).

6.24 Verify (6.200) and (6.207).

Section 6.10

6.25 Fit a stochastic volatility model to the returns of one (or more) of the four financial time series available in the R datasets package as EuStockMarkets.


7

Statistical Methods in the Frequency Domain

7.1 Introduction

In previous chapters, we saw many applied time series problems that involved relating series to each other or to evaluating the effects of treatments or design parameters that arise when time-varying phenomena are subjected to periodic stimuli. In many cases, the nature of the physical or biological phenomena under study are best described by their Fourier components rather than by the difference equations involved in ARIMA or state-space models. The fundamental tools we use in studying periodic phenomena are the discrete Fourier transforms (DFTs) of the processes and their statistical properties. Hence, in §7.2, we review the properties of the DFT of a multivariate time series and discuss various approximations to the likelihood function based on the large-sample properties and the properties of the complex multivariate normal distribution. This enables extension of the classical techniques discussed in the following paragraphs to the multivariate time series case.

An extremely important class of problems in classical statistics develops when we are interested in relating a collection of input series to some output series. For example, in Chapter 2, we have previously considered relating temperature and various pollutant levels to daily mortality, but have not investigated the frequencies that appear to be driving the relation and have not looked at the possibility of leading or lagging effects. In Chapter 4, we isolated a definite lag structure that could be used to relate sea surface temperature to the number of new recruits. In Problem 5.13, the possible driving processes that could be used to explain inflow to Lake Shasta were hypothesized in terms of the possible inputs precipitation, cloud cover, temperature, and other variables. Identifying the combination of input factors that produce the best prediction for inflow is an example of multiple regression in the frequency domain, with the models treated theoretically by considering the regression, conditional on the random input processes.

A situation somewhat different from that above would be one in which the input series are regarded as fixed and known. In this case, we have a model


Fig. 7.1. Mean response of subjects to various combinations of periodic stimuli measured at the cortex (primary somatosensory, contralateral). In the first column, the subjects are awake, in the second column the subjects are under mild anesthesia. In the first row, the stimulus is a brush on the hand, the second row involves the application of heat, and the third row involves a low level shock.

analogous to that occurring in analysis of variance, in which the analysis now can be performed on a frequency by frequency basis. This analysis works especially well when the inputs are dummy variables, depending on some configuration of treatment and other design effects and when effects are largely dependent on periodic stimuli. As an example, we will look at a designed experiment measuring the fMRI brain responses of a number of awake and mildly anesthetized subjects to several levels of periodic brushing, heat, and shock effects. Some limited data from this experiment have been discussed previously in Example 1.6 of Chapter 1. Figure 7.1 shows mean responses to various levels of periodic heat, brushing, and shock stimuli for subjects awake and subjects under mild anesthesia. The stimuli were periodic in nature, applied alternately for 32 seconds (16 points) and then stopped for 32 seconds. The periodic input signal comes through under all three design conditions

when the subjects are awake, but is somewhat attenuated under anesthesia. The mean shock level response hardly shows on the input signal; shock levels were designed to simulate surgical incision without inflicting tissue damage. The means in Figure 7.1 are from a single location. Actually, for each individual, some nine series were recorded at various locations in the brain. It is natural to consider testing the effects of brushing, heat, and shock under the two levels of consciousness, using a time series generalization of analysis of variance. The R code used to generate Figure 7.1 is:

x = matrix(0, 128, 6)
for (i in 1:6) x[,i] = rowMeans(fmri[[i]])
colnames(x) = c("Brush", "Heat", "Shock", "Brush", "Heat", "Shock")
plot.ts(x, main="")
mtext("Awake", side=3, line=1.2, adj=.05, cex=1.2)
mtext("Sedated", side=3, line=1.2, adj=.85, cex=1.2)

A generalization to random coefficient regression is also considered, paralleling the univariate approach to signal extraction and detection presented in §4.9. This method enables a treatment of multivariate ridge-type regressions and inversion problems. Also, the usual random effects analysis of variance in the frequency domain becomes a special case of the random coefficient model.

The extension of frequency domain methodology to more classical approaches to multivariate discrimination and clustering is of interest in the frequency dependent case. Many time series differ in their means and in their autocovariance functions, making the use of both the mean function and the spectral density matrices relevant. As an example of such data, consider the bivariate series consisting of the P and S components derived from several earthquakes and explosions, such as those shown in Figure 7.2, where the P and S components, representing different arrivals, have been separated from the first and second halves, respectively, of waveforms like those shown originally in Figure 1.7.

Two earthquakes and two explosions from a set of eight earthquakes and explosions are shown in Figure 7.2 and some essential differences exist that might be used to characterize the two classes of events. Also, the frequency content of the two components of the earthquakes appears to be lower than those of the explosions, and relative amplitudes of the two classes appear to differ. For example, the ratio of the S to P amplitudes in the earthquake group is much higher for this restricted subset. Spectral differences were also noticed in Chapter 4, where the explosion processes had a stronger high-frequency component relative to the low-frequency contributions. Examples like these are typical of applications in which the essential differences between multivariate time series can be expressed by the behavior of either the frequency-dependent mean value functions or the spectral matrix. In discriminant analysis, these types of differences are exploited to develop combinations of linear and quadratic classification criteria. Such functions can then be used to classify events of unknown origin, such as the Novaya Zemlya event shown in Figure 7.2, which tends to bear a visual resemblance to the explosion group.


Fig. 7.2. Various bivariate earthquakes (EQ) and explosions (EX) recorded at 40 pts/sec compared with an event NZ (Novaya Zemlya) of unknown origin. Compressional waves, also known as primary or P waves, travel fastest in the Earth's crust and are first to arrive. Shear waves propagate more slowly through the Earth and arrive second; hence they are called secondary or S waves.

The R code used to produce Figure 7.2 is:

attach(eqexp)
P = 1:1024; S = P+1024
x = cbind(EQ5[P], EQ6[P], EX5[P], EX6[P], NZ[P], EQ5[S], EQ6[S],
          EX5[S], EX6[S], NZ[S])
x.name = c("EQ5","EQ6","EX5","EX6","NZ")
colnames(x) = c(x.name, x.name)
plot.ts(x, main="")
mtext("P waves", side=3, line=1.2, adj=.05, cex=1.2)
mtext("S waves", side=3, line=1.2, adj=.85, cex=1.2)

Finally, for multivariate processes, the structure of the spectral matrix is also of great interest. We might reduce the dimension of the underlying process to a smaller set of input processes that explain most of the variability in the cross-spectral matrix as a function of frequency. Principal component analysis can be used to decompose the spectral matrix into a smaller subset of component factors that explain decreasing amounts of power. For example, the hydrological data might be explained in terms of a component process that weights heavily on precipitation and inflow and one that weights heavily on temperature and cloud cover. Perhaps these two components could explain most of the power in the spectral matrix at a given frequency. The ideas behind principal component analysis can also be generalized to include an optimal scaling methodology for categorical data called the spectral envelope (see Stoffer et al., 1993). In succeeding sections, we also give an introduction to dynamic Fourier analysis and to wavelet analysis.

7.2 Spectral Matrices and Likelihood Functions

We have previously argued for an approximation to the log likelihood based on the joint distribution of the DFTs in (4.78), where we used the approximation as an aid in estimating parameters for certain parameterized spectra. In this chapter, we make heavy use of the fact that the sine and cosine transforms of the p × 1 vector process x_t = (x_{t1}, x_{t2}, ..., x_{tp})′ with mean E x_t = µ_t, say, with DFT¹

  X(ω_k) = n^{−1/2} ∑_{t=1}^{n} x_t e^{−2πiω_k t} = X_c(ω_k) − i X_s(ω_k)        (7.1)

and mean

  M(ω_k) = n^{−1/2} ∑_{t=1}^{n} µ_t e^{−2πiω_k t} = M_c(ω_k) − i M_s(ω_k)        (7.2)

will be approximately uncorrelated, where we evaluate at the usual Fourier frequencies {ω_k = k/n, 0 < |ω_k| < 1/2}. By Theorem C.6, the approximate 2p × 2p covariance matrix of the cosine and sine transforms, say, X(ω_k) = (X_c(ω_k)′, X_s(ω_k)′)′, is

  Σ(ω_k) = (1/2) ( C(ω_k)  −Q(ω_k)
                   Q(ω_k)   C(ω_k) ),        (7.3)

and the real and imaginary parts are jointly normal. This result implies, by the results stated in Appendix C, that the density function of the vector DFT, say, X(ω_k), can be approximated as

  p(ω_k) ≈ |f(ω_k)|^{−1} exp{ −(X(ω_k) − M(ω_k))* f^{−1}(ω_k) (X(ω_k) − M(ω_k)) },

where the spectral matrix is the usual

  f(ω_k) = C(ω_k) − i Q(ω_k).        (7.4)

¹ In previous chapters, the DFT of a process x_t was denoted by d_x(ω_k). In this chapter, we will consider the Fourier transforms of many different processes and so, to avoid the overuse of subscripts and to ease the notation, we use a capital letter, e.g., X(ω_k), to denote the DFT of x_t. This notation is standard in the digital signal processing (DSP) literature.

Certain computations that we do in the section on discriminant analysis will involve approximating the joint likelihood by the product of densities like the one given above over subsets of the frequency band 0 < ω_k < 1/2.

To use the likelihood function for estimating the spectral matrix, for example, we appeal to the limiting result implied by Theorem C.7 and again choose L frequencies in the neighborhood of some target frequency ω, say, X(ω_k ± k/n), for k = 1, ..., m and L = 2m + 1. Then, let X_ℓ, for ℓ = 1, ..., L, denote the indexed values, and note that the DFTs of the mean-adjusted vector process are approximately jointly normal with mean zero and complex covariance matrix f = f(ω). Then, write the log likelihood over the L sub-frequencies as

  ln L(X_1, ..., X_L; f) ≈ −L ln |f| − ∑_{ℓ=1}^{L} (X_ℓ − M_ℓ)* f^{−1} (X_ℓ − M_ℓ),        (7.5)

where we have suppressed the argument of f = f(ω) for ease of notation. The use of spectral approximations to the likelihood has been fairly standard, beginning with the work of Whittle (1961) and continuing in Brillinger (1981) and Hannan (1970). Assuming the mean-adjusted series are available, i.e., M_ℓ is known, we obtain the maximum likelihood estimator for f, namely,

  f̂ = L^{−1} ∑_{ℓ=1}^{L} (X_ℓ − M_ℓ)(X_ℓ − M_ℓ)*;        (7.6)

see Problem 7.2.
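To fix ideas, a minimal R sketch of the estimator (7.6) at a single target frequency might look as follows; here the n × p data matrix x is assumed to be mean adjusted (so M_ℓ = 0), and the target index k0 and the half-width m are hypothetical values introduced only for illustration.

# Sketch of (7.6): average the outer products of the vector DFTs over
# L = 2m+1 frequencies surrounding a target frequency.
n <- nrow(x); p <- ncol(x)
X <- mvfft(as.ts(x))/sqrt(n)       # p-variate DFTs at frequencies (j-1)/n
m <- 4; L <- 2*m + 1
k0 <- 41                           # index of the target frequency (hypothetical)
fhat <- matrix(0+0i, p, p)
for (j in (k0-m):(k0+m))
  fhat <- fhat + X[j, ] %*% Conj(t(X[j, ]))/L   # (7.6) with M_l = 0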

7.3 Regression for Jointly Stationary Series

In §4.8, we considered a model of the form

  y_t = ∑_{r=−∞}^{∞} β_{1r} x_{t−r,1} + v_t,        (7.7)

where x_{t1} is a single observed input series and y_t is the observed output series, and we are interested in estimating the filter coefficients β_{1r} relating the adjacent lagged values of x_{t1} to the output series y_t. In the case of the SOI and Recruitment series, we identified the El Niño driving series as x_{t1}, the input, and y_t, the Recruitment series, as the output.


Fig. 7.3. Monthly values of weather and inflow at Lake Shasta.

In general, more than a single plausible input series may exist. For example, the Lake Shasta inflow hydrological data (climhyd) shown in Figure 7.3 suggests there may be at least five possible series driving the inflow; see Example 7.1 for more details. Hence, we may envision a q × 1 input vector of driving series, say, x_t = (x_{t1}, x_{t2}, ..., x_{tq})′, and a q × 1 vector of regression functions β_r = (β_{1r}, β_{2r}, ..., β_{qr})′, which are related as

  y_t = ∑_{r=−∞}^{∞} β_r′ x_{t−r} + v_t = ∑_{j=1}^{q} ∑_{r=−∞}^{∞} β_{jr} x_{t−r,j} + v_t,        (7.8)

which shows that the output is a sum of linearly filtered versions of the input processes and a stationary noise process v_t, assumed to be uncorrelated with x_t. Each filtered component in the sum over j gives the contribution of lagged values of the j-th input series to the output series. We assume the regression functions β_{jr} are fixed and unknown.


The model given by (7.8) is useful under several different scenarios, corresponding to a number of different assumptions that can be made about the components. Assuming the input and output processes are jointly stationary with zero means leads to the conventional regression analysis given in this section. The analysis depends on theory that assumes we observe the output process y_t conditional on fixed values of the input vector x_t; this is the same as the assumptions made in conventional regression analysis. Assumptions considered later involve letting the coefficient vector β_t be a random unknown signal vector that can be estimated by Bayesian arguments, using the conditional expectation given the data. The answers to this approach, given in §7.5, allow signal extraction and deconvolution problems to be handled. Assuming the inputs are fixed allows various experimental designs and analysis of variance to be done for both fixed and random effects models. Estimation of the frequency-dependent random effects variance components in the analysis of variance model is also considered in §7.5.

For the approach in this section, assume the inputs and outputs have zero means and are jointly stationary with the (q + 1) × 1 vector process (x_t′, y_t)′ of inputs x_t and outputs y_t assumed to have a spectral matrix of the form

  f(ω) = ( f_xx(ω)  f_xy(ω)
           f_yx(ω)  f_yy(ω) ),        (7.9)

where f_yx(ω) = (f_{yx_1}(ω), f_{yx_2}(ω), ..., f_{yx_q}(ω)) is the 1 × q vector of cross-spectra relating the q inputs to the output and f_xx(ω) is the q × q spectral matrix of the inputs. Generally, we observe the inputs and search for the vector of regression functions β_t relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form

  ∑_{h=−∞}^{∞} |h| |γ_{jk}(h)| < ∞,   j, k = 1, ..., q + 1,        (7.10)

where γ_{jk}(h) is the autocovariance corresponding to the cross-spectrum f_{jk}(ω) in (7.9). We also need to assume a linear process of the form (C.35) as a condition for using Theorem C.7 on the joint distribution of the discrete Fourier transforms in the neighborhood of some fixed frequency.

Estimation of the Regression Function

In order to estimate the regression function β_r, the Projection Theorem (Appendix B) applied to minimizing

  MSE = E[ ( y_t − ∑_{r=−∞}^{∞} β_r′ x_{t−r} )² ]        (7.11)

leads to the orthogonality conditions

  E[ ( y_t − ∑_{r=−∞}^{∞} β_r′ x_{t−r} ) x_{t−s}′ ] = 0′        (7.12)

for all s = 0, ±1, ±2, ..., where 0′ denotes the 1 × q zero vector. Taking the expectations inside and substituting the definitions of the autocovariance functions leads to the normal equations

  ∑_{r=−∞}^{∞} β_r′ Γ_xx(s − r) = γ_yx′(s),        (7.13)

for s = 0, ±1, ±2, ..., where Γ_xx(s) denotes the q × q autocovariance matrix of the vector series x_t at lag s and γ_yx(s) = (γ_{yx_1}(s), ..., γ_{yx_q}(s)) is a 1 × q vector containing the lagged covariances between y_t and x_t. Again, a frequency domain approximate solution is easier in this case because the computations can be done frequency by frequency using cross-spectra that can be estimated from sample data using the DFT. In order to develop the frequency domain solution, substitute the spectral representation into the normal equations, using the same approach as in the simple case derived in §4.8. This approach yields

  ∫_{−1/2}^{1/2} ∑_{r=−∞}^{∞} β_r′ e^{2πiω(s−r)} f_xx(ω) dω = γ_yx′(s).

Now, because γ_yx′(s) is the Fourier transform of the cross-spectral vector f_yx(ω) = f_xy*(ω), we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as

  B′(ω) f_xx(ω) = f_xy*(ω),        (7.14)

where f_xx(ω) is the q × q spectral matrix of the inputs and B(ω) is the q × 1 vector Fourier transform of β_t. Multiplying (7.14) on the right by f_xx^{−1}(ω), assuming f_xx(ω) is nonsingular at ω, leads to the frequency domain estimator

  B′(ω) = f_xy*(ω) f_xx^{−1}(ω).        (7.15)

Note that (7.15) implies the regression function would take the form

  β_t = ∫_{−1/2}^{1/2} B(ω) e^{2πiωt} dω.        (7.16)

As before, it is conventional to introduce the DFT as the approximate estimator for the integral (7.16) and write

  β_t^M = M^{−1} ∑_{k=0}^{M−1} B(ω_k) e^{2πiω_k t},        (7.17)

where ω_k = k/M, M << n. The approximation was shown in Problem 4.35 to hold exactly as long as β_t = 0 for |t| ≥ M/2 and to have a mean-squared error bounded by a function of the zero-lag autocovariance and the absolute sum of the neglected coefficients.

The mean squared error (7.11) can be written using the orthogonality principle, giving

  MSE = ∫_{−1/2}^{1/2} f_{y·x}(ω) dω,        (7.18)

where

  f_{y·x}(ω) = f_yy(ω) − f_xy*(ω) f_xx^{−1}(ω) f_xy(ω)        (7.19)

denotes the residual or error spectrum. The resemblance of (7.19) to the usual equations in regression analysis is striking. It is useful to pursue the multiple regression analogy further by noting that a squared multiple coherence can be defined as

  ρ²_{y·x}(ω) = f_xy*(ω) f_xx^{−1}(ω) f_xy(ω) / f_yy(ω).        (7.20)

This expression leads to the mean squared error in the form

  MSE = ∫_{−1/2}^{1/2} f_yy(ω) [1 − ρ²_{y·x}(ω)] dω,        (7.21)

and we have an interpretation of ρ²_{y·x}(ω) as the proportion of power accounted for by the lagged regression on x_t at frequency ω. If ρ²_{y·x}(ω) = 0 for all ω, we have

  MSE = ∫_{−1/2}^{1/2} f_yy(ω) dω = E[y_t²],

which is the mean squared error when no predictive power exists. As long as f_xx(ω) is positive definite at all frequencies, MSE ≥ 0, and we will have

  0 ≤ ρ²_{y·x}(ω) ≤ 1        (7.22)

for all ω. If the multiple coherence is unity for all frequencies, the mean squared error in (7.21) is zero and the output series is perfectly predicted by a linearly filtered combination of the inputs. Problem 7.3 shows the ordinary squared coherence between the series y_t and the linearly filtered combinations of the inputs appearing in (7.11) is exactly (7.20).

Estimation Using Sampled Data

Clearly, the matrices of spectra and cross-spectra will not ordinarily be known, so the regression computations need to be based on sampled data. We assume, therefore, the inputs x_{t1}, x_{t2}, ..., x_{tq} and output y_t series are available at the time points t = 1, 2, ..., n, as in Chapter 4. In order to develop reasonable estimates for the spectral quantities, some replication must be assumed. Often, only one replication of each of the inputs and the output will exist, so it is necessary to assume a band exists over which the spectra and cross-spectra are approximately equal to f_xx(ω) and f_xy(ω), respectively. Then, let Y(ω_k + ℓ/n) and X(ω_k + ℓ/n) be the DFTs of y_t and x_t over the band, say, at frequencies of the form

  ω_k ± ℓ/n,   ℓ = 1, ..., m,

where L = 2m + 1 as before. Then, simply substitute the sample spectral matrix

  f̂_xx(ω) = L^{−1} ∑_{ℓ=−m}^{m} X(ω_k + ℓ/n) X*(ω_k + ℓ/n)        (7.23)

and the vector of sample cross-spectra

  f̂_xy(ω) = L^{−1} ∑_{ℓ=−m}^{m} X(ω_k + ℓ/n) Y*(ω_k + ℓ/n)        (7.24)

for the respective terms in (7.15) to get the regression estimator B̂(ω). For the regression estimator (7.17), we may use

  β̂_t^M = M^{−1} ∑_{k=0}^{M−1} f̂_xy*(ω_k) f̂_xx^{−1}(ω_k) e^{2πiω_k t}        (7.25)

for t = 0, ±1, ±2, ..., ±(M/2 − 1), as the estimated regression function.
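As a rough illustration of how (7.23)-(7.25) are used, the following R sketch forms B̂(ω) and the estimated regression function for a single input (q = 1) from smoothed DFTs; the vectors x and y and the lag range are hypothetical, and the sum in (7.25) is taken over all n Fourier frequencies (i.e., M = n). The script stoch.reg used in Example 7.1 below carries out the corresponding computations for general q.

# Sketch of (7.23)-(7.25) for one input series (q = 1); x and y are
# numeric vectors of the same length n (hypothetical data).
n  <- length(y)
X  <- fft(x)/sqrt(n); Y <- fft(y)/sqrt(n)        # DFTs at frequencies (j-1)/n
m  <- 12; L <- 2*m + 1                           # smooth over L = 2m+1 frequencies
sm <- function(u) filter(u, rep(1/L, L), circular = TRUE)
fxx <- sm(Mod(X)^2)                              # (7.23)
fxy <- sm(Re(X*Conj(Y))) + 1i*sm(Im(X*Conj(Y)))  # (7.24)
B   <- Conj(fxy)/fxx                             # (7.15): B(w) = f_xy*(w)/f_xx(w)
Fr  <- (0:(n-1))/n
lags <- -40:40                                   # lags at which to evaluate (7.25)
beta <- sapply(lags, function(h) Re(mean(B*exp(2i*pi*Fr*h))))
plot(lags, beta, type = "h")                     # estimated impulse responses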

Tests of Hypotheses

The estimated squared multiple coherence, corresponding to the theoretical coherence (7.20), becomes

  ρ̂²_{y·x}(ω) = f̂_xy*(ω) f̂_xx^{−1}(ω) f̂_xy(ω) / f̂_yy(ω).        (7.26)

We may obtain a distributional result for the multiple coherence function analogous to that obtained in the univariate case by writing the multiple regression model in the frequency domain, as was done in §4.6. We obtain the statistic

  F_{2q, 2(L−q)} = ((L − q)/q) · ρ̂²_{y·x}(ω) / [1 − ρ̂²_{y·x}(ω)],        (7.27)

which has an F-distribution with 2q and 2(L − q) degrees of freedom under the null hypothesis that ρ²_{y·x}(ω) = 0, or equivalently, that B(ω) = 0, in the model

  Y(ω_k + ℓ/n) = B′(ω) X(ω_k + ℓ/n) + V(ω_k + ℓ/n),        (7.28)

where the spectral density of the error V(ω_k + ℓ/n) is f_{y·x}(ω). Problem 7.4 sketches a derivation of this result.
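When plotting significance thresholds for the sample coherence, it is convenient to invert (7.27): a critical value F_α of the F_{2q,2(L−q)} distribution corresponds to the coherence threshold ρ²_α = q F_α /(L − q + q F_α). A minimal R sketch using the degrees of freedom stated above (compare the line cn = Fq/(L-1+Fq) in the code of Example 7.1, where q = 1):

# Coherence threshold implied by (7.27) at level alpha.
L <- 25; q <- 1; alpha <- .001
Fq <- qf(1 - alpha, 2*q, 2*(L - q))   # critical F value
cn <- q*Fq/(L - q + q*Fq)             # threshold for the squared coherence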


Table 7.1. ANOPOW for the Partitioned Regression Model

Source                              Power               Degrees of Freedom
x_{t,q1+1}, ..., x_{t,q1+q2}        SSR(ω)  (7.34)      2q2
Error                               SSE(ω)  (7.35)      2(L − q1 − q2)
Total                               L f̂_{y·1}(ω)        2(L − q1)

A second kind of hypothesis of interest is one that might be used to test whether a full model with q inputs is significantly better than some submodel with q1 < q components. In the time domain, this hypothesis implies, for a partition of the vector of inputs into q1 and q2 components (q1 + q2 = q), say, x_t = (x_{t1}′, x_{t2}′)′, and the similarly partitioned vector of regression functions β_t = (β_{1t}′, β_{2t}′)′, that we would be interested in testing whether β_{2t} = 0 in the partitioned regression model

  y_t = ∑_{r=−∞}^{∞} β_{1r}′ x_{t−r,1} + ∑_{r=−∞}^{∞} β_{2r}′ x_{t−r,2} + v_t.        (7.29)

Rewriting the regression model (7.29) in the frequency domain in a form that is similar to (7.28) establishes that, under the partitions of the spectral matrix into its qi × qj (i, j = 1, 2) submatrices, say,

  f̂_xx(ω) = ( f̂_11(ω)  f̂_12(ω)
              f̂_21(ω)  f̂_22(ω) ),        (7.30)

and the cross-spectral vector into its qi × 1 (i = 1, 2) subvectors,

  f̂_xy(ω) = ( f̂_1y(ω)
              f̂_2y(ω) ),        (7.31)

we may test the hypothesis β_{2t} = 0 at frequency ω by comparing the estimated residual power

  f̂_{y·x}(ω) = f̂_yy(ω) − f̂_xy*(ω) f̂_xx^{−1}(ω) f̂_xy(ω)        (7.32)

under the full model with that under the reduced model, given by

  f̂_{y·1}(ω) = f̂_yy(ω) − f̂_1y*(ω) f̂_11^{−1}(ω) f̂_1y(ω).        (7.33)

The power due to regression can be written as

  SSR(ω) = L [f̂_{y·1}(ω) − f̂_{y·x}(ω)],        (7.34)

with the usual error power given by

  SSE(ω) = L f̂_{y·x}(ω).        (7.35)

The test of no regression proceeds using the F-statistic

  F_{2q2, 2(L−q)} = ((L − q)/q2) · SSR(ω)/SSE(ω).        (7.36)

The distribution of this F-statistic with 2q2 numerator degrees of freedom and 2(L − q) denominator degrees of freedom follows from an argument paralleling that given in Chapter 4 for the case of a single input. The test results can be summarized in an Analysis of Power (ANOPOW) table that parallels the usual analysis of variance (ANOVA) table. Table 7.1 shows the components of power for testing β_{2t} = 0 at a particular frequency ω. The ratio of the two components divided by their respective degrees of freedom just yields the F-statistic (7.36) used for testing whether the q2 inputs add significantly to the predictive power of the regression on the q1 series.
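Once the smoothed full- and reduced-model error spectra in (7.32) and (7.33) are available as vectors over frequency, the partial F-statistic (7.36) is a short computation. A minimal R sketch follows; the object names fyx, fy1, L, q, and q2 are assumed here, and Example 7.1 obtains these quantities with the supplied script stoch.reg.

# Partial F-statistic (7.36) at each frequency.
#   fyx : full-model error spectrum, (7.32)
#   fy1 : reduced-model error spectrum, (7.33)
SSR <- L*(fy1 - fyx)                    # (7.34)
SSE <- L*fyx                            # (7.35)
eF  <- ((L - q)/q2)*SSR/SSE             # (7.36)
crit <- qf(1 - .001, 2*q2, 2*(L - q))   # pointwise .001 critical value
sig  <- which(eF > crit)                # frequencies flagged as significant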

Example 7.1 Predicting Lake Shasta Inflow

We illustrate some of the preceding ideas by considering the problem of predicting the transformed (logged) inflow series shown in Figure 7.3 from some combination of the inputs. First, look for the best single input predictor using the squared coherence function (7.26). The results, exhibited in Figure 7.4(a)-(e), show transformed (square root) precipitation produces the most consistently high squared coherence values at all frequencies (L = 25), with the seasonal period contributing most significantly. Other inputs, with the exception of wind speed, also appear to be plausible contributors. Figure 7.4(a)-(e) shows a .001 threshold corresponding to the F-statistic, separately, for each possible predictor of inflow.

Next, we focus on the analysis with two predictor series, temperature and transformed precipitation. The additional contribution of temperature to the model seems somewhat marginal because the multiple coherence (7.26), shown in the top panel of Figure 7.4(f), seems only slightly better than the univariate coherence with precipitation shown in Figure 7.4(e). It is, however, instructive to produce the multiple regression functions, using (7.25), to see if a simple model for inflow exists that would involve some regression combination of inputs temperature and precipitation that would be useful for predicting inflow to Shasta Lake. The top of Figure 7.5 shows the partial F-statistic, (7.36), for testing if temperature is predictive of inflow when precipitation is in the model. In addition, threshold values corresponding to a false discovery rate (FDR) of .001 (see Benjamini & Hochberg, 1995) and the corresponding null F quantile are displayed in that figure.

Although the contribution of temperature is marginal, it is instructive to produce the multiple regression functions, using (7.25), to see if a simple model for inflow exists that would involve some regression combination of inputs temperature and precipitation that would be useful for predicting inflow to Lake Shasta.


Fig. 7.4. Squared coherency between Lake Shasta inflow and (a) temperature; (b) dew point; (c) cloud cover; (d) wind speed; (e) precipitation. The multiple coherency between inflow and temperature – precipitation jointly is displayed in (f). In each case, the .001 threshold is exhibited as a horizontal line.

With this in mind, denoting the possible inputs P_t for transformed precipitation and T_t for transformed temperature, the regression functions have been plotted in the lower two panels of Figure 7.5 using a value of M = 100 for each of the two inputs. In that figure, the time index runs over both positive and negative values and is centered at time t = 0. Hence, the relation with temperature appears to be instantaneous and positive, and an exponentially decaying relation to precipitation exists that has been noticed previously in the analysis in Problem 4.37. The plots suggest a transfer function model of the general form fitted to the Recruitment and SOI series in Example 5.8. We might propose fitting the inflow output, say, I_t, using the model

  I_t = α_0 + δ_0/(1 − ω_1 B) P_t + α_2 T_t + η_t,


Fig. 7.5. Partial F-statistics [top] for testing whether temperature adds to the ability to predict Lake Shasta inflow when precipitation is included in the model. The dashed line indicates the .001 FDR level and the solid line represents the corresponding quantile of the null F distribution. Multiple impulse response functions for the regression relations of temperature [middle] and precipitation [bottom].

which is the transfer function model, without the temperature component, considered in that section.

The R code for this example is as follows.

attach(climhyd)
plot.ts(climhyd)   # Figure 7.3
Y = climhyd        # Y holds the transformed series
Y[,6] = log(Y[,6]) # log inflow
Y[,5] = sqrt(Y[,5]) # sqrt precipitation
L = 25; M = 100; alpha = .001; fdr = .001
nq = 2             # number of inputs (Temp and Precip)
# Spectral Matrix
Yspec = mvspec(Y, spans=L, kernel="daniell", detrend=TRUE, demean=FALSE, taper=.1)
n = Yspec$n.used        # effective sample size
Fr = Yspec$freq         # fundamental freqs
n.freq = length(Fr)     # number of frequencies
Yspec$bandwidth*sqrt(12) # = 0.050 - the bandwidth
# Coherencies (see sec 4.7 also)
Fq = qf(1-alpha, 2, L-2); cn = Fq/(L-1+Fq)
plt.name = c("(a)","(b)","(c)","(d)","(e)","(f)")
dev.new(); par(mfrow=c(2,3), cex.lab=1.2)
# The coherencies are listed as 1,2,...,15=choose(6,2)
for (i in 11:15){
  plot(Fr, Yspec$coh[,i], type="l", ylab="Sq Coherence", xlab="Frequency",
       ylim=c(0,1), main=c("Inflow with", names(climhyd[i-10])))
  abline(h = cn); text(.45,.98, plt.name[i-10], cex=1.2) }
# Multiple Coherency
coh.15 = stoch.reg(Y, cols.full = c(1,5), cols.red = NULL, alpha, L, M, plot.which = "coh")
text(.45 ,.98, plt.name[6], cex=1.2)
title(main = c("Inflow with", "Temp and Precip"))
# Partial F (called eF; avoid use of F alone)
numer.df = 2*nq; denom.df = Yspec$df-2*nq
dev.new()
par(mfrow=c(3,1), mar=c(3,3,2,1)+.5, mgp = c(1.5,0.4,0), cex.lab=1.2)
out.15 = stoch.reg(Y, cols.full = c(1,5), cols.red = 5, alpha, L, M, plot.which = "F.stat")
eF = out.15$eF
pvals = pf(eF, numer.df, denom.df, lower.tail = FALSE)
pID = FDR(pvals, fdr); abline(h=c(eF[pID]), lty=2)
title(main = "Partial F Statistic")
# Regression Coefficients
S = seq(from = -M/2+1, to = M/2 - 1, length = M-1)
plot(S, coh.15$Betahat[,1], type = "h", xlab = "", ylab = names(climhyd[1]),
     ylim = c(-.025, .055), lwd=2)
abline(h=0); title(main = "Impulse Response Functions")
plot(S, coh.15$Betahat[,2], type = "h", xlab = "Index", ylab = names(climhyd[5]),
     ylim = c(-.015, .055), lwd=2)
abline(h=0)

7.4 Regression with Deterministic Inputs

The previous section considered the case in which the input and output series were jointly stationary, but there are many circumstances in which we might want to assume that the input functions are fixed and have a known functional form. This happens in the analysis of data from designed experiments. For example, we may want to take a collection of earthquakes and explosions such as are shown in Figure 7.2 and test whether the mean functions are the same for either the P or S components or, perhaps, for them jointly. In certain other signal detection problems using arrays, the inputs are used as dummy variables to express lags corresponding to the arrival times of the signal at various elements, under a model corresponding to that of a plane wave from a fixed source propagating across the array. In Figure 7.1, we plotted the mean responses of the cortex as a function of various underlying design configurations corresponding to various stimuli applied to awake and mildly anesthetized subjects.

It is necessary to introduce a replicated version of the underlying model to handle even the univariate situation, and we replace (7.8) by

  y_jt = ∑_{r=−∞}^{∞} β_r′ z_{j,t−r} + v_jt        (7.37)

for j = 1, 2, ..., N series, where we assume the vector of known deterministic inputs, z_jt = (z_{jt1}, ..., z_{jtq})′, satisfies

  ∑_{t=−∞}^{∞} |t| |z_jtk| < ∞

for j = 1, ..., N replicates of an underlying process involving k = 1, ..., q regression functions. The model can also be treated under the assumption that the deterministic functions satisfy Grenander's conditions, as in Hannan (1970), but we do not need those conditions here and simply follow the approach in Shumway (1983, 1988).

It will sometimes be convenient in what follows to represent the model in matrix notation, writing (7.37) as

  y_t = ∑_{r=−∞}^{∞} z_{t−r} β_r + v_t,        (7.38)

where z_t = (z_1t, ..., z_Nt)′ are the N × q matrices of independent inputs and y_t and v_t are the N × 1 output and error vectors. The error vector v_t = (v_1t, ..., v_Nt)′ is assumed to be a multivariate, zero-mean, stationary, normal process with spectral matrix f_v(ω) I_N that is proportional to the N × N identity matrix. That is, we assume the error series v_jt are independently and identically distributed with spectral densities f_v(ω).

Example 7.2 An Infrasonic Signal from a Nuclear Explosion

Often, we will observe a common signal, say, β_t, on an array of sensors, with the response at the j-th sensor denoted by y_jt, j = 1, ..., N. For example, Figure 7.6 shows an infrasonic or low-frequency acoustic signal from a nuclear explosion, as observed on a small triangular array of N = 3 acoustic sensors.


Fig. 7.6. Three series for a nuclear explosion detonated 25 km south of Christmas Island and the delayed average or beam. The time scale is 10 points per second.

These signals appear at slightly different times. Because of the way signals propagate, a plane wave signal of this kind, from a given source, traveling at a given velocity, will arrive at elements in the array at predictable time delays. In the case of the infrasonic signal in Figure 7.6, the delays were approximated by computing the cross-correlation between elements and simply reading off the time delay corresponding to the maximum. For a detailed discussion of the statistical analysis of array signals, see Shumway et al. (1999).

A simple additive signal-plus-noise model of the form

  y_jt = β_{t−τ_j} + v_jt        (7.39)

can be assumed, where τ_j, j = 1, 2, ..., N are the time delays that determine the start point of the signal at each element of the array. The model (7.39) is written in the form (7.37) by letting z_jt = δ_{t−τ_j}, where δ_t = 1 when t = 0 and is zero otherwise. In this case, we are interested in both the problem of detecting the presence of the signal and in estimating its waveform β_t. In this case, a plausible estimator of the waveform would be the unbiased beam, say,

  β̂_t = N^{−1} ∑_{j=1}^{N} y_{j,t+τ_j},        (7.40)

where time delays in this case were measured as τ_1 = 17, τ_2 = 0, and τ_3 = −22 from the cross-correlation function. The bottom panel of Figure 7.6 shows the computed beam in this case, and the noise in the individual channels has been reduced and the essential characteristics of the common signal are retained in the average. The R code for this example is

attach(beamd)
tau = rep(0,3)
u = ccf(sensor1, sensor2, plot=FALSE)
tau[1] = u$lag[which.max(u$acf)]  # 17
u = ccf(sensor3, sensor2, plot=FALSE)
tau[3] = u$lag[which.max(u$acf)]  # -22
Y = ts.union(lag(sensor1,tau[1]), lag(sensor2, tau[2]), lag(sensor3, tau[3]))
beam = rowMeans(Y)
par(mfrow=c(4,1), mar=c(0,5.1,0,5.1), oma=c(6,0,5,0))
plot.ts(sensor1, xaxt="no")
title(main="Infrasonic Signals and Beam", outer=TRUE)
plot.ts(sensor2, xaxt="no"); plot.ts(sensor3, xaxt="no")
plot.ts(beam); title(xlab="Time", outer=TRUE)

The above discussion and example serve to motivate a more detailed look at the estimation and detection problems in the case in which the input series z_jt are fixed and known. We consider the modifications needed for this case in the following sections.

Estimation of the Regression Relation

Because the regression model (7.37) involves fixed functions, we may parallel the usual approach using the Gauss–Markov theorem to search for linearly filtered estimators of the form

  β̂_t = ∑_{j=1}^{N} ∑_{r=−∞}^{∞} h_jr y_{j,t−r},        (7.41)

where h_jt = (h_jt1, ..., h_jtq)′ is a vector of filter coefficients, determined so the estimators are unbiased and have minimum variance. The equivalent matrix form is

  β̂_t = ∑_{r=−∞}^{∞} h_r y_{t−r},        (7.42)

where h_t = (h_1t, ..., h_Nt) is a q × N matrix of filter functions. The matrix form resembles the usual classical regression case and is more convenient for extending the Gauss–Markov Theorem to lagged regression. The unbiased condition is considered in Problem 7.6. It can be shown (see Shumway and Dean, 1968) that h_js can be taken as the Fourier transform of

  H_j(ω) = S_z^{−1}(ω) Z_j(ω),        (7.43)

where

  Z_j(ω) = ∑_{t=−∞}^{∞} z_jt e^{−2πiωt}        (7.44)

is the infinite Fourier transform of z_jt. The matrix

  S_z(ω) = ∑_{j=1}^{N} Z̄_j(ω) Z_j′(ω)        (7.45)

can be written in the form

  S_z(ω) = Z*(ω) Z(ω),        (7.46)

where the N × q matrix Z(ω) is defined by Z(ω) = (Z_1(ω), ..., Z_N(ω))′. In matrix notation, the Fourier transform of the optimal filter becomes

  H(ω) = S_z^{−1}(ω) Z*(ω),        (7.47)

where H(ω) = (H_1(ω), ..., H_N(ω)) is the q × N matrix of frequency response functions. The optimal filter then becomes the Fourier transform

  h_t = ∫_{−1/2}^{1/2} H(ω) e^{2πiωt} dω.        (7.48)

If the transform is not tractable to compute, an approximation analogous to (7.25) may be used.

Example 7.3 Estimation of the Infrasonic Signal in Example 7.2

We consider the problem of producing a best linearly filtered unbiased estimator for the infrasonic signal in Example 7.2. In this case, q = 1 and (7.44) becomes

  Z_j(ω) = ∑_{t=−∞}^{∞} δ_{t−τ_j} e^{−2πiωt} = e^{−2πiωτ_j}

and S_z(ω) = N. Hence, we have

  H_j(ω) = (1/N) e^{2πiωτ_j}.

Using (7.48), we obtain h_jt = (1/N) δ(t + τ_j). Substituting in (7.41), we obtain the best linear unbiased estimator as the beam, computed as in (7.40).


Tests of Hypotheses

We consider first testing the hypothesis that the complete vector β_t is zero, i.e., that the vector signal is absent. We develop a test at each frequency ω by taking single adjacent frequencies of the form ω_k = k/n, as in the initial section. We may approximate the DFT of the observed vector in the model (7.37) using a representation of the form

  Y_j(ω_k) = B′(ω_k) Z_j(ω_k) + V_j(ω_k)        (7.49)

for j = 1, ..., N, where the error terms will be uncorrelated with common variance f(ω_k), the spectral density of the error term. The independent variables Z_j(ω_k) can either be the infinite Fourier transform, or they can be approximated by the DFT. Hence, we can obtain the matrix version of a complex regression model, written in the form

  Y(ω_k) = Z(ω_k) B(ω_k) + V(ω_k),        (7.50)

where the N × q matrix Z(ω_k) has been defined previously below (7.46) and Y(ω_k) and V(ω_k) are N × 1 vectors with the error vector V(ω_k) having mean zero, with covariance matrix f(ω_k) I_N. The usual regression arguments show that the maximum likelihood estimator for the regression coefficient will be

  B̂(ω_k) = S_z^{−1}(ω_k) s_zy(ω_k),        (7.51)

where S_z(ω_k) is given by (7.46) and

  s_zy(ω_k) = Z*(ω_k) Y(ω_k) = ∑_{j=1}^{N} Z̄_j(ω_k) Y_j(ω_k).        (7.52)

Also, the maximum likelihood estimator for the error spectral matrix is proportional to

  s²_{y·z}(ω_k) = ∑_{j=1}^{N} |Y_j(ω_k) − B̂′(ω_k) Z_j(ω_k)|²
               = Y*(ω_k) Y(ω_k) − Y*(ω_k) Z(ω_k) [Z*(ω_k) Z(ω_k)]^{−1} Z*(ω_k) Y(ω_k)
               = s²_y(ω_k) − s_zy*(ω_k) S_z^{−1}(ω_k) s_zy(ω_k),        (7.53)

where

  s²_y(ω_k) = ∑_{j=1}^{N} |Y_j(ω_k)|².        (7.54)

Under the null hypothesis that the regression coefficient B(ω_k) = 0, the estimator for the error power is just s²_y(ω_k). If smoothing is needed, we may replace (7.53) and (7.54) by smoothed components over the frequencies


Table 7.2. Analysis of Power (ANOPOW) for Testing No Contribution from the Independent Series at Frequency ω in the Fixed Input Case

Source        Power               Degrees of Freedom
Regression    SSR(ω)  (7.55)      2Lq
Error         SSE(ω)  (7.56)      2L(N − q)
Total         SST(ω)              2LN

ω_k + ℓ/n, for ℓ = −m, ..., m and L = 2m + 1, close to ω. In that case, we obtain the regression and error spectral components as

  SSR(ω) = ∑_{ℓ=−m}^{m} s_zy*(ω_k + ℓ/n) S_z^{−1}(ω_k + ℓ/n) s_zy(ω_k + ℓ/n)        (7.55)

and

  SSE(ω) = ∑_{ℓ=−m}^{m} s²_{y·z}(ω_k + ℓ/n).        (7.56)

The F-statistic for testing no regression relation is

  F_{2Lq, 2L(N−q)} = ((N − q)/q) · SSR(ω)/SSE(ω).        (7.57)

The analysis of power pertaining to this situation appears in Table 7.2.

In the fixed regression case, we consider the partitioned hypothesis that is the analog of β_{2t} = 0 in (7.29), with x_t1, x_t2 replaced by z_t1, z_t2. Here, we partition S_z(ω) into q_i × q_j (i, j = 1, 2) submatrices, say,

  S_z(ω_k) = ( S_11(ω_k)  S_12(ω_k)
               S_21(ω_k)  S_22(ω_k) ),        (7.58)

and the cross-spectral vector into its q_i × 1, for i = 1, 2, subvectors

  s_zy(ω_k) = ( s_1y(ω_k)
                s_2y(ω_k) ).        (7.59)

Here, we test the hypothesis β_{2t} = 0 at frequency ω by comparing the residual power (7.53) under the full model with the residual power under the reduced model, given by

  s²_{y·1}(ω_k) = s²_y(ω_k) − s_1y*(ω_k) S_11^{−1}(ω_k) s_1y(ω_k).        (7.60)

Again, it is desirable to add over adjacent frequencies with roughly comparable spectra so the regression and error power components can be taken as

  SSR(ω) = ∑_{ℓ=−m}^{m} [ s²_{y·1}(ω_k + ℓ/n) − s²_{y·z}(ω_k + ℓ/n) ]        (7.61)


Table 7.3. Analysis of Power (ANOPOW) for Testing No Contribution from the Last q2 Inputs in the Fixed Input Case

Source        Power               Degrees of Freedom
Regression    SSR(ω)  (7.61)      2Lq2
Error         SSE(ω)  (7.62)      2L(N − q)
Total         SST(ω)              2L(N − q1)

and

  SSE(ω) = ∑_{ℓ=−m}^{m} s²_{y·z}(ω_k + ℓ/n).        (7.62)

The information can again be summarized as in Table 7.3, where the ratio of mean power regression and error components leads to the F-statistic

  F_{2Lq2, 2L(N−q)} = ((N − q)/q2) · SSR(ω)/SSE(ω).        (7.63)

We illustrate the analysis of power procedure using the infrasonic signal detection procedure of Example 7.2.

Example 7.4 Detecting the Infrasonic Signal Using ANOPOW

We consider the problem of detecting the common signal for the three infrasonic series observing the common signal, as shown in Figure 7.6. The presence of the signal is obvious in the waveforms shown, so the test here mainly confirms the statistical significance and isolates the frequencies containing the strongest signal components. Each series contained n = 2048 points, sampled at 10 points per second. We use the model in (7.39) so Z_j(ω) = e^{−2πiωτ_j} and S_z(ω) = N as in Example 7.3, with s_zy(ω_k) given as

  s_zy(ω_k) = ∑_{j=1}^{N} e^{2πiωτ_j} Y_j(ω_k),

using (7.45) and (7.52). The above expression can be interpreted as being proportional to the weighted mean or beam, computed in frequency, and we introduce the notation

  B_w(ω_k) = (1/N) ∑_{j=1}^{N} e^{2πiωτ_j} Y_j(ω_k)        (7.64)

for that term. Substituting for the power components in Table 7.3 yields

  s_zy*(ω_k) S_z^{−1}(ω_k) s_zy(ω_k) = N |B_w(ω_k)|²

and


Fig. 7.7. Analysis of power for the infrasound array on a log scale (top panel) with SST(ω) shown as a solid line and SSE(ω) as a dashed line. The F-statistics (bottom panel) showing detections with the dashed line based on an FDR level of .001 and the solid line corresponding to the null F quantile.

  s²_{y·z}(ω_k) = ∑_{j=1}^{N} |Y_j(ω_k) − B_w(ω_k)|² = ∑_{j=1}^{N} |Y_j(ω_k)|² − N |B_w(ω_k)|²

for the regression signal and error components, respectively. Because only three elements in the array and a reasonable number of points in time exist, it seems advisable to employ some smoothing over frequency to obtain additional degrees of freedom. In this case, L = 9, yielding 2(9) = 18 and 2(9)(3 − 1) = 36 degrees of freedom for the numerator and denominator of the F-statistic (7.57). The top of Figure 7.7 shows the analysis of power components due to error and the total power. The power is maximum at about .002 cycles per point or about .02 cycles per second. The F-statistic is compared with the .001 FDR and the corresponding null F significance in the bottom panel and has the strongest detection at about .02 cycles per second. Little power of consequence appears to exist elsewhere; however, there is some marginally significant signal power near the .5 cycles per second frequency band.

The R code for this example is as follows.

attach(beamd)
L = 9; fdr = .001; N = 3
Y = cbind(beamd, beam=rowMeans(beamd))
n = nextn(nrow(Y))
Y.fft = mvfft(as.ts(Y))/sqrt(n)
Df = Y.fft[,1:3]                    # fft of the data
Bf = Y.fft[,4]                      # beam fft
ssr = N*Re(Bf*Conj(Bf))             # raw signal spectrum
sse = Re(rowSums(Df*Conj(Df))) - ssr # raw error spectrum
# Smooth
SSE = filter(sse, sides=2, filter=rep(1/L,L), circular=TRUE)
SSR = filter(ssr, sides=2, filter=rep(1/L,L), circular=TRUE)
SST = SSE + SSR
par(mfrow=c(2,1), mar=c(4,4,2,1)+.1)
Fr = 0:(n-1)/n; nFr = 1:200         # number of freqs to plot
plot(Fr[nFr], SST[nFr], type="l", ylab="log Power", xlab="",
     main="Sum of Squares", log="y")
lines(Fr[nFr], SSE[nFr], type="l", lty=2)
eF = (N-1)*SSR/SSE; df1 = 2*L; df2 = 2*L*(N-1)
pvals = pf(eF, df1, df2, lower=FALSE)  # p values for FDR
pID = FDR(pvals, fdr); Fq = qf(1-fdr, df1, df2)
plot(Fr[nFr], eF[nFr], type="l", ylab="F-statistic", xlab="Frequency",
     main="F Statistic")
abline(h=c(Fq, eF[pID]), lty=1:2)

Although there are examples of detecting multiple regression functions of the general type considered above (see, for example, Shumway, 1983), we do not consider additional examples of partitioning in the fixed input case here. The reason is that several examples exist in the section on designed experiments that illustrate the partitioned approach.

7.5 Random Coefficient Regression

The lagged regression models considered so far have assumed the input process is either stochastic or fixed and the components of the vector of regression functions β_t are fixed and unknown parameters to be estimated. There are many cases in time series analysis in which it is more natural to regard the regression vector as an unknown stochastic signal. For example, we have studied the state-space model in Chapter 6, where the state equation can be considered as involving a random parameter vector that is essentially a multivariate autoregressive process. In §4.10, we considered estimating the univariate regression function β_t as a signal extraction problem.

In this section, we consider a random coefficient regression model of (7.38) in the equivalent form

  y_t = ∑_{r=−∞}^{∞} z_{t−r} β_r + v_t,        (7.65)


where y_t = (y_1t, ..., y_Nt)′ is the N × 1 response vector and z_t = (z_1t, ..., z_Nt)′ are the N × q matrices containing the fixed input processes. Here, the components of the q × 1 regression vector β_t are zero-mean, uncorrelated, stationary series with common spectral matrix f_β(ω) I_q, and the error series v_t have zero means and spectral matrix f_v(ω) I_N, where I_N is the N × N identity matrix. Then, defining the N × q matrix Z(ω) = (Z_1(ω), Z_2(ω), ..., Z_N(ω))′ of Fourier transforms of z_t, as in (7.44), it is easy to show the spectral matrix of the response vector y_t is given by

  f_y(ω) = f_β(ω) Z(ω) Z*(ω) + f_v(ω) I_N.        (7.66)

The regression model with a stochastic stationary signal component is a general version of the simple additive noise model

  y_t = β_t + v_t,

considered by Wiener (1949) and Kolmogorov (1941), who derived the minimum mean squared error estimators for β_t, as in §4.10. The more general multivariate version (7.65) represents the series as a convolution of the signal vector β_t and a known set of vector input series contained in the matrix z_t. Restricting the covariance matrices of signal and noise to diagonal form is consistent with what is done in statistics using random effects models, which we consider here in a later section. The problem of estimating the regression function β_t is often called deconvolution in the engineering and geophysical literature.

Estimation of the Regression Relation

The regression function β_t can be estimated by a general filter of the form (7.42), where we write that estimator in matrix form

  β̂_t = ∑_{r=−∞}^{∞} h_r y_{t−r},        (7.67)

where h_t = (h_1t, ..., h_Nt), and apply the orthogonality principle, as in §3.9. A generalization of the argument in that section (see Problem 7.7) leads to the estimator

  H(ω) = [S_z(ω) + θ(ω) I_q]^{−1} Z*(ω)        (7.68)

for the Fourier transform of the minimum mean-squared error filter, where the parameter

  θ(ω) = f_v(ω) / f_β(ω)        (7.69)

is the inverse of the signal-to-noise ratio. It is clear from the frequency domain version of the linear model (7.50) that the comparable version of the estimator (7.51) can be written as

  B̂(ω) = [S_z(ω) + θ(ω) I_q]^{−1} s_zy(ω).        (7.70)

This version exhibits the estimator in the stochastic regressor case as the usual estimator, with a ridge correction, θ(ω), that is proportional to the inverse of the signal-to-noise ratio. The mean-squared covariance of the estimator is shown to be

  E[(B̂ − B)(B̂ − B)*] = f_v(ω) [S_z(ω) + θ(ω) I_q]^{−1},        (7.71)

which again exhibits the close connection between this case and the variance of the estimator (7.51), which can be shown to be f_v(ω) S_z^{−1}(ω).
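To see the effect of the ridge correction in (7.70), the following R sketch compares the fixed-coefficient estimator (7.51) with the random-coefficient estimator (7.70) at a single frequency; the objects Zw (the N × q matrix of input transforms), Yw (the N × 1 vector of output transforms), and snr (an assumed signal-to-noise ratio 1/θ(ω)) are hypothetical names used only for illustration.

# Fixed- versus random-coefficient estimators at one frequency (sketch).
q     <- ncol(Zw)
Sz    <- Conj(t(Zw)) %*% Zw                   # (7.46)
szy   <- Conj(t(Zw)) %*% Yw                   # (7.52)
theta <- 1/snr                                # (7.69): theta(w) = f_v(w)/f_beta(w)
B.fixed  <- solve(Sz, szy)                    # (7.51)
B.random <- solve(Sz + theta*diag(q), szy)    # (7.70): ridge-corrected estimator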

Example 7.5 Estimating the Random Infrasonic Signal

In Example 7.4, we have already determined the components needed in (7.68) and (7.69) to obtain the estimators for the random signal. The Fourier transform of the optimum filter at series j has the form

  H_j(ω) = e^{2πiωτ_j} / [N + θ(ω)],        (7.72)

with the mean-squared error given by f_v(ω)/[N + θ(ω)] from (7.71). The net effect of applying the filters will be the same as filtering the beam with the frequency response function

  H_0(ω) = N / [N + θ(ω)] = N f_β(ω) / [f_v(ω) + N f_β(ω)],        (7.73)

where the last form is more convenient in cases in which portions of the signal spectrum are essentially zero.

The optimal filters h_t have frequency response functions that depend on the signal spectrum f_β(ω) and noise spectrum f_v(ω), so we will need estimators for these parameters to apply the optimal filters. Sometimes, there will be values, suggested from experience, for the signal-to-noise ratio 1/θ(ω) as a function of frequency. The analogy between the model here and the usual variance components model in statistics, however, suggests we try an approach along those lines as in the next section.

Detection and Parameter Estimation

The analogy to the usual variance components situation suggests looking at the regression and error components of Table 7.2 under the stochastic signal assumptions. We consider the components of (7.55) and (7.56) at a single frequency ω_k. In order to estimate the spectral components f_β(ω) and f_v(ω), we reconsider the linear model (7.50) under the assumption that B(ω_k) is a random process with spectral matrix f_β(ω_k) I_q. Then, the spectral matrix of the observed process is (7.66), evaluated at frequency ω_k.


Consider first the component of the regression power, defined as

  SSR(ω_k) = s_zy*(ω_k) S_z^{−1}(ω_k) s_zy(ω_k) = Y*(ω_k) Z(ω_k) S_z^{−1}(ω_k) Z*(ω_k) Y(ω_k).

A computation shows

  E[SSR(ω_k)] = f_β(ω_k) tr{S_z(ω_k)} + q f_v(ω_k),

where tr denotes the trace of a matrix. If we can find a set of frequencies of the form ω_k + ℓ/n, where the spectra and the Fourier transforms S_z(ω_k + ℓ/n) ≈ S_z(ω) are relatively constant, the expectation of the averaged values in (7.55) yields

  E[SSR(ω)] = L f_β(ω) tr[S_z(ω)] + L q f_v(ω).        (7.74)

A similar computation establishes

  E[SSE(ω)] = L (N − q) f_v(ω).        (7.75)

We may obtain approximately unbiased estimators for the spectra f_v(ω) and f_β(ω) by replacing the expected power components by their observed values and solving (7.74) and (7.75).

Example 7.6 Estimating the Power Components and the Random Infrasonic Signal

In order to provide an optimum estimator for the infrasonic signal, we need to have estimators for the signal and noise spectra f_β(ω) and f_v(ω) for the case considered in Example 7.5. The form of the filter is H_0(ω), given in (7.73), and having q = 1 and S_z(ω) = N at all frequencies in this example simplifies the computations considerably. We may estimate

  f̂_v(ω) = SSE(ω) / [L(N − 1)]        (7.76)

and

  f̂_β(ω) = (LN)^{−1} [ SSR(ω) − SSE(ω)/(N − 1) ],        (7.77)

using (7.74) and (7.75) for this special case. Cases will exist in which (7.77) is negative, and the estimated signal spectrum can be set to zero for those frequencies. The estimators can be substituted into the optimal filters to apply to the beam, say, H_0(ω) in (7.73), or to use in the filter applied to each level, (7.72).

The analysis of variance estimators can be computed using the analysis of power given in Figure 7.7, and the results of that computation and applying (7.76) and (7.77) are shown in Figure 7.8(a) for a bandwidth of B = 9/2048 cycles per point or about .04 cycles per second (Hz). Neither spectrum contains any significant power for frequencies greater than .1 Hz.


Fig. 7.8. (a) Estimated signal (solid line) and noise (dashed line) spectra; (b) Estimated frequency and (c) corresponding impulse filter responses; (d) Raw beam (dashed line) with superimposed filtered beam (solid line).

As expected, the signal spectral estimator is substantial over a narrow band, and this leads to an estimated filter, with estimated frequency response function Ĥ_0(ω), shown in Figure 7.8(b). The estimated optimal filter essentially deletes frequencies above .1 Hz and, subject to slight modification, differs little from a standard low-pass filter with that cutoff. Computing the time version with a cutoff at M = 101 points and using a taper leads to the estimated impulse response function ĥ_0(t), as shown in Figure 7.8(c). Finally, we apply the optimal filter to the beam and get the filtered beam β̂_t shown in Figure 7.8(d). The analysis shows the primary signal as basically a low-frequency signal with primary power at about .05 Hz or, essentially, a wave with a 20-second period.

The R code for this example is as follows.


attach(beamd)
L = 9; M = 100; M2 = M/2; N = 3
Y = cbind(beamd, beam <- rowMeans(beamd))
n = nextn(nrow(Y)); n.freq = n/2
Y[,1:3] = Y[,1:3] - Y[,4]
Y.fft = mvfft(as.ts(Y))/sqrt(n)
Ef = Y.fft[,1:3]                 # fft of the error
Bf = Y.fft[,4]                   # beam fft
ssr = N*Re(Bf*Conj(Bf))          # Raw Signal Spectrum
sse = Re(rowSums(Ef*Conj(Ef)))   # Raw Error Spectrum
# Smooth
SSE = filter(sse, sides=2, filter=rep(1/L,L), circular=TRUE)
SSR = filter(ssr, sides=2, filter=rep(1/L,L), circular=TRUE)
# Estimate Signal and Noise Spectra
fv = SSE/(L*(N-1))               # Equation (7.76)
fb = (SSR-SSE/(N-1))/(L*N)       # Equation (7.77)
fb[fb<0] = 0; H0 = N*fb/(fv+N*fb)
H0[ceiling(.04*n):n] = 0         # zero out H0 beyond frequency .04
# Extend components to make it a valid transform
H0 = c(H0[1:n.freq], rev(H0[2:(n.freq+1)]))
h0 = Re(fft(H0, inverse = TRUE)) # Impulse Response
h0 = c(rev(h0[2:(M2+1)]), h0[1:(M2+1)])      # center it
h1 = spec.taper(h0, p = .5); k1 = h1/sum(h1) # taper/normalize it
f.beam = filter(Y$beam, filter=k1, sides=2)  # filter it
# Graphics
nFr = 1:50          # number of freqs displayed
Fr = (nFr-1)/n      # frequencies
layout(matrix(c(1, 2, 4, 1, 3, 4), nc=2)); par(mar=c(4,4,2,1)+.1)
plot(10*Fr, fb[nFr], type="l", ylab="Power", xlab="Frequency (Hz)")
lines(10*Fr, fv[nFr], lty=2); text(.24, 5, "(a)", cex=1.2)
plot(10*Fr, H0[nFr], type="l", ylab="Frequency Response", xlab="Frequency (Hz)")
text(.23, .84, "(b)", cex=1.2)
plot(-M2:M2, k1, type="l", ylab="Impulse Response", xlab="Index", lwd=1.5)
text(45, .022, "(c)", cex=1.2)
ts.plot(cbind(f.beam,beam), lty=1:2, ylab="beam")
text(2040, 2, "(d)", cex=1.2)

7.6 Analysis of Designed Experiments

An important special case (see Brillinger, 1973, 1980) of the regression model (7.49) occurs when the regression (7.38) is of the form

  y_t = z β_t + v_t,        (7.78)

where z = (z_1, z_2, ..., z_N)′ is a matrix that determines what is observed by the j-th series; i.e.,

  y_jt = z_j′ β_t + v_jt.        (7.79)

In this case, the matrix z of independent variables is constant and we will have the frequency domain model

  Y(ω_k) = Z B(ω_k) + V(ω_k),        (7.80)

corresponding to (7.50), where the matrix Z(ω_k) was a function of frequency ω_k. The matrix is purely real in this case, but the equations (7.51)-(7.57) can be applied with Z(ω_k) replaced by the constant matrix Z.

Equality of Means

A typical general problem that we encounter in analyzing real data is a simple equality of means test in which there might be a collection of time series y_ijt, i = 1, ..., I, j = 1, ..., N_i, belonging to I possible groups, with N_i series in group i. To test equality of means, we may write the regression model in the form

  y_ijt = µ_t + α_it + v_ijt,        (7.81)

where µ_t denotes the overall mean and α_it denotes the effect of the i-th group at time t, and we require that ∑_i α_it = 0 for all t. In this case, the full model can be written in the general regression notation as

  y_ijt = z_ij′ β_t + v_ijt,

where

  β_t = (µ_t, α_1t, α_2t, ..., α_{I−1,t})′

denotes the regression vector, subject to the constraint. The reduced model becomes

  y_ijt = µ_t + v_ijt        (7.82)

under the assumption that the group means are equal. In the full model, there are I possible values for the I × 1 design vectors z_ij; the first component is always one for the mean, and the rest have a one in the i-th position for i = 1, ..., I − 1 and zeros elsewhere. The vectors for the last group take the value −1 for i = 2, 3, ..., I − 1. Under the reduced model, each z_ij is a single column of ones. The rest of the analysis follows the approach summarized in (7.51)-(7.57). In this particular case, the power components in Table 7.3 (before smoothing) simplify to

  SSR(ω_k) = ∑_{i=1}^{I} ∑_{j=1}^{N_i} |Y_{i·}(ω_k) − Y_{··}(ω_k)|²        (7.83)

and

  SSE(ω_k) = ∑_{i=1}^{I} ∑_{j=1}^{N_i} |Y_{ij}(ω_k) − Y_{i·}(ω_k)|²,        (7.84)


which are analogous to the usual sums of squares in analysis of variance. Note that a dot (·) stands for a mean, taken over the appropriate subscript, so the regression power component SSR(ω_k) is basically the power in the residuals of the group means from the overall mean, and the error power component SSE(ω_k) reflects the departures of the group means from the original data values. Smoothing each component over L frequencies leads to the usual F-statistic (7.63) with 2L(I − 1) and 2L(∑_i N_i − I) degrees of freedom at each frequency ω of interest.

Example 7.7 Means Test for the fMRI Data

Figure 7.1 showed the mean responses of subjects to various levels of periodic stimulation while awake and while under anesthesia, as collected in a pain perception experiment of Antognini et al. (1997). Three types of periodic stimuli were presented to awake and anesthetized subjects, namely, brushing, heat, and shock. The periodicity was introduced by applying the stimuli, brushing, heat, and shocks, in on-off sequences lasting 32 seconds each, and the sampling rate was one point every two seconds. The blood oxygenation level (BOLD) signal intensity (Ogawa et al., 1990) was measured at nine locations in the brain. Areas of activation were determined using a technique first described by Bandettini et al. (1993). The specific locations of the brain where the signal was measured were Cortex 1: Primary Somatosensory, Contralateral, Cortex 2: Primary Somatosensory, Ipsilateral, Cortex 3: Secondary Somatosensory, Contralateral, Cortex 4: Secondary Somatosensory, Ipsilateral, Caudate, Thalamus 1: Contralateral, Thalamus 2: Ipsilateral, Cerebellum 1: Contralateral, and Cerebellum 2: Ipsilateral. Figure 7.1 shows the mean response of subjects at Cortex 1 for each of the six treatment combinations, 1: Awake-Brush (5 subjects), 2: Awake-Heat (4 subjects), 3: Awake-Shock (5 subjects), 4: Low-Brush (3 subjects), 5: Low-Heat (5 subjects), and 6: Low-Shock (4 subjects). The objective of this first analysis is to test equality of these six group means, paying special attention to the 64-second period band (1/64 cycles per second) expected from the periodic driving stimuli. Because a test of equality is needed at each of the nine brain locations, we took α = .001 to control for the overall error rate. Figure 7.9 shows F-statistics, computed from (7.63), with L = 3, and we see substantial signals for the four cortex locations and for the second cerebellum trace, but the effects are nonsignificant in the caudate and thalamus regions. Hence, we will retain the four cortex locations and the second cerebellum location for further analysis.

The R code for this example is as follows.

n = 128                # length of series
n.freq = 1 + n/2       # number of frequencies
Fr = (0:(n.freq-1))/n  # the frequencies
N = c(5,4,5,3,5,4)     # number of series for each cell
n.subject = sum(N)     # number of subjects (26)
n.trt = 6              # number of treatments
L = 3                  # for smoothing
num.df = 2*L*(n.trt-1)          # df for F test
den.df = 2*L*(n.subject-n.trt)
# Design Matrix (Z):
Z1 = outer(rep(1,N[1]), c(1,1,0,0,0,0))
Z2 = outer(rep(1,N[2]), c(1,0,1,0,0,0))
Z3 = outer(rep(1,N[3]), c(1,0,0,1,0,0))
Z4 = outer(rep(1,N[4]), c(1,0,0,0,1,0))
Z5 = outer(rep(1,N[5]), c(1,0,0,0,0,1))
Z6 = outer(rep(1,N[6]), c(1,-1,-1,-1,-1,-1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6)
ZZ = t(Z)%*%Z
SSEF <- rep(NA, n) -> SSER
HatF = Z%*%solve(ZZ, t(Z))
HatR = Z[,1]%*%t(Z[,1])/ZZ[1,1]
par(mfrow=c(3,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1","Cortex 2","Cortex 3","Cortex 4","Caudate",
  "Thalamus 1","Thalamus 2","Cerebellum 1","Cerebellum 2")
for(Loc in 1:9) {
  i = n.trt*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]],
            fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n)
  Y = t(Y)   # Y is now 26 x 128 FFTs
  # Calculation of Error Spectra
  for (k in 1:n) {
    SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
    SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
    SSEF[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%HatR%*%Y[,k])
    SSER[k] = SSY - SSReg }
  # Smooth
  sSSEF = filter(SSEF, rep(1/L, L), circular = TRUE)
  sSSER = filter(SSER, rep(1/L, L), circular = TRUE)
  eF = (den.df/num.df)*(sSSER-sSSEF)/sSSEF
  plot(Fr, eF[1:n.freq], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,7))
  abline(h=qf(.999, num.df, den.df), lty=2)
  text(.25, 6.5, loc.name[Loc], cex=1.2) }

An Analysis of Variance Model

The arrangement of treatments for the fMRI data in Figure 7.1 suggests more information might be available than was obtained from the simple equality of means test. Separate effects caused by state of consciousness as well as the separate treatments brush, heat, and shock might exist. The reduced signal present in the low shock mean suggests a possible interaction between the treatments and the level of consciousness.


[Figure 7.9: F statistic versus frequency (0 to 0.5) for each of the nine brain locations.]

Fig. 7.9. Frequency-dependent equality of means tests for fMRI data at nine brain locations; L = 3 and critical value F.001(30, 120) = 2.26.

The arrangement in the classical two-way table suggests looking at the analog of the two-factor analysis of variance as a function of frequency. In this case, we would obtain a different version of the regression model (7.81) of the form

y_{ijkt} = \mu_t + \alpha_{it} + \beta_{jt} + \gamma_{ijt} + v_{ijkt}   (7.85)

for the k-th individual undergoing the i-th level of some factor A and the j-th level of some other factor B, i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n_{ij}. The number of individuals in each cell can be different, as for the fMRI data in the next example. In the above model, we assume the response can be modeled as the sum of a mean, \mu_t, a row effect (type of stimulus), \alpha_{it}, a column effect (level of consciousness), \beta_{jt}, and an interaction, \gamma_{ijt}, with the usual restrictions


Table 7.4. Rows of the Design Matrix z_j' for the fMRI Data
(number of observations per cell in parentheses)

                 Awake                         Low Anesthesia
Brush    1  1  0   1   1   0   (5)     1  1  0  -1  -1   0   (3)
Heat     1  0  1   1   0   1   (4)     1  0  1  -1   0  -1   (5)
Shock    1 -1 -1   1  -1  -1   (5)     1 -1 -1  -1   1   1   (4)

\sum_i \alpha_{it} = \sum_j \beta_{jt} = \sum_i \gamma_{ijt} = \sum_j \gamma_{ijt} = 0

required for a full rank design matrix Z in the overall regression model (7.80). If the number of observations in each cell were the same, the usual simple analogous version of the power components (7.83) and (7.84) would exist for testing various hypotheses. In the case of (7.85), we are interested in testing hypotheses obtained by dropping one set of terms at a time out of (7.85), so an A factor (testing \alpha_{it} = 0), a B factor (\beta_{jt} = 0), and an interaction term (\gamma_{ijt} = 0) will appear as components in the analysis of power. Because of the unequal numbers of observations in each cell, we often put the model in the form of the regression model (7.78)-(7.80).
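As a quick check that this parameterization does give a full rank design, one can build Z from the rows of Table 7.4 and inspect its rank; a small sketch, using the cell sizes of the fMRI example that follows:

N = c(5,4,5,3,5,4)   # Awake-Brush, Awake-Heat, Awake-Shock, Low-Brush, Low-Heat, Low-Shock
Z = rbind(outer(rep(1,N[1]), c(1, 1, 0, 1, 1, 0)),
          outer(rep(1,N[2]), c(1, 0, 1, 1, 0, 1)),
          outer(rep(1,N[3]), c(1,-1,-1, 1,-1,-1)),
          outer(rep(1,N[4]), c(1, 1, 0,-1,-1, 0)),
          outer(rep(1,N[5]), c(1, 0, 1,-1, 0,-1)),
          outer(rep(1,N[6]), c(1,-1,-1,-1, 1, 1)))
qr(Z)$rank    # 6, so Z'Z is nonsingular and (7.80) can be fit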

Example 7.8 Analysis of Power Tests for the fMRI Series

For the fMRI data given as the means in Figure 7.1, a model of the form (7.85) is plausible and will yield more detailed information than the simple equality of means test described earlier. The results of that test, shown in Figure 7.9, were that the means were different for the four cortex locations and for the second cerebellum location. We may examine these differences further by testing whether the mean differences are because of the nature of the stimulus or the consciousness level, or perhaps due to an interaction between the two factors. Unequal numbers of observations exist in the cells that contributed the means in Figure 7.1. For the regression vector,

(\mu_t, \alpha_{1t}, \alpha_{2t}, \beta_{1t}, \gamma_{11t}, \gamma_{21t})',

the rows of the design matrix are as specified in Table 7.4. Note the restrictions given above for the parameters.

The results of testing the three hypotheses are shown in Figure 7.10 for the four cortex locations and the cerebellum, the components that showed some significant differences in the means in Figure 7.9. Again, the regression power components were smoothed over L = 3 frequencies. Appealing to the ANOPOW results summarized in Table 7.3 for each of the subhypotheses, q2 = 2 when either the stimulus effect or the interaction terms are dropped, and q2 = 1 when the consciousness effect is dropped. Hence, 2Lq2 = 12 or 6 for the two cases, with N = \sum_{ij} n_{ij} = 26 total observations. Here, the state of consciousness (Awake, Sedated) has the major effect at the signal frequency.


[Figure 7.10: F statistic versus frequency (0 to 0.25) in a 5 × 3 array of panels; rows are Cortex 1 through Cortex 4 and Cerebellum 2, columns are Stimulus, Consciousness, and Interaction.]

Fig. 7.10. Analysis of power for fMRI data at five locations, L = 3 and critical values F.001(12, 120) = 3.02 for the stimulus and interaction effects and F.001(6, 120) = 4.04 for the consciousness effect.

The level of stimulus was less significant at the signal frequency. A significant interaction occurred, however, at the ipsilateral component of the primary somatosensory cortex location.

The R code for this example is similar to Example 7.7.

n = 128; n.freq = 1 + n/2
Fr = (0:(n.freq-1))/n; nFr = 1:(n.freq/2)
N = c(5,4,5,3,5,4); n.subject = sum(N)
n.para = 6               # number of parameters
L = 3
df.stm = 2*L*(3-1)       # stimulus (3 levels: Brush, Heat, Shock)
df.con = 2*L*(2-1)       # consciousness (2 levels: Awake, Sedated)
df.int = 2*L*(3-1)*(2-1) # interaction
den.df = 2*L*(n.subject-n.para)   # df for full model
# Design Matrix:          mu  a1  a2   b  g1  g2
Z1 = outer(rep(1,N[1]), c(1,  1,  0,  1,  1,  0))
Z2 = outer(rep(1,N[2]), c(1,  0,  1,  1,  0,  1))
Z3 = outer(rep(1,N[3]), c(1, -1, -1,  1, -1, -1))
Z4 = outer(rep(1,N[4]), c(1,  1,  0, -1, -1,  0))
Z5 = outer(rep(1,N[5]), c(1,  0,  1, -1,  0, -1))
Z6 = outer(rep(1,N[6]), c(1, -1, -1, -1,  1,  1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6); ZZ = t(Z)%*%Z
rep(NA, n) -> SSEF -> SSE.stm -> SSE.con -> SSE.int
HatF = Z%*%solve(ZZ, t(Z))
Hat.stm = Z[,-(2:3)]%*%solve(ZZ[-(2:3),-(2:3)], t(Z[,-(2:3)]))
Hat.con = Z[,-4]%*%solve(ZZ[-4,-4], t(Z[,-4]))
Hat.int = Z[,-(5:6)]%*%solve(ZZ[-(5:6),-(5:6)], t(Z[,-(5:6)]))
par(mfrow=c(5,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1","Cortex 2","Cortex 3","Cortex 4","Caudate",
  "Thalamus 1","Thalamus 2","Cerebellum 1","Cerebellum 2")
for(Loc in c(1:4,9)) {   # only locations 1 to 4 and 9 are used
  i = 6*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]],
            fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n); Y = t(Y)
  for (k in 1:n) {
    SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
    SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
    SSEF[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.stm%*%Y[,k])
    SSE.stm[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.con%*%Y[,k])
    SSE.con[k] = SSY - SSReg
    SSReg = Re(Conj(t(Y[,k]))%*%Hat.int%*%Y[,k])
    SSE.int[k] = SSY - SSReg }
  # Smooth
  sSSEF    = filter(SSEF, rep(1/L, L), circular = TRUE)
  sSSE.stm = filter(SSE.stm, rep(1/L, L), circular = TRUE)
  sSSE.con = filter(SSE.con, rep(1/L, L), circular = TRUE)
  sSSE.int = filter(SSE.int, rep(1/L, L), circular = TRUE)
  eF.stm = (den.df/df.stm)*(sSSE.stm-sSSEF)/sSSEF
  eF.con = (den.df/df.con)*(sSSE.con-sSSEF)/sSSEF
  eF.int = (den.df/df.int)*(sSSE.int-sSSEF)/sSSEF
  plot(Fr[nFr], eF.stm[nFr], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.stm, den.df), lty=2)
  if(Loc==1) mtext("Stimulus", side=3, line=.3, cex=1)
  mtext(loc.name[Loc], side=2, line=3, cex=.9)
  plot(Fr[nFr], eF.con[nFr], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.con, den.df), lty=2)
  if(Loc==1) mtext("Consciousness", side=3, line=.3, cex=1)
  plot(Fr[nFr], eF.int[nFr], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,12))
  abline(h=qf(.999, df.int, den.df), lty=2)
  if(Loc==1) mtext("Interaction", side=3, line=.3, cex=1) }

Simultaneous Inference

In the previous examples involving the fMRI data, it would be helpful to focus on the components that contributed most to the rejection of the equal means hypothesis. One way to accomplish this is to develop a test for the significance of an arbitrary linear compound of the form

\Psi(\omega_k) = A^*(\omega_k) B(\omega_k),   (7.86)

where the components of the vector A(\omega_k) = (A_1(\omega_k), A_2(\omega_k), . . . , A_q(\omega_k))' are chosen in such a way as to isolate particular linear functions of parameters in the regression vector B(\omega_k) in the regression model (7.80). This argument suggests developing a test of the hypothesis \Psi(\omega_k) = 0 for all possible values of the linear coefficients in the compound (7.86), as is done in the conventional analysis of variance approach (see, for example, Scheffe, 1959).

Recalling the material involving the regression models of the form (7.50), the linear compound (7.86) can be estimated by

\hat{\Psi}(\omega_k) = A^*(\omega_k) \hat{B}(\omega_k),   (7.87)

where \hat{B}(\omega_k) is the estimated vector of regression coefficients given by (7.51) and independent of the error spectrum s^2_{y\cdot z}(\omega_k) in (7.53). It is possible to show the maximum of the ratio

F(A) = \frac{(N-q)}{q}\, \frac{|\hat{\Psi}(\omega_k) - \Psi(\omega_k)|^2}{s^2_{y\cdot z}(\omega_k)\, Q(A)},   (7.88)

where

Q(A) = A^*(\omega_k) S_z^{-1}(\omega_k) A(\omega_k)   (7.89)

is bounded by a statistic that has an F-distribution with 2q and 2(N − q) degrees of freedom. Testing the hypothesis that the compound has a particular value, usually \Psi(\omega_k) = 0, then proceeds naturally, by comparing the statistic (7.88) evaluated at the hypothesized value with the α level point on an F_{2q,2(N-q)} distribution. We can choose an infinite number of compounds of the form (7.86) and the test will still be valid at level α. As before, arguing the error spectrum is relatively constant over a band enables us to smooth the numerator and denominator of (7.88) separately over L frequencies so the distribution involving the smooth components is F_{2Lq,2L(N-q)}.
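In the cell-means setup used in Example 7.9 below, Z'Z is diagonal with the cell sizes on the diagonal, so the denominator term (7.89) reduces to a weighted sum of squared contrast coefficients ((7.91) in the example). A small sketch of forming one contrast and its Q(A) term, using the fMRI cell sizes; the particular contrast picks out brushing over the two states of consciousness:

N = c(5,4,5,3,5,4)
Z = matrix(0, sum(N), 6)                 # cell-means (indicator) design
Z[cbind(1:sum(N), rep(1:6, N))] = 1
ZZ = t(Z)%*%Z                            # diagonal matrix of cell sizes
A = c(1,0,0,1,0,0)                       # A1 = A4 = 1 selects the two brush cells
Q = t(A)%*%solve(ZZ, A)                  # = sum(A^2/N) = 1/5 + 1/3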


Example 7.9 Simultaneous Inference for the fMRI Series

As an example, consider the previous tests for significance of the fMRI factors, in which we have indicated the primary effects are among the stimuli but have not investigated which of the stimuli, heat, brushing, or shock, had the most effect. To analyze this further, consider the means model (7.81) and a 6 × 1 contrast vector of the form

\hat{\Psi} = A^*(\omega_k)\hat{B}(\omega_k) = \sum_{i=1}^{6} A_i^*(\omega_k)\, Y_{i\cdot}(\omega_k),   (7.90)

where the means are easily shown to be the regression coefficients in this particular case. In this case, the means are ordered by columns; the first three means are the three levels of stimuli for the awake state, and the last three means are the levels for the anesthetized state. In this special case, the denominator terms are

Q = \sum_{i=1}^{6} \frac{|A_i(\omega_k)|^2}{N_i},   (7.91)

with SSE(\omega_k) available in (7.84). In order to evaluate the effect of a particular stimulus, like brushing over the two levels of consciousness, we may take A_1(\omega_k) = A_4(\omega_k) = 1 for the two brush levels and A_i(\omega_k) = 0 otherwise. From Figure 7.11, we see that, at the first and third cortex locations, brush and heat are both significant, whereas the fourth cortex shows only brush and the second cerebellum shows only heat. Shock appears to be transmitted relatively weakly, when averaged over the awake and mildly anesthetized states.

The R code for this example is as follows.

n = 128; n.freq = 1 + n/2
Fr = (0:(n.freq-1))/n; nFr = 1:(n.freq/2)
N = c(5,4,5,3,5,4); n.subject = sum(N); L = 3
# Design Matrix
Z1 = outer(rep(1,N[1]), c(1,0,0,0,0,0))
Z2 = outer(rep(1,N[2]), c(0,1,0,0,0,0))
Z3 = outer(rep(1,N[3]), c(0,0,1,0,0,0))
Z4 = outer(rep(1,N[4]), c(0,0,0,1,0,0))
Z5 = outer(rep(1,N[5]), c(0,0,0,0,1,0))
Z6 = outer(rep(1,N[6]), c(0,0,0,0,0,1))
Z = rbind(Z1, Z2, Z3, Z4, Z5, Z6); ZZ = t(Z)%*%Z
# Contrasts: 6 by 3
A = rbind(diag(1,3), diag(1,3))
nq = nrow(A); num.df = 2*L*nq; den.df = 2*L*(n.subject-nq)
HatF = Z%*%solve(ZZ, t(Z))   # full model
rep(NA, n) -> SSEF -> SSER; eF = matrix(0,n,3)
par(mfrow=c(5,3), mar=c(3.5,4,0,0), oma=c(0,0,2,2), mgp=c(1.6,.6,0))
loc.name = c("Cortex 1", "Cortex 2", "Cortex 3", "Cortex 4",
  "Caudate", "Thalamus 1", "Thalamus 2", "Cerebellum 1", "Cerebellum 2")
cond.name = c("Brush", "Heat", "Shock")
for(Loc in c(1:4,9)) {
  i = 6*(Loc-1)
  Y = cbind(fmri[[i+1]], fmri[[i+2]], fmri[[i+3]], fmri[[i+4]],
            fmri[[i+5]], fmri[[i+6]])
  Y = mvfft(spec.taper(Y, p=.5))/sqrt(n); Y = t(Y)
  for (cond in 1:3){
    Q = t(A[,cond])%*%solve(ZZ, A[,cond])
    HR = A[,cond]%*%solve(ZZ, t(Z))
    for (k in 1:n){
      SSY = Re(Conj(t(Y[,k]))%*%Y[,k])
      SSReg = Re(Conj(t(Y[,k]))%*%HatF%*%Y[,k])
      SSEF[k] = (SSY-SSReg)*Q
      SSReg = HR%*%Y[,k]
      SSER[k] = Re(SSReg*Conj(SSReg)) }
    # Smooth
    sSSEF = filter(SSEF, rep(1/L, L), circular = TRUE)
    sSSER = filter(SSER, rep(1/L, L), circular = TRUE)
    eF[,cond] = (den.df/num.df)*(sSSER/sSSEF) }
  plot(Fr[nFr], eF[nFr,1], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Brush", side=3, line=.3, cex=1)
  mtext(loc.name[Loc], side=2, line=3, cex=.9)
  plot(Fr[nFr], eF[nFr,2], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Heat", side=3, line=.3, cex=1)
  plot(Fr[nFr], eF[nFr,3], type="l", xlab="Frequency",
       ylab="F Statistic", ylim=c(0,5))
  abline(h=qf(.999, num.df, den.df), lty=2)
  if(Loc==1) mtext("Shock", side=3, line=.3, cex=1) }

Multivariate Tests

Although it is possible to develop multivariate regression along lines analogous to the usual real valued case, we will only look at tests involving equality of group means and spectral matrices, because these tests appear to be used most often in applications. For these results, consider the p-variate time series y_{ijt} = (y_{ijt1}, . . . , y_{ijtp})' to have arisen from observations on j = 1, . . . , N_i individuals in group i, all having mean \mu_{it} and stationary autocovariance matrix \Gamma_i(h). Denote the DFTs of the group mean vectors as Y_{i\cdot}(\omega_k) and the p × p spectral matrices as f_i(\omega_k) for the i = 1, 2, . . . , I groups. Assume the same general properties as for the vector series considered in §7.3.


[Figure 7.11: F statistic versus frequency (0 to 0.25) in a 5 × 3 array of panels; rows are Cortex 1 through Cortex 4 and Cerebellum 2, columns are Brush, Heat, and Shock.]

Fig. 7.11. Power in simultaneous linear compounds at five locations, enhancing brush, heat, and shock effects, L = 3, F.001(36, 120) = 2.16.

In the multivariate case, we obtain the analogous versions of (7.83) and (7.84) as the between cross-power and within cross-power matrices

SPR(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} \big( Y_{i\cdot}(\omega_k) - Y_{\cdot\cdot}(\omega_k) \big)\big( Y_{i\cdot}(\omega_k) - Y_{\cdot\cdot}(\omega_k) \big)^*   (7.92)

and

SPE(\omega_k) = \sum_{i=1}^{I} \sum_{j=1}^{N_i} \big( Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k) \big)\big( Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k) \big)^*.   (7.93)

The equality of means test is rejected using the fact that the likelihood ratio test yields a monotone function of


\Lambda(\omega_k) = \frac{|SPE(\omega_k)|}{|SPE(\omega_k) + SPR(\omega_k)|}.   (7.94)

Khatri (1965) and Hannan (1970) give the approximate distribution of the statistic

\chi^2_{2(I-1)p} = -2 \Big( \sum_i N_i - I - p - 1 \Big) \log \Lambda(\omega_k)   (7.95)

as chi-squared with 2(I − 1)p degrees of freedom when the group means are equal.
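Once SPE(\omega_k) and SPR(\omega_k) are in hand, (7.94) and (7.95) are only a few lines of R. A minimal numerical sketch at a single frequency, with made-up 2 × 2 Hermitian cross-power matrices standing in for (7.92) and (7.93), and the 2 × 2 determinant written out directly:

# hypothetical between (SPR) and within (SPE) cross-power matrices at one frequency
SPE = matrix(c(4, 1-1i, 1+1i, 3), 2, 2)
SPR = matrix(c(2, .5-.2i, .5+.2i, 1), 2, 2)
I.grp = 2; p = 2; Nsum = 16                        # e.g., two groups of 8 series
det2 = function(A) Re(A[1,1]*A[2,2] - A[1,2]*A[2,1])
Lambda = det2(SPE)/det2(SPE + SPR)                 # (7.94)
chi.stat = -2*(Nsum - I.grp - p - 1)*log(Lambda)   # (7.95)
chi.stat; qchisq(.999, 2*(I.grp-1)*p)              # compare with the .001 critical value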

The case of I = 2 groups reduces to Hotelling's T^2, as has been shown by Giri (1965), where

T^2 = \frac{N_1 N_2}{(N_1 + N_2)} \big[ Y_{1\cdot}(\omega_k) - Y_{2\cdot}(\omega_k) \big]^* f_v^{-1}(\omega_k) \big[ Y_{1\cdot}(\omega_k) - Y_{2\cdot}(\omega_k) \big],   (7.96)

where

f_v(\omega_k) = \frac{SPE(\omega_k)}{\sum_i N_i - I}   (7.97)

is the pooled error spectrum given in (7.93), with I = 2. The test statistic, in this case, is

F_{2p,\,2(N_1+N_2-p-1)} = \frac{(N_1 + N_2 - p - 1)}{(N_1 + N_2 - 2)\,p}\, T^2,   (7.98)

which was shown by Giri (1965) to have the indicated limiting F-distribution with 2p and 2(N_1 + N_2 − p − 1) degrees of freedom when the means are the same. The classical t-test for inequality of two univariate means will be just (7.97) and (7.98) with p = 1.
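The conversion in (7.98) is only a rescaling of T^2, so the test is easy to carry out once T^2 has been computed at a frequency of interest; a small sketch in which the value of T^2 is hypothetical and the group sizes match the earthquake and explosion data of Example 7.10 below:

N1 = 8; N2 = 8; p = 2
T2 = 10                                    # hypothetical value of (7.96) at one frequency
Fstat = (N1+N2-p-1)/((N1+N2-2)*p) * T2     # (7.98)
Fstat > qf(.999, 2*p, 2*(N1+N2-p-1))       # compare with F.001(4, 26) = 7.36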

Testing equality of the spectral matrices is also of interest, not only for discrimination and pattern recognition, as considered in the next section, but also as a test indicating whether the equality of means test, which assumes equal spectral matrices, is valid. The test evolves from the likelihood ratio criterion, which compares the single group spectral matrices

\hat{f}_i(\omega_k) = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} \big( Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k) \big)\big( Y_{ij}(\omega_k) - Y_{i\cdot}(\omega_k) \big)^*   (7.99)

with the pooled spectral matrix (7.97). A modification of the likelihood ratio test, which incorporates the degrees of freedom M_i = N_i − 1 and M = \sum M_i rather than the sample sizes into the likelihood ratio statistic, uses

L'(\omega_k) = \frac{M^{Mp}}{\prod_{i=1}^{I} M_i^{M_i p}}\; \frac{\prod_i |M_i \hat{f}_i(\omega_k)|^{M_i}}{|M f_v(\omega_k)|^{M}}.   (7.100)

Krishnaiah et al. (1976) have given the moments of L'(\omega_k) and calculated 95% critical points for p = 3, 4 using a Pearson Type I approximation. For


reasonably large samples involving smoothed spectral estimators, the approximation involving the first term of the usual chi-squared series will suffice and Shumway (1982) has given

\chi^2_{(I-1)p^2} = -2r \log L'(\omega_k),   (7.101)

where

1 - r = \frac{(p+1)(p-1)}{6p(I-1)} \Big( \sum_i M_i^{-1} - M^{-1} \Big),   (7.102)

with an approximate chi-squared distribution with (I − 1)p^2 degrees of freedom when the spectral matrices are equal. Introduction of smoothing over L frequencies leads to replacing M_i and M by LM_i and LM in the equations above.
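For the earthquake and explosion setting of Example 7.10 below (I = 2 groups, p = 2, N_i = 8 series per group, smoothing over L = 21 frequencies), the correction factor and the critical value reduce to a couple of lines; a small sketch:

p = 2; I.grp = 2; L = 21; Ni = 8
Mi = L*(Ni-1); M = 2*Mi                              # smoothed degrees of freedom
r = 1 - ((p+1)*(p-1)/(6*p*(I.grp-1)))*(2/Mi - 1/M)   # (7.102)
qchisq(.999, (I.grp-1)*p^2)                          # critical value, approximately 18.47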

Of course, it is often of great interest to use the above result for testing equality of two univariate spectra, and it is obvious from the material in Chapter 4 that

F_{2LM_1,\,2LM_2} = \frac{\hat{f}_1(\omega)}{\hat{f}_2(\omega)}   (7.103)

will have the requisite F-distribution with 2LM_1 and 2LM_2 degrees of freedom when spectra are smoothed over L frequencies.
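A small sketch of (7.103) at a single frequency, with hypothetical smoothed spectral estimates in place of \hat{f}_1(\omega) and \hat{f}_2(\omega) and the degrees of freedom chosen as in Example 7.10 (L = 21, N_i = 8, so M_i = 7):

f1.hat = 2.5; f2.hat = 0.9          # hypothetical smoothed univariate spectra
L = 21; M1 = 7; M2 = 7
Fstat = f1.hat/f2.hat               # (7.103)
Fstat > qf(.999, 2*L*M1, 2*L*M2)    # reject equal spectra at level .001?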

Example 7.10 Equality of Means and Spectral Matrices for Earthquakes and Explosions

An interesting problem arises when attempting to develop a methodology for discriminating between waveforms originating from explosions and those that came from the more commonly occurring earthquakes. Figure 7.2 shows a small subset of a larger population of bivariate series consisting of two phases from each of eight earthquakes and eight explosions. If the large-sample approximations to normality hold for the DFTs of these series, it is of interest to know whether the differences between the two classes are better represented by the mean functions or by the spectral matrices. The tests described above can be applied to look at these two questions. The upper left panel of Figure 7.12 shows the test statistic (7.98) with the straight line denoting the critical level for α = .001, i.e., F.001(4, 26) = 7.36, for equal means using L = 1, and the test statistic remains well below its critical value at all frequencies, implying that the means of the two classes of series are not significantly different. Checking Figure 7.2 shows little reason exists to suspect that either the earthquakes or explosions have a nonzero mean signal. Checking the equality of the spectra and the spectral matrices, however, leads to a different conclusion. Some smoothing (L = 21) is useful here, and univariate tests on both the P and S components using (7.103) and N_1 = N_2 = 8 lead to strong rejections of the equal spectra hypotheses. The rejection seems stronger for the S component and we might tentatively identify that component as being dominant.


[Figure 7.12: four panels versus frequency (0 to 20 Hz): Equal Means (F statistic), Equal P-Spectra (F statistic), Equal S-Spectra (F statistic), and Equal Spectral Matrices (chi-squared statistic).]

Fig. 7.12. Tests for equality of means, spectra, and spectral matrices for the earthquake and explosion data; p = 2, L = 21, n = 1024 points at 40 points per second.

Testing equality of the spectral matrices using (7.101) and \chi^2_{.001}(4) = 18.47 shows a similar strong rejection of the equality of spectral matrices. We use these results to suggest optimal discriminant functions based on spectral differences in the next section.

The R code for this example is as follows. We make use of the recycling feature of R and the fact that the data are bivariate to produce simple code specific to this problem in order to avoid having to use multiple arrays.

P = 1:1024; S = P+1024; N = 8; n = 1024; p.dim = 2; m = 10; L = 2*m+1
eq.P = as.ts(eqexp[P,1:8]);  eq.S = as.ts(eqexp[S,1:8])
eq.m = cbind(rowMeans(eq.P), rowMeans(eq.S))
ex.P = as.ts(eqexp[P,9:16]); ex.S = as.ts(eqexp[S,9:16])
ex.m = cbind(rowMeans(ex.P), rowMeans(ex.S))
m.diff = mvfft(eq.m - ex.m)/sqrt(n)
eq.Pf = mvfft(eq.P - eq.m[,1])/sqrt(n)
eq.Sf = mvfft(eq.S - eq.m[,2])/sqrt(n)
ex.Pf = mvfft(ex.P - ex.m[,1])/sqrt(n)
ex.Sf = mvfft(ex.S - ex.m[,2])/sqrt(n)
# elements of the pooled spectral matrix (7.97); note the parentheses around the sums
fv11 = (rowSums(eq.Pf*Conj(eq.Pf)) + rowSums(ex.Pf*Conj(ex.Pf)))/(2*(N-1))
fv12 = (rowSums(eq.Pf*Conj(eq.Sf)) + rowSums(ex.Pf*Conj(ex.Sf)))/(2*(N-1))
fv22 = (rowSums(eq.Sf*Conj(eq.Sf)) + rowSums(ex.Sf*Conj(ex.Sf)))/(2*(N-1))
fv21 = Conj(fv12)
# Equal Means
T2 = rep(NA, 512)
for (k in 1:512){
  fvk = matrix(c(fv11[k], fv21[k], fv12[k], fv22[k]), 2, 2)
  dk = as.matrix(m.diff[k,])
  T2[k] = Re((N/2)*Conj(t(dk))%*%solve(fvk, dk)) }
eF = T2*(2*N-p.dim-1)/(2*p.dim*(N-1))   # F statistic (7.98)
par(mfrow=c(2,2), mar=c(3,3,2,1), mgp=c(1.6,.6,0), cex.main=1.1)
freq = 40*(0:511)/n   # Hz
plot(freq, eF, type="l", xlab="Frequency (Hz)", ylab="F Statistic",
     main="Equal Means")
abline(h=qf(.999, 2*p.dim, 2*(2*N-p.dim-1)))
# Equal P
kd = kernel("daniell", m)
u = Re(rowSums(eq.Pf*Conj(eq.Pf))/(N-1))
feq.P = kernapply(u, kd, circular=TRUE)
u = Re(rowSums(ex.Pf*Conj(ex.Pf))/(N-1))
fex.P = kernapply(u, kd, circular=TRUE)
plot(freq, feq.P[1:512]/fex.P[1:512], type="l", xlab="Frequency (Hz)",
     ylab="F Statistic", main="Equal P-Spectra")
abline(h=qf(.999, 2*L*(N-1), 2*L*(N-1)))
# Equal S
u = Re(rowSums(eq.Sf*Conj(eq.Sf))/(N-1))
feq.S = kernapply(u, kd, circular=TRUE)
u = Re(rowSums(ex.Sf*Conj(ex.Sf))/(N-1))
fex.S = kernapply(u, kd, circular=TRUE)
plot(freq, feq.S[1:512]/fex.S[1:512], type="l", xlab="Frequency (Hz)",
     ylab="F Statistic", main="Equal S-Spectra")
abline(h=qf(.999, 2*L*(N-1), 2*L*(N-1)))
# Equal Spectral Matrices
u = rowSums(eq.Pf*Conj(eq.Sf))/(N-1)
feq.PS = kernapply(u, kd, circular=TRUE)
u = rowSums(ex.Pf*Conj(ex.Sf))/(N-1)
fex.PS = kernapply(u, kd, circular=TRUE)
fv11 = kernapply(fv11, kd, circular=TRUE)
fv22 = kernapply(fv22, kd, circular=TRUE)
fv12 = kernapply(fv12, kd, circular=TRUE)
Mi = L*(N-1); M = 2*Mi
TS = rep(NA, 512)
for (k in 1:512){
  det.feq.k = Re(feq.P[k]*feq.S[k] - feq.PS[k]*Conj(feq.PS[k]))
  det.fex.k = Re(fex.P[k]*fex.S[k] - fex.PS[k]*Conj(fex.PS[k]))
  det.fv.k  = Re(fv11[k]*fv22[k] - fv12[k]*Conj(fv12[k]))
  log.n1 = log(M)*(M*p.dim); log.d1 = log(Mi)*(2*Mi*p.dim)
  log.n2 = log(Mi)*2 + log(det.feq.k)*Mi + log(det.fex.k)*Mi
  log.d2 = (log(M) + log(det.fv.k))*M
  r = 1 - ((p.dim+1)*(p.dim-1)/(6*p.dim*(2-1)))*(2/Mi - 1/M)   # (7.102)
  TS[k] = -2*r*(log.n1 + log.n2 - log.d1 - log.d2) }
plot(freq, TS, type="l", xlab="Frequency (Hz)", ylab="Chi-Sq Statistic",
     main="Equal Spectral Matrices")
abline(h=qchisq(.9999, p.dim^2))


7.7 Discrimination and Cluster Analysis

The extension of classical pattern-recognition techniques to experimental time series is a problem of great practical interest. A series of observations indexed in time often produces a pattern that may form a basis for discriminating between different classes of events. As an example, consider Figure 7.2, which shows regional (100-2000 km) recordings of several typical Scandinavian earthquakes and mining explosions measured by stations in Scandinavia. A listing of the events is given in Kakizawa et al. (1998). The problem of discriminating between mining explosions and earthquakes is a reasonable proxy for the problem of discriminating between nuclear explosions and earthquakes. This latter problem is one of critical importance for monitoring a comprehensive test-ban treaty. Time series classification problems are not restricted to geophysical applications, but occur under many and varied circumstances in other fields. Traditionally, the detection of a signal embedded in a noise series has been analyzed in the engineering literature by statistical pattern recognition techniques (see Problems 7.10 and 7.11).

The historical approaches to the problem of discriminating among different classes of time series can be divided into two distinct categories. The optimality approach, as found in the engineering and statistics literature, makes specific Gaussian assumptions about the probability density functions of the separate groups and then develops solutions that satisfy well-defined minimum error criteria. Typically, in the time series case, we might assume the difference between classes is expressed through differences in the theoretical mean and covariance functions and use likelihood methods to develop an optimal classification function. A second class of techniques, which might be described as a feature extraction approach, proceeds more heuristically by looking at quantities that tend to be good visual discriminators for well-separated populations and have some basis in physical theory or intuition. Less attention is paid to finding functions that are approximations to some well-defined optimality criterion.

As in the case of regression, both time domain and frequency domain approaches to discrimination will exist. For relatively short univariate series, a time domain approach that follows conventional multivariate discriminant analysis, as described in conventional multivariate texts such as Anderson (1984) or Johnson and Wichern (1992), may be preferable. We might even characterize differences by the autocovariance functions generated by different ARMA or state-space models. For longer multivariate time series that can be regarded as stationary after the common mean has been subtracted, the frequency domain approach will be easier computationally because the np-dimensional vector in the time domain, represented here as x = (x_1', x_2', . . . , x_n')', with x_t = (x_{t1}, . . . , x_{tp})', will reduce to separate computations made on the p-dimensional DFTs. This happens because of the approximate independence of the DFTs, X(\omega_k), 0 ≤ \omega_k ≤ 1, a property that we have often used in preceding chapters.


Finally, the grouping properties of measures like the discrimination information and likelihood-based statistics can be used to develop measures of disparity for clustering multivariate time series. In this section, we define a measure of disparity between two multivariate time series by the spectral matrices of the two processes and then apply hierarchical clustering and partitioning techniques to identify natural groupings within the bivariate earthquake and explosion populations.

The General Discrimination Problem

The general problem of classifying a vector time series x occurs in the following way. We observe a time series x known to belong to one of g populations, denoted by \Pi_1, \Pi_2, . . . , \Pi_g. The general problem is to assign or classify this observation into one of the g groups in some optimal fashion. An example might be the g = 2 populations of earthquakes and explosions shown in Figure 7.2. We would like to classify the unknown event, shown as NZ in the bottom two panels, as belonging to either the earthquake (\Pi_1) or explosion (\Pi_2) populations. To solve this problem, we need an optimality criterion that leads to a statistic T(x) that can be used to assign the NZ event to either the earthquake or explosion populations. To measure the success of the classification, we need to evaluate errors that can be expected in the future relating to the number of earthquakes classified as explosions (false alarms) and the number of explosions classified as earthquakes (missed signals).

The problem can be formulated by assuming the observed series x has a probability density p_i(x) when the observed series is from population \Pi_i for i = 1, . . . , g. Then, partition the space spanned by the np-dimensional process x into g mutually exclusive regions R_1, R_2, . . . , R_g such that, if x falls in R_i, we assign x to population \Pi_i. The misclassification probability is defined as the probability of classifying the observation into population \Pi_j when it belongs to \Pi_i, for j ≠ i, and would be given by the expression

P(j|i) = \int_{R_j} p_i(x)\, dx.   (7.104)

The overall total error probability depends also on the prior probabilities, say, \pi_1, \pi_2, . . . , \pi_g, of belonging to one of the g groups. For example, the probability that an observation x originates from \Pi_i and is then classified into \Pi_j is obviously \pi_i P(j|i), and the total error probability becomes

P_e = \sum_{i=1}^{g} \pi_i \sum_{j \neq i} P(j|i).   (7.105)

Although costs have not been incorporated into (7.105), it is easy to do so by multiplying P(j|i) by C(j|i), the cost of assigning a series from population \Pi_i to \Pi_j.

The overall error P_e is minimized by classifying x into \Pi_i if


\frac{p_i(x)}{p_j(x)} > \frac{\pi_j}{\pi_i}   (7.106)

for all j ≠ i (see, for example, Anderson, 1984). A quantity of interest, from the Bayesian perspective, is the posterior probability an observation belongs to population \Pi_i, conditional on observing x, say,

P(\Pi_i \mid x) = \frac{\pi_i p_i(x)}{\sum_j \pi_j p_j(x)}.   (7.107)

The procedure that classifies x into the population \Pi_i for which the posterior probability is largest is equivalent to that implied by using the criterion (7.106). The posterior probabilities give an intuitive idea of the relative odds of belonging to each of the plausible populations.

Many situations occur, such as in the classification of earthquakes and explosions, in which there are only g = 2 populations of interest. For two populations, the Neyman–Pearson lemma implies, in the absence of prior probabilities, classifying an observation into \Pi_1 when

\frac{p_1(x)}{p_2(x)} > K   (7.108)

minimizes each of the error probabilities for a fixed value of the other. The rule is identical to the Bayes rule (7.106) when K = \pi_2/\pi_1.

The theory given above takes a simple form when the vector x has a p-variate normal distribution with mean vectors \mu_j and covariance matrices \Sigma_j under \Pi_j for j = 1, 2, . . . , g. In this case, simply use

p_j(x) = (2\pi)^{-p/2} |\Sigma_j|^{-1/2} \exp\Big\{ -\tfrac{1}{2}(x - \mu_j)' \Sigma_j^{-1} (x - \mu_j) \Big\}.   (7.109)

The classification functions are conveniently expressed by quantities that are proportional to the logarithms of the densities, say,

g_j(x) = -\tfrac{1}{2} \ln|\Sigma_j| - \tfrac{1}{2} x' \Sigma_j^{-1} x + \mu_j' \Sigma_j^{-1} x - \tfrac{1}{2} \mu_j' \Sigma_j^{-1} \mu_j + \ln \pi_j.   (7.110)

In expressions involving the log likelihood, we will generally ignore terms involving the constant −ln 2\pi. For this case, we may assign an observation x to population \Pi_i whenever

g_i(x) > g_j(x)   (7.111)

for j ≠ i, j = 1, . . . , g, and the posterior probability (7.107) has the form

P(\Pi_i \mid x) = \frac{\exp\{g_i(x)\}}{\sum_j \exp\{g_j(x)\}}.

A common situation occurring in applications involves classification for g = 2 groups under the assumption of multivariate normality and equal covariance matrices; i.e., \Sigma_1 = \Sigma_2 = \Sigma. Then, the criterion (7.111) can be expressed in terms of the linear discriminant function


d_l(x) = g_1(x) - g_2(x) = (\mu_1 - \mu_2)' \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2) + \ln\frac{\pi_1}{\pi_2},   (7.112)

where we classify into \Pi_1 or \Pi_2 according to whether d_l(x) ≥ 0 or d_l(x) < 0. The linear discriminant function is clearly a combination of normal variables and, for the case \pi_1 = \pi_2 = .5, will have mean D^2/2 under \Pi_1 and mean −D^2/2 under \Pi_2, with variances given by D^2 under both hypotheses, where

D^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)   (7.113)

is the Mahalanobis distance between the mean vectors \mu_1 and \mu_2. In this case, the two misclassification probabilities (7.104) are

P(1|2) = P(2|1) = \Phi\Big( -\frac{D}{2} \Big),   (7.114)

and the performance is directly related to the Mahalanobis distance (7.113).
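Equations (7.112)-(7.114) translate directly into a few lines of R once the means and the common covariance matrix are available. A small sketch, using the sample quantities reported in Example 7.11 below as stand-ins for \mu_1, \mu_2, and \Sigma (the test point x is made up for illustration):

mu1 = c(.346, 1.024); mu2 = c(.922, .993)          # group means (Example 7.11)
Sig = matrix(c(.026, -.004, -.004, .010), 2, 2)    # pooled covariance matrix
D2 = t(mu1-mu2) %*% solve(Sig, mu1-mu2)            # Mahalanobis distance (7.113)
pnorm(-sqrt(D2)/2)                                 # common error rate (7.114), equal priors
x = c(.5, 1.0)                                     # a hypothetical feature vector to classify
dl = t(mu1-mu2) %*% solve(Sig, x) -
     .5*t(mu1-mu2) %*% solve(Sig, mu1+mu2)         # (7.112) with pi1 = pi2
ifelse(dl >= 0, "group 1 (EQ)", "group 2 (EX)")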

For the case in which the covariance matrices cannot be assumed to be the same, the discriminant function takes a different form, with the difference g_1(x) − g_2(x) taking the form

d_q(x) = -\tfrac{1}{2} \ln\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac{1}{2} x' \big( \Sigma_1^{-1} - \Sigma_2^{-1} \big) x + \big( \mu_1' \Sigma_1^{-1} - \mu_2' \Sigma_2^{-1} \big) x + \ln\frac{\pi_1}{\pi_2}   (7.115)

for g = 2 groups. This discriminant function differs from the equal covariance case in the linear term and in a nonlinear quadratic term involving the differing covariance matrices. The distribution theory is not tractable for the quadratic case so no convenient expression like (7.114) is available for the error probabilities for the quadratic discriminant function.

A difficulty in applying the above theory to real data is that the group mean vectors \mu_j and covariance matrices \Sigma_j are seldom known. Some engineering problems, such as the detection of a signal in white noise, assume the means and covariance parameters are known exactly, and this can lead to an optimal solution (see Problems 7.14 and 7.15). In the classical multivariate situation, it is possible to collect a sample of N_i training vectors from group \Pi_i, say, x_{ij}, for j = 1, . . . , N_i, and use them to estimate the mean vectors and covariance matrices for each of the groups i = 1, 2, . . . , g; i.e., simply choose the sample means \bar{x}_{i\cdot} and

S_i = (N_i - 1)^{-1} \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_{i\cdot})(x_{ij} - \bar{x}_{i\cdot})'   (7.116)

as the estimators for \mu_i and \Sigma_i, respectively. In the case in which the covariance matrices are assumed to be equal, simply use the pooled estimator


S = \Big( \sum_i N_i - g \Big)^{-1} \sum_i (N_i - 1) S_i.   (7.117)

For the case of a linear discriminant function, we may use

\hat{g}_i(x) = \bar{x}_{i\cdot}' S^{-1} x - \tfrac{1}{2} \bar{x}_{i\cdot}' S^{-1} \bar{x}_{i\cdot} + \ln \pi_i   (7.118)

as a simple estimator for g_i(x). For large samples, \bar{x}_{i\cdot} and S converge to \mu_i and \Sigma in probability, so \hat{g}_i(x) converges in distribution to g_i(x) in that case. The procedure works reasonably well for the case in which N_i, i = 1, . . . , g, are large, relative to the length of the series n, a case that is relatively rare in time series analysis. For this reason, we will resort to using spectral approximations for the case in which data are given as long time series.

The performance of sample discriminant functions can be evaluated in several different ways. If the population parameters are known, (7.113) and (7.114) can be evaluated directly. If the parameters are estimated, the estimated Mahalanobis distance \hat{D}^2 can be substituted for the theoretical value in very large samples. Another approach is to calculate the apparent error rates using the result of applying the classification procedure to the training samples. If n_{ij} denotes the number of observations from population \Pi_j classified into \Pi_i, the sample error rates can be estimated by the ratio

\hat{P}(i|j) = \frac{n_{ij}}{\sum_i n_{ij}}   (7.119)

for i ≠ j. If the training samples are not large, this procedure may be biased and a resampling option like cross-validation or the bootstrap can be employed. A simple version of cross-validation is the jackknife procedure proposed by Lachenbruch and Mickey (1968), which holds out the observation to be classified, deriving the classification function from the remaining observations. Repeating this procedure for each of the members of the training sample and computing (7.119) for the holdout samples leads to better estimators of the error rates.

Example 7.11 Discriminant Analysis Using Amplitudes from Earthquakes and Explosions

We can give a simple example of applying the above procedures to the logarithms of the amplitudes of the separate P and S components of the original earthquake and explosion traces. The logarithms (base 10) of the maximum peak-to-peak amplitudes of the P and S components, denoted by log10 P and log10 S, can be considered as two-dimensional feature vectors, say, x = (x_1, x_2)' = (log10 P, log10 S)', from a bivariate normal population with differing means and covariances. The original data, from Kakizawa et al. (1998), are shown in Figure 7.13. The figure includes the Novaya Zemlya (NZ) event of unknown origin. The tendency of the earthquakes to have higher values for log10 S, relative to log10 P, has been noted by many, and the use of the logarithm of the ratio, i.e., log10 P − log10 S, in some references (see Lay, 1997, pp. 40-41) is a tacit indicator that a linear function of the two parameters will be a useful discriminant.


[Figure 7.13: scatterplot of log mag(S) versus log mag(P) for the eight earthquakes (EQ1-EQ8), the eight explosions (EX1-EX8), and the NZ event, with the fitted discriminant line; the panel is titled "Classification Based on Magnitude Features".]

Fig. 7.13. Classification of earthquakes and explosions based on linear discriminant analysis using the magnitude features.

The sample means \bar{x}_{1\cdot} = (.346, 1.024)' and \bar{x}_{2\cdot} = (.922, .993)', and covariance matrices

S_1 = \begin{pmatrix} .026 & -.007 \\ -.007 & .010 \end{pmatrix}  and  S_2 = \begin{pmatrix} .025 & -.001 \\ -.001 & .010 \end{pmatrix}

are immediate from (7.116), with the pooled covariance matrix

S = \begin{pmatrix} .026 & -.004 \\ -.004 & .010 \end{pmatrix}

given by (7.117). Although the covariance matrices are not equal, we try the linear discriminant function anyway, which yields


(with equal prior probabilities \pi_1 = \pi_2 = .5) the sample discriminant functions

\hat{g}_1(x) = 30.668 x_1 + 111.411 x_2 - 62.401

and

\hat{g}_2(x) = 54.048 x_1 + 117.255 x_2 - 83.142

from (7.118), with the estimated linear discriminant function (7.112) as

\hat{d}_l(x) = -23.380 x_1 - 5.843 x_2 + 20.740.

The jackknifed posterior probabilities of being an earthquake for the earthquake group ranged from .621 to 1.000, whereas the explosion probabilities for the explosion group ranged from .717 to 1.000. The unknown event, NZ, was classified as an explosion, with posterior probability .960.

The R code for this example is as follows.

P = 1:1024; S = P+1024
mag.P = log10(apply(eqexp[P,], 2, max) - apply(eqexp[P,], 2, min))
mag.S = log10(apply(eqexp[S,], 2, max) - apply(eqexp[S,], 2, min))
eq.P = mag.P[1:8];  eq.S = mag.S[1:8]
ex.P = mag.P[9:16]; ex.S = mag.S[9:16]
NZ.P = mag.P[17];   NZ.S = mag.S[17]
# Compute linear discriminant function
cov.eq = var(cbind(eq.P, eq.S))
cov.ex = var(cbind(ex.P, ex.S))
cov.pooled = (cov.ex + cov.eq)/2
means.eq = colMeans(cbind(eq.P, eq.S))
means.ex = colMeans(cbind(ex.P, ex.S))
slopes.eq = solve(cov.pooled, means.eq)
inter.eq = -sum(slopes.eq*means.eq)/2
slopes.ex = solve(cov.pooled, means.ex)
inter.ex = -sum(slopes.ex*means.ex)/2
d.slopes = slopes.eq - slopes.ex
d.inter = inter.eq - inter.ex
# Classify new observation
new.data = cbind(NZ.P, NZ.S)
d = sum(d.slopes*new.data) + d.inter
post.eq = exp(d)/(1+exp(d))
# Print (discriminant function, posteriors) and plot results
cat(d.slopes[1], "mag.P +", d.slopes[2], "mag.S +", d.inter, "\n")
cat("P(EQ|data) =", post.eq, "  P(EX|data) =", 1-post.eq, "\n")
plot(eq.P, eq.S, xlim=c(0,1.5), ylim=c(.75,1.25), xlab="log mag(P)",
     ylab="log mag(S)", pch=8, cex=1.1, lwd=2,
     main="Classification Based on Magnitude Features")
points(ex.P, ex.S, pch=6, cex=1.1, lwd=2)
points(new.data, pch=3, cex=1.1, lwd=2)
abline(a=-d.inter/d.slopes[2], b=-d.slopes[1]/d.slopes[2])
text(eq.P-.07, eq.S+.005, label=names(eqexp[1:8]), cex=.8)
text(ex.P+.07, ex.S+.003, label=names(eqexp[9:16]), cex=.8)
text(NZ.P+.05, NZ.S+.003, label=names(eqexp[17]), cex=.8)
legend("topright", c("EQ","EX","NZ"), pch=c(8,6,3), pt.lwd=2, cex=1.1)
# Cross-validation
all.data = rbind(cbind(eq.P, eq.S), cbind(ex.P, ex.S))
post.eq <- rep(NA, 8) -> post.ex
for(j in 1:16) {
  if (j <= 8) {samp.eq = all.data[-c(j, 9:16),]
               samp.ex = all.data[9:16,]}
  if (j > 8)  {samp.eq = all.data[1:8,]
               samp.ex = all.data[-c(j, 1:8),]}
  df.eq = nrow(samp.eq)-1; df.ex = nrow(samp.ex)-1
  mean.eq = colMeans(samp.eq); mean.ex = colMeans(samp.ex)
  cov.eq = var(samp.eq); cov.ex = var(samp.ex)
  cov.pooled = (df.eq*cov.eq + df.ex*cov.ex)/(df.eq + df.ex)
  slopes.eq = solve(cov.pooled, mean.eq)
  inter.eq = -sum(slopes.eq*mean.eq)/2
  slopes.ex = solve(cov.pooled, mean.ex)
  inter.ex = -sum(slopes.ex*mean.ex)/2
  d.slopes = slopes.eq - slopes.ex
  d.inter = inter.eq - inter.ex
  d = sum(d.slopes*all.data[j,]) + d.inter
  if (j <= 8) post.eq[j] = exp(d)/(1+exp(d))
  if (j > 8)  post.ex[j-8] = 1/(1+exp(d)) }
Posterior = cbind(1:8, post.eq, 1:8, post.ex)
colnames(Posterior) = c("EQ","P(EQ|data)","EX","P(EX|data)")
round(Posterior, 3)   # results from cross-validation (not shown)

Frequency Domain Discrimination

The feature extraction approach often works well for discriminating between classes of univariate or multivariate series when there is a simple low-dimensional vector that seems to capture the essence of the differences between the classes. It still seems sensible, however, to develop optimal methods for classification that exploit the differences between the multivariate means and covariance matrices in the time series case. Such methods can be based on the Whittle approximation to the log likelihood given in §7.2. In this case, the vector DFTs, say, X(\omega_k), are assumed to be approximately normal, with means M_j(\omega_k) and spectral matrices f_j(\omega_k) for population \Pi_j at frequencies \omega_k = k/n, for k = 0, 1, . . . , [n/2], and are approximately uncorrelated at different frequencies, say, \omega_k and \omega_\ell for k ≠ \ell. Then, writing the complex normal densities as in §7.2 leads to a criterion similar to (7.110); namely,

g_j(X) = \ln \pi_j - \sum_{0 < \omega_k < 1/2} \big[ \ln|f_j(\omega_k)| + X^*(\omega_k) f_j^{-1}(\omega_k) X(\omega_k) - 2 M_j^*(\omega_k) f_j^{-1}(\omega_k) X(\omega_k) + M_j^*(\omega_k) f_j^{-1}(\omega_k) M_j(\omega_k) \big],   (7.120)


where the sum goes over frequencies for which |f_j(\omega_k)| ≠ 0. The periodicity of the spectral density matrix and DFT allows restricting the sum to 0 < \omega_k < 1/2. The classification rule is as in (7.111).

In the time series case, it is more likely the discriminant analysis involves assuming the covariance matrices are different and the means are equal. For example, the tests, shown in Figure 7.12, imply, for the earthquakes and explosions, the primary differences are in the bivariate spectral matrices and the means are essentially the same. For this case, it will be convenient to write the Whittle approximation to the log likelihood in the form

\ln p_j(X) = \sum_{0 < \omega_k < 1/2} \big[ -\ln|f_j(\omega_k)| - X^*(\omega_k) f_j^{-1}(\omega_k) X(\omega_k) \big],   (7.121)

where we have omitted the prior probabilities from the equation. The quadratic detector in this case can be written in the form

\ln p_j(X) = \sum_{0 < \omega_k < 1/2} \big[ -\ln|f_j(\omega_k)| - \mathrm{tr}\{ I(\omega_k) f_j^{-1}(\omega_k) \} \big],   (7.122)

where

I(\omega_k) = X(\omega_k) X^*(\omega_k)   (7.123)

denotes the periodogram matrix. For equal prior probabilities, we may assign an observation x into population \Pi_i whenever

\ln p_i(X) > \ln p_j(X)   (7.124)

for j ≠ i, j = 1, 2, . . . , g.

Numerous authors have considered various versions of discriminant analysis in the frequency domain. Shumway and Unger (1974) considered (7.120) for p = 1 and equal covariance matrices, so the criterion reduces to a simple linear one. They apply the criterion to discriminating between earthquakes and explosions using teleseismic P wave data in which the means over the two groups might be considered as fixed. Alagon (1989) and Dargahi-Noubary and Laycock (1981) considered discriminant functions of the form (7.120) in the univariate case when the means are zero and the spectra for the two groups are different. Taniguchi et al. (1994) adopted (7.121) as a criterion and discussed its non-Gaussian robustness. Shumway (1982) reviews general discriminant functions in both the univariate and multivariate time series cases.
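Computationally, the quadratic detector (7.122)-(7.124) amounts to accumulating −ln|f_j(\omega_k)| − tr{I(\omega_k) f_j^{-1}(\omega_k)} over the positive Fourier frequencies for each candidate group. A sketch for the bivariate (p = 2) case, in which f_j is assumed to be supplied as a list of 2 × 2 spectral matrices (one per frequency) and XX as the corresponding matrix of DFTs of the series to be classified; all names are illustrative and prior probabilities are taken to be equal:

# quadratic detector (7.122) for one candidate group (p = 2, equal priors)
whittle.score = function(fj, XX) {
  score = 0
  for (k in seq_along(fj)) {
    Ik = XX[,k] %*% Conj(t(XX[,k]))     # periodogram matrix (7.123)
    detk = Re(fj[[k]][1,1]*fj[[k]][2,2] - fj[[k]][1,2]*fj[[k]][2,1])
    score = score - log(detk) - Re(sum(diag(Ik %*% solve(fj[[k]]))))
  }
  score }
# classify into the group with the larger score, as in (7.124), e.g.,
# ifelse(whittle.score(f1, XX) > whittle.score(f2, XX), 1, 2)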

Measures of Disparity

Before proceeding to examples of discriminant and cluster analysis, it is useful to consider the relation to the Kullback–Leibler (K-L) discrimination information, as defined in Problem 2.4. Using the spectral approximation and noting the periodogram matrix has the approximate expectation


E_j I(\omega_k) = f_j(\omega_k)

under the assumption that the data come from population \Pi_j, and approximating the ratio of the densities by

\ln\frac{p_1(X)}{p_2(X)} = \sum_{0 < \omega_k < 1/2} \Big[ -\ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} - \mathrm{tr}\big\{ \big( f_2^{-1}(\omega_k) - f_1^{-1}(\omega_k) \big) I(\omega_k) \big\} \Big],

we may write the approximate discrimination information as

I(f_1; f_2) = \frac{1}{n} E_1 \ln\frac{p_1(X)}{p_2(X)} = \frac{1}{n} \sum_{0 < \omega_k < 1/2} \Big[ \mathrm{tr}\big\{ f_1(\omega_k) f_2^{-1}(\omega_k) \big\} - \ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} - p \Big].   (7.125)

The approximation may be carefully justified by noting the multivariate normal time series x = (x_1', x_2', . . . , x_n')', with zero means and np × np stationary covariance matrices \Gamma_1 and \Gamma_2, will have p^2 blocks, each n × n, with elements of the form \gamma^{(\ell)}_{ij}(s - t), s, t = 1, . . . , n, i, j = 1, . . . , p, for population \Pi_\ell, \ell = 1, 2. The discrimination information, under these conditions, becomes

I(1; 2 : x) = \frac{1}{n} E_1 \ln\frac{p_1(x)}{p_2(x)} = \frac{1}{n} \Big[ \mathrm{tr}\big\{ \Gamma_1 \Gamma_2^{-1} \big\} - \ln\frac{|\Gamma_1|}{|\Gamma_2|} - np \Big].   (7.126)

The limiting result

\lim_{n \to \infty} I(1; 2 : x) = \frac{1}{2} \int_{-1/2}^{1/2} \Big[ \mathrm{tr}\big\{ f_1(\omega) f_2^{-1}(\omega) \big\} - \ln\frac{|f_1(\omega)|}{|f_2(\omega)|} - p \Big] d\omega

has been shown, in various forms, by Pinsker (1964), Hannan (1970), and Kazakos and Papantoni-Kazakos (1980). The discrete version of (7.125) is just the approximation to the integral of the limiting form. The K-L measure of disparity is not a true distance, but it can be shown that I(1; 2) ≥ 0, with equality if and only if f_1(\omega) = f_2(\omega) almost everywhere. This result makes it potentially suitable as a measure of disparity between the two densities.
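The discrete form (7.125) is straightforward to evaluate once the two spectral matrices are available on a common frequency grid. A sketch for the bivariate case (p = 2), in which f1 and f2 are assumed to be lists of 2 × 2 spectral matrices over 0 < \omega_k < 1/2; the names and the list representation are illustrative:

KLdisp = function(f1, f2, n) {     # discrimination information I(f1; f2), (7.125)
  p = 2                            # bivariate case; determinants written out explicitly
  tot = 0
  for (k in seq_along(f1)) {
    det1 = Re(f1[[k]][1,1]*f1[[k]][2,2] - f1[[k]][1,2]*f1[[k]][2,1])
    det2 = Re(f2[[k]][1,1]*f2[[k]][2,2] - f2[[k]][1,2]*f2[[k]][2,1])
    tot = tot + Re(sum(diag(f1[[k]] %*% solve(f2[[k]])))) - log(det1/det2) - p
  }
  tot/n }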

A connection exists, of course, between the discrimination information number, which is just the expectation of the likelihood criterion, and the likelihood itself. For example, we may measure the disparity between the sample and the process defined by the theoretical spectrum f_j(\omega_k) corresponding to population \Pi_j in the sense of Kullback (1958), as I(\hat{f}; f_j), where

\hat{f}(\omega_k) = \sum_{\ell=-m}^{m} h_\ell\, I(\omega_k + \ell/n)   (7.127)

denotes the smoothed spectral matrix with weights \{h_\ell\}. The likelihood ratio criterion can be thought of as measuring the disparity between the periodogram and the theoretical spectrum for each of the populations. To make


the discrimination information finite, we replace the periodogram implied by the log likelihood by the sample spectrum. In this case, the classification procedure can be regarded as finding the population closest, in the sense of minimizing disparity between the sample and theoretical spectral matrices. The classification in this case proceeds by simply choosing the population \Pi_j that minimizes I(\hat{f}; f_j), i.e., assigning x to population \Pi_i whenever

I(\hat{f}; f_i) < I(\hat{f}; f_j)   (7.128)

for j ≠ i, j = 1, 2, . . . , g.

Kakizawa et al. (1998) proposed using the Chernoff (CH) information measure (Chernoff, 1952; Renyi, 1961), defined as

B_\alpha(1; 2) = -\ln E_2\Big\{ \Big( \frac{p_2(x)}{p_1(x)} \Big)^{\alpha} \Big\},   (7.129)

where the measure is indexed by a regularizing parameter α, for 0 < α < 1. When α = .5, the Chernoff measure is the symmetric divergence proposed by Bhattacharya (1943). For the multivariate normal case,

B_\alpha(1; 2 : x) = \frac{1}{n} \Big[ \ln\frac{|\alpha\Gamma_1 + (1-\alpha)\Gamma_2|}{|\Gamma_2|} - \alpha \ln\frac{|\Gamma_1|}{|\Gamma_2|} \Big].   (7.130)

The large sample spectral approximation to the Chernoff information measure is analogous to that for the discrimination information, namely,

B_\alpha(f_1; f_2) = \frac{1}{2n} \sum_{0 < \omega_k < 1/2} \Big[ \ln\frac{|\alpha f_1(\omega_k) + (1-\alpha) f_2(\omega_k)|}{|f_2(\omega_k)|} - \alpha \ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} \Big].   (7.131)

The Chernoff measure, when divided by α(1 − α), behaves like the discrimination information in the limit in the sense that it converges to I(1; 2 : x) for α → 0 and to I(2; 1 : x) for α → 1. Hence, near the boundaries of the parameter α, it tends to behave like discrimination information and for other values represents a compromise between the two information measures. The classification rule for the Chernoff measure reduces to assigning x to population \Pi_i whenever

B_\alpha(\hat{f}; f_i) < B_\alpha(\hat{f}; f_j)   (7.132)

for j ≠ i, j = 1, 2, . . . , g.

Although the classification rules above are well defined if the group spectral matrices are known, this will not be the case in general. If there are g training samples, x_{ij}, j = 1, . . . , N_i, i = 1, . . . , g, with N_i vector observations available in each group, the natural estimator for the spectral matrix of group i is just the average spectral matrix (7.99), namely, with \hat{f}_{ij}(\omega_k) denoting the estimated spectral matrix of series j from the i-th population,

Page 474: Time Series Analysis and Its Applications: With R Examples ... series analysis.pdf · Prof. Robert H. Shumway. To my wife, Ruth, for her support and joie de vivre, and to the memory

7.7 Discrimination and Cluster Analysis 461

fi(ωk) =1

Ni

Ni∑j=1

fij(ωk). (7.133)

A second consideration is the choice of the regularization parameter α for the Chernoff criterion, (7.131). For the case of g = 2 groups, it should be chosen to maximize the disparity between the two group spectra, as defined in (7.131). Kakizawa et al. (1998) simply plot (7.131) as a function of α, using the estimated group spectra in (7.133), and choose the value that gives the maximum disparity between the two groups.

Example 7.12 Discriminant Analysis on Seismic Data

The simplest approaches to discriminating between the earthquake and explosion groups have been based either on the relative amplitudes of the P and S phases, as in Figure 7.5, or on relative power components in various frequency bands. Considerable effort has been expended on using various spectral ratios involving the bivariate P and S phases as discrimination features. Kakizawa et al. (1998) mention a number of measures that have been used in the seismological literature as features. These features include ratios of power for the two phases and ratios of power components in high- and low-frequency bands. The use of such features of the spectrum suggests that an optimal procedure based on discriminating between the spectral matrices of two stationary processes would be reasonable. The fact that the hypothesis that the spectral matrices were equal, tested in Example 7.10, was soundly rejected also suggests the use of a discriminant function based on spectral differences. Recall the sampling rate is 40 points per second, leading to a folding frequency of 20 Hz.

Figure 7.14 displays the diagonal elements of the average spectral matrices for each group. The maximum value of the estimated Chernoff disparity Bα(f1; f2) occurs for α = .4, and we use that value in the discriminant criterion (7.131). Figure 7.15 shows the results of using the Chernoff differences along with the Kullback-Leibler differences for classification. The differences are the measures for earthquakes minus explosions, so negative values of the differences indicate earthquake and positive values indicate explosion. Hence, points in the first quadrant of Figure 7.15 are classified as explosions and points in the third quadrant are classified as earthquakes. We note that Explosion 6 is misclassified as an earthquake. Also, Earthquake 1, which falls in the fourth quadrant, has an uncertain classification: the Chernoff distance classifies it as an earthquake, whereas the Kullback-Leibler difference classifies it as an explosion.

The NZ event of unknown origin was also classified using these distance measures, and, as in Example 7.11, it is classified as an explosion. The Russians have asserted no mine blasting or nuclear testing occurred in the area in question, so the event remains somewhat of a mystery. The fact that it was relatively removed geographically from the test set may also have introduced some uncertainties into the procedure.

Fig. 7.14. Average P-spectra and S-spectra of the earthquake and explosion series, plotted against frequency in Hz. The insert on the upper right shows the smoothing kernel used; the resulting bandwidth is about .75 Hz.

Fig. 7.15. Classification (by quadrant) of earthquakes and explosions using the Chernoff and Kullback-Leibler differences.

The R code for this example is as follows.

P = 1:1024; S = P+1024; p.dim = 2; n = 1024
eq = as.ts(eqexp[, 1:8])
ex = as.ts(eqexp[, 9:16])
nz = as.ts(eqexp[, 17])
f.eq <- array(dim=c(8, 2, 2, 512)) -> f.ex
f.NZ = array(dim=c(2, 2, 512))
# below calculates determinant for 2x2 Hermitian matrix
det.c = function(mat){return(Re(mat[1,1]*mat[2,2]-mat[1,2]*mat[2,1]))}
L = c(15,13,5)  # for smoothing
for (i in 1:8){  # compute spectral matrices
  f.eq[i,,,] = mvspec(cbind(eq[P,i], eq[S,i]), spans=L, taper=.5)$fxx
  f.ex[i,,,] = mvspec(cbind(ex[P,i], ex[S,i]), spans=L, taper=.5)$fxx}
u = mvspec(cbind(nz[P], nz[S]), spans=L, taper=.5)
f.NZ = u$fxx
bndwidth = u$bandwidth*sqrt(12)*40  # about .75 Hz
fhat.eq = apply(f.eq, 2:4, mean)  # average spectra
fhat.ex = apply(f.ex, 2:4, mean)
# plot the average spectra
par(mfrow=c(2,2), mar=c(3,3,2,1), mgp = c(1.6,.6,0))
Fr = 40*(1:512)/n
plot(Fr, Re(fhat.eq[1,1,]), type="l", xlab="Frequency (Hz)", ylab="")
plot(Fr, Re(fhat.eq[2,2,]), type="l", xlab="Frequency (Hz)", ylab="")
plot(Fr, Re(fhat.ex[1,1,]), type="l", xlab="Frequency (Hz)", ylab="")
plot(Fr, Re(fhat.ex[2,2,]), type="l", xlab="Frequency (Hz)", ylab="")
mtext("Average P-spectra", side=3, line=-1.5, adj=.2, outer=TRUE)
mtext("Earthquakes", side=2, line=-1, adj=.8, outer=TRUE)
mtext("Average S-spectra", side=3, line=-1.5, adj=.82, outer=TRUE)
mtext("Explosions", side=2, line=-1, adj=.2, outer=TRUE)
par(fig = c(.75, 1, .75, 1), new = TRUE)
ker = kernel("modified.daniell", L)$coef; ker = c(rev(ker), ker[-1])
plot((-33:33)/40, ker, type="l", ylab="", xlab="", cex.axis=.7, yaxp=c(0,.04,2))
# choose alpha
Balpha = rep(0,19)
for (i in 1:19){ alf = i/20
  for (k in 1:256) {
    Balpha[i] = Balpha[i] + Re(log(det.c(alf*fhat.ex[,,k] +
      (1-alf)*fhat.eq[,,k])/det.c(fhat.eq[,,k])) -
      alf*log(det.c(fhat.ex[,,k])/det.c(fhat.eq[,,k])))} }
alf = which.max(Balpha)/20  # alpha = .4
# calculate information criteria
rep(0,17) -> KLDiff -> BDiff -> KLeq -> KLex -> Beq -> Bex
for (i in 1:17){
  if (i <= 8) f0 = f.eq[i,,,]
  if (i > 8 & i <= 16) f0 = f.ex[i-8,,,]
  if (i == 17) f0 = f.NZ
  for (k in 1:256) {  # only use freqs out to .25
    tr = Re(sum(diag(solve(fhat.eq[,,k], f0[,,k]))))
    KLeq[i] = KLeq[i] + tr + log(det.c(fhat.eq[,,k])) - log(det.c(f0[,,k]))
    Beq[i] = Beq[i] +
      Re(log(det.c(alf*f0[,,k]+(1-alf)*fhat.eq[,,k])/det.c(fhat.eq[,,k]))
      - alf*log(det.c(f0[,,k])/det.c(fhat.eq[,,k])))
    tr = Re(sum(diag(solve(fhat.ex[,,k], f0[,,k]))))
    KLex[i] = KLex[i] + tr + log(det.c(fhat.ex[,,k])) - log(det.c(f0[,,k]))
    Bex[i] = Bex[i] +
      Re(log(det.c(alf*f0[,,k]+(1-alf)*fhat.ex[,,k])/det.c(fhat.ex[,,k]))
      - alf*log(det.c(f0[,,k])/det.c(fhat.ex[,,k]))) }
  KLDiff[i] = (KLeq[i] - KLex[i])/n
  BDiff[i] = (Beq[i] - Bex[i])/(2*n) }
x.b = max(KLDiff)+.1; x.a = min(KLDiff)-.1
y.b = max(BDiff)+.01; y.a = min(BDiff)-.01
dev.new()
plot(KLDiff[9:16], BDiff[9:16], type="p", xlim=c(x.a,x.b), ylim=c(y.a,y.b),
  cex=1.1, lwd=2, xlab="Kullback-Leibler Difference",
  ylab="Chernoff Difference",
  main="Classification Based on Chernoff and K-L Distances", pch=6)
points(KLDiff[1:8], BDiff[1:8], pch=8, cex=1.1, lwd=2)
points(KLDiff[17], BDiff[17], pch=3, cex=1.1, lwd=2)
legend("topleft", legend=c("EQ", "EX", "NZ"), pch=c(8,6,3), pt.lwd=2)
abline(h=0, v=0, lty=2, col="gray")
text(KLDiff[-c(1,2,3,7,14)]-.075, BDiff[-c(1,2,3,7,14)],
  label=names(eqexp[-c(1,2,3,7,14)]), cex=.7)
text(KLDiff[c(1,2,3,7,14)]+.075, BDiff[c(1,2,3,7,14)],
  label=names(eqexp[c(1,2,3,7,14)]), cex=.7)

Cluster Analysis

For the purpose of clustering, it may be more useful to consider a symmetric disparity measure, and we introduce the J-Divergence measure

J(f1; f2) = I(f1; f2) + I(f2; f1)   (7.134)

and the symmetric Chernoff number

JBα(f1; f2) = Bα(f1; f2) + Bα(f2; f1)   (7.135)

for that purpose. In this case, we define the disparity between the sample spectral matrix of a single vector, x, and the population Πj as

J(f; fj) = I(f; fj) + I(fj; f)   (7.136)

and

JBα(f; fj) = Bα(f; fj) + Bα(fj; f),   (7.137)

respectively, and use these as quasi-distances between the vector and population Πj.

The measures of disparity can be used to cluster multivariate time series. The symmetric measures of disparity, as defined above, ensure that the disparity between fi and fj is the same as the disparity between fj and fi. Hence, we will consider the symmetric forms (7.136) and (7.137) as quasi-distances for the purpose of defining a distance matrix for input into one of the standard clustering procedures (see Johnson and Wichern, 1992). In general, we may consider either hierarchical or partitioned clustering methods using the quasi-distance matrix as an input.

For purposes of illustration, we may use the symmetric divergence (7.136), which implies the quasi-distance between sample series with estimated spectral matrices fi and fj would be (7.136); i.e.,

J(fi; fj) = (1/n) ∑_{0<ωk<1/2} [ tr{fi(ωk) fj^{-1}(ωk)} + tr{fj(ωk) fi^{-1}(ωk)} − 2p ],   (7.138)

for i ≠ j. We can also use the comparable form for the Chernoff divergence, but we may not want to make an assumption for the regularization parameter α.

For hierarchical clustering, we begin by clustering the two members of the population that minimize the disparity measure (7.138). Then, these two items form a cluster, and we can compute distances between unclustered items as before. The distance between unclustered items and a current cluster is defined here as the average of the distances to elements in the cluster. Again, we combine objects that are closest together. We may also compute the distance between the unclustered items and clustered items as the closest distance, rather than the average. Once a series is in a cluster, it stays there. At each stage, we have a fixed number of clusters, depending on the merging stage.
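In R, the hierarchical scheme just described can be run directly from a quasi-distance matrix; the sketch below (illustrative, not part of the original example) assumes JD is the 17 × 17 symmetric J-divergence matrix computed in Example 7.13 below and uses average linkage between clusters.

# a sketch of hierarchical clustering from a quasi-distance matrix JD
# (assumed symmetric with zero diagonal, as computed in Example 7.13)
hc = hclust(as.dist(JD), method="average")   # or method="single" for nearest neighbor
plot(hc, main="Hierarchical Clustering via J-Divergence", xlab="", sub="")
cutree(hc, k=2)   # cut the dendrogram into two groups

Cutting the dendrogram at two groups can then be compared with the partitioned (PAM) solution obtained in Example 7.13.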

Alternatively, we may think of clustering as a partitioning of the sample into a prespecified number of groups. MacQueen (1967) proposed this approach using k-means clustering, based on the Mahalanobis distance between an observation and the group mean vectors. At each stage, a reassignment of an observation into its closest affinity group is possible. To see how this procedure applies in the current context, consider a preliminary partition into a fixed number of groups and define the disparity between the spectral matrix of the observation, say, f, and the average spectral matrix of the group, say, fi, as J(f; fi), where the group spectral matrix can be estimated by (7.133). At any pass, a single series is reassigned to the group for which its disparity is minimized. The reassignment procedure is repeated until all observations stay in their current groups. Of course, the number of groups must be specified for each repetition of the partitioning algorithm, and a starting partition must be chosen. This assignment can either be random or chosen from a preliminary hierarchical clustering, as described above.

Example 7.13 Cluster Analysis for Earthquakes and Explosions

It is instructive to try a clustering procedure on the population of known earthquakes and explosions. Figure 7.16 shows the results of applying the Partitioning Around Medoids (PAM) clustering algorithm, which is essentially a robustification of the k-means procedure (see Ch. 2 of Kaufman & Rousseeuw, 1990), under the assumption that two groups are appropriate. The two-group partition tends to produce a final partition that agrees closely with the known configuration, with earthquake 1 (EQ1) and explosion 8 (EX8) being misclassified; as in previous examples, the NZ event is classified as an explosion.

Fig. 7.16. Clustering results for the earthquake and explosion series based on symmetric divergence using a robust version of k-means clustering with two groups. Circles indicate Group I classification, triangles indicate Group II classification.

The R code for this example uses the cluster package and our mvspec script for estimating spectral matrices.

library(cluster)
P = 1:1024; S = P+1024; p.dim = 2; n = 1024
eq = as.ts(eqexp[, 1:8])
ex = as.ts(eqexp[, 9:16])
nz = as.ts(eqexp[, 17])
f = array(dim=c(17, 2, 2, 512))
L = c(15, 15)  # for smoothing
for (i in 1:8){  # compute spectral matrices
  f[i,,,] = mvspec(cbind(eq[P,i], eq[S,i]), spans=L, taper=.5)$fxx
  f[i+8,,,] = mvspec(cbind(ex[P,i], ex[S,i]), spans=L, taper=.5)$fxx }
f[17,,,] = mvspec(cbind(nz[P], nz[S]), spans=L, taper=.5)$fxx
JD = matrix(0, 17, 17)
# calculate symmetric information criteria
for (i in 1:16){
 for (j in (i+1):17){
  for (k in 1:256) {  # only use freqs out to .25
    tr1 = Re(sum(diag(solve(f[i,,,k], f[j,,,k]))))
    tr2 = Re(sum(diag(solve(f[j,,,k], f[i,,,k]))))
    JD[i,j] = JD[i,j] + (tr1 + tr2 - 2*p.dim)}}}
JD = (JD + t(JD))/n
colnames(JD) = c(colnames(eq), colnames(ex), "NZ")
rownames(JD) = colnames(JD)
cluster.2 = pam(JD, k = 2, diss = TRUE)
summary(cluster.2)  # print results
par(mgp = c(1.6,.6,0), cex=3/4, cex.lab=4/3, cex.main=4/3)
clusplot(JD, cluster.2$cluster, col.clus=1, labels=3, lines=0, col.p=1,
  main="Clustering Results for Explosions and Earthquakes")
text(-7, -.5, "Group I", cex=1.1, font=2)
text(1, 5, "Group II", cex=1.1, font=2)


7.8 Principal Components and Factor Analysis

In this section, we introduce the related topics of spectral domain principal components analysis and factor analysis for time series. The topics of principal components and canonical analysis in the frequency domain are rigorously presented in Brillinger (1981, Chapters 9 and 10), and many of the details concerning these concepts can be found there.

The techniques presented here are related to each other in that they focus on extracting pertinent information from spectral matrices. This information is important because dealing directly with a high-dimensional spectral matrix f(ω) itself is somewhat cumbersome because it is a function into the set of complex, nonnegative-definite, Hermitian matrices. We can view these techniques as easily understood, parsimonious tools for exploring the behavior of vector-valued time series in the frequency domain with minimal loss of information. Because our focus is on spectral matrices, we assume for convenience that the time series of interest have zero means; the techniques are easily adjusted in the case of nonzero means.

In this and subsequent sections, it will be convenient to work occasionally with complex-valued time series. A p × 1 complex-valued time series can be represented as xt = x1t − i x2t, where x1t is the real part and x2t is the imaginary part of xt. The process is said to be stationary if E(xt) and E(x_{t+h} xt*) exist and are independent of time t. The p × p autocovariance function,

Γxx(h) = E(x_{t+h} xt*) − E(x_{t+h}) E(xt*),

of xt satisfies conditions similar to those of the real-valued case. Writing Γxx(h) = {γij(h)}, for i, j = 1, ..., p, we have (i) γii(0) ≥ 0 is real, (ii) |γij(h)|^2 ≤ γii(0) γjj(0) for all integers h, and (iii) Γxx(h) is a nonnegative-definite function. The spectral theory of complex-valued vector time series is analogous to the real-valued case. For example, if ∑_h ||Γxx(h)|| < ∞, the spectral density matrix of the complex series xt is given by

fxx(ω) = ∑_{h=−∞}^{∞} Γxx(h) exp(−2πihω).
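In practice, an estimate of such a spectral matrix is what the mvspec script used throughout this chapter returns in its fxx component; the short check below (an aside with arbitrary simulated data, not from the text) verifies the Hermitian, real-eigenvalue structure just described.

# a small check of the Hermitian structure of an estimated spectral matrix;
# the bivariate series is arbitrary simulated data
set.seed(1)
x   = cbind(arima.sim(list(ar=.7), 256), arima.sim(list(ar=-.5), 256))
fxx = mvspec(x, spans=c(5,5))$fxx     # 2 x 2 x 128 complex array
fk  = fxx[,,10]                        # the estimate at one frequency
Mod(fk - Conj(t(fk)))                  # essentially zero: fk is Hermitian
eigen(fk, symmetric=TRUE)$values       # real, nonnegative eigenvalues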

Principal Components

Classical principal component analysis (PCA) is concerned with explaining the variance–covariance structure among p variables, x = (x1, ..., xp)′, through a few linear combinations of the components of x. Suppose we wish to find a linear combination

y = c′x = c1 x1 + · · · + cp xp   (7.139)

of the components of x such that var(y) is as large as possible. Because var(y) can be increased by simply multiplying c by a constant, it is common to restrict c to be of unit length; that is, c′c = 1. Noting that var(y) = c′Σxx c, where Σxx is the p × p variance–covariance matrix of x, another way of stating the problem is to find c such that

max_{c≠0} c′Σxx c / c′c.   (7.140)

Denote the eigenvalue–eigenvector pairs of Σxx by {(λ1, e1), ..., (λp, ep)}, where λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0, and the eigenvectors are of unit length. The solution to (7.140) is to choose c = e1, in which case the linear combination y1 = e1′x has maximum variance, var(y1) = λ1. In other words,

max_{c≠0} c′Σxx c / c′c = e1′Σxx e1 / e1′e1 = λ1.   (7.141)

The linear combination, y1 = e1′x, is called the first principal component. Because the eigenvalues of Σxx are not necessarily unique, the first principal component is not necessarily unique.

The second principal component is defined to be the linear combination y2 = c′x that maximizes var(y2) subject to c′c = 1 and such that cov(y1, y2) = 0. The solution is to choose c = e2, in which case var(y2) = λ2. In general, the k-th principal component, for k = 1, 2, ..., p, is the linear combination yk = c′x that maximizes var(yk) subject to c′c = 1 and such that cov(yk, yj) = 0, for j = 1, 2, ..., k − 1. The solution is to choose c = ek, in which case var(yk) = λk.

One measure of the importance of a principal component is the proportion of the total variance attributed to that principal component. The total variance of x is defined to be the sum of the variances of the individual components; that is, var(x1) + · · · + var(xp) = σ11 + · · · + σpp, where σjj is the j-th diagonal element of Σxx. This sum is also denoted as tr(Σxx), the trace of Σxx. Because tr(Σxx) = λ1 + · · · + λp, the proportion of the total variance attributed to the k-th principal component is given simply by var(yk)/tr(Σxx) = λk / ∑_{j=1}^{p} λj.

Given a random sample x1, ..., xn, the sample principal components are defined as above, but with Σxx replaced by the sample variance–covariance matrix, Sxx = (n − 1)^{-1} ∑_{i=1}^{n} (xi − x̄)(xi − x̄)′. Further details can be found in the introduction to classical principal component analysis in Johnson and Wichern (1992, Chapter 9).
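As a quick numerical illustration of these definitions (not part of the original text), the sample principal components can be computed directly from the eigendecomposition of Sxx; the data matrix X below is an arbitrary placeholder.

# a minimal sketch of classical sample PCA via the eigendecomposition of Sxx
set.seed(2)
X   = matrix(rnorm(300), 100, 3) %*% matrix(c(1,.5,.2, 0,1,.4, 0,0,1), 3, 3)  # placeholder data
Sxx = var(X)                          # sample variance-covariance matrix
ed  = eigen(Sxx, symmetric=TRUE)
lam = ed$values; evecs = ed$vectors   # ordered eigenvalues and unit-length eigenvectors
y1  = X %*% evecs[,1]                 # first sample principal component
lam/sum(lam)                          # proportion of total variance per component
# princomp(X) or prcomp(X) give the same results up to sign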

For the case of time series, suppose we have a zero mean, p × 1, stationary vector process xt that has a p × p spectral density matrix given by fxx(ω). Recall fxx(ω) is a complex-valued, nonnegative-definite, Hermitian matrix. Using the analogy of classical principal components, and in particular (7.139) and (7.140), suppose, for a fixed value of ω, we want to find a complex-valued univariate process yt(ω) = c(ω)*xt, where c(ω) is complex, such that the spectral density of yt(ω) is maximized at frequency ω, and c(ω) is of unit length, c(ω)*c(ω) = 1. Because, at frequency ω, the spectral density of yt(ω) is fy(ω) = c(ω)*fxx(ω)c(ω), the problem can be restated as: find the complex vector c(ω) such that

max_{c(ω)≠0} c(ω)* fxx(ω) c(ω) / c(ω)* c(ω).   (7.142)

Let {(λ1(ω), e1(ω)), ..., (λp(ω), ep(ω))} denote the eigenvalue–eigenvector pairs of fxx(ω), where λ1(ω) ≥ λ2(ω) ≥ · · · ≥ λp(ω) ≥ 0, and the eigenvectors are of unit length. We note that the eigenvalues of a Hermitian matrix are real. The solution to (7.142) is to choose c(ω) = e1(ω), in which case the desired linear combination is yt(ω) = e1(ω)*xt. For this choice,

max_{c(ω)≠0} c(ω)* fxx(ω) c(ω) / c(ω)* c(ω) = e1(ω)* fxx(ω) e1(ω) / e1(ω)* e1(ω) = λ1(ω).   (7.143)

This process may be repeated for any frequency ω, and the complex-valued process, yt1(ω) = e1(ω)*xt, is called the first principal component at frequency ω. The k-th principal component at frequency ω, for k = 1, 2, ..., p, is the complex-valued time series ytk(ω) = ek(ω)*xt, in analogy to the classical case. In this case, the spectral density of ytk(ω) at frequency ω is fyk(ω) = ek(ω)* fxx(ω) ek(ω) = λk(ω).

The previous development of spectral domain principal components is related to the spectral envelope methodology first discussed in Stoffer et al. (1993). We will present the spectral envelope in the next section, where we motivate the use of principal components as it is presented above. Another way to motivate the use of principal components in the frequency domain was given in Brillinger (1981, Chapter 9). Although this technique leads to the same analysis, the motivation may be more satisfactory to the reader at this point. In this case, we suppose we have a stationary, p-dimensional, vector-valued process xt and we are only able to keep a univariate process yt such that, when needed, we may reconstruct the vector-valued process, xt, according to an optimality criterion.

Specifically, we suppose we want to approximate a mean-zero, stationary, vector-valued time series, xt, with spectral matrix fxx(ω), by a univariate process yt defined by

yt = ∑_{j=−∞}^{∞} c_{t−j}* xj,   (7.144)

where {cj} is a p × 1 vector-valued filter, such that {cj} is absolutely summable; that is, ∑_{j=−∞}^{∞} |cj| < ∞. The approximation is accomplished so the reconstruction of xt from yt, say,

x̂t = ∑_{j=−∞}^{∞} b_{t−j} yj,   (7.145)

where {bj} is an absolutely summable p × 1 filter, is such that the mean square approximation error

E{(xt − x̂t)* (xt − x̂t)}   (7.146)

is minimized. Let b(ω) and c(ω) be the transforms of {bj} and {cj}, respectively. For example,

c(ω) = ∑_{j=−∞}^{∞} cj exp(−2πijω),   (7.147)

and, consequently,

cj = ∫_{−1/2}^{1/2} c(ω) exp(2πijω) dω.   (7.148)

Brillinger (1981, Theorem 9.3.1) shows the solution to the problem is to choose c(ω) to satisfy (7.142) and to set b(ω) = c(ω). This is precisely the previous problem, with the solution given by (7.143). That is, we choose c(ω) = e1(ω) and b(ω) = e1(ω); the filter values can be obtained via the inversion formula given by (7.148). Using these results, in view of (7.144), we may form the first principal component series, say yt1.

This technique may be extended by requesting another series, say, yt2, for approximating xt with respect to minimum mean square error, but where the coherency between yt2 and yt1 is zero. In this case, we choose c(ω) = e2(ω). Continuing this way, we can obtain the first q ≤ p principal component series, say, yt = (yt1, ..., ytq)′, having spectral density fq(ω) = diag{λ1(ω), ..., λq(ω)}. The series ytk is the k-th principal component series.

As in the classical case, given observations, x1, x2, ..., xn, from the process xt, we can form an estimate f̂xx(ω) of fxx(ω) and define the sample principal component series by replacing fxx(ω) with f̂xx(ω) in the previous discussion. Precise details pertaining to the asymptotic (n → ∞) behavior of the principal component series and their spectra can be found in Brillinger (1981, Chapter 9). To give a basic idea of what we can expect, we focus on the first principal component series and on the spectral estimator obtained by smoothing the periodogram matrix, In(ωj); that is,

f̂xx(ωj) = ∑_{ℓ=−m}^{m} hℓ In(ωj + ℓ/n),   (7.149)

where L = 2m + 1 is odd and the weights are chosen so hℓ = h_{−ℓ} are positive and ∑_ℓ hℓ = 1. Under the conditions for which f̂xx(ωj) is a well-behaved estimator of fxx(ωj), and for which the largest eigenvalue of fxx(ωj) is unique,

{ ηn [λ̂1(ωj) − λ1(ωj)] / λ1(ωj);  ηn [ê1(ωj) − e1(ωj)];  j = 1, ..., J }   (7.150)

converges (n → ∞) jointly in distribution to independent, zero-mean normal distributions, the first of which is standard normal. In (7.150), ηn^{-2} = ∑_{ℓ=−m}^{m} hℓ^2, noting we must have L → ∞ and ηn → ∞, but L/n → 0 as n → ∞. The asymptotic variance–covariance matrix of ê1(ω), say, Σ_{e1}(ω), is given by

Σ_{e1}(ω) = ηn^{-2} λ1(ω) ∑_{ℓ=2}^{p} λℓ(ω) {λ1(ω) − λℓ(ω)}^{-2} eℓ(ω) eℓ*(ω).   (7.151)

The distribution of ê1(ω) depends on the other latent roots and vectors of fxx(ω). Writing e1(ω) = (e11(ω), e12(ω), ..., e1p(ω))′, we may use this result to form confidence regions for the components of e1 by approximating the distribution of

2 |ê_{1,j}(ω) − e_{1,j}(ω)|^2 / s_j^2(ω),   (7.152)

for j = 1, ..., p, by a χ² distribution with two degrees of freedom. In (7.152), s_j^2(ω) is the j-th diagonal element of Σ̂_{e1}(ω), the estimate of Σ_{e1}(ω). We can use (7.152) to check whether the value of zero is in the confidence region by comparing 2|ê_{1,j}(ω)|^2 / s_j^2(ω) with χ²₂(1 − α), the 1 − α upper tail cutoff of the χ²₂ distribution.

Fig. 7.17. The individual periodograms of xtk, for k = 1, ..., 8, in Example 7.14 (panels: cort1–cort4, thal1, thal2, cere1, cere2; horizontal axes in cycles).

Example 7.14 Principal Component Analysis of the fMRI Data

Recall Example 1.6 of Chapter 1, where the vector time series xt = (xt1, ..., xt8)′, t = 1, ..., 128, represents consecutive measures of average blood oxygenation level dependent (bold) signal intensity, which measures areas of activation in the brain. Recall subjects were given a non-painful brush on the hand and the stimulus was applied for 32 seconds and then stopped for 32 seconds; thus, the signal period is 64 seconds (the sampling rate was one observation every two seconds for 256 seconds). The series xtk for k = 1, 2, 3, 4 represent locations in cortex, series xt5 and xt6 represent locations in the thalamus, and xt7 and xt8 represent locations in the cerebellum.

As is evident from Figure 1.6 in Chapter 1, different areas of the brain are responding differently, and a principal component analysis may help in indicating which locations are responding with the most spectral power, and which locations do not contribute to the spectral power at the stimulus signal period. In this analysis, we will focus primarily on the signal period of 64 seconds, which translates to four cycles in 256 seconds or ω = 4/128 cycles per time point.

Figure 7.17 shows individual periodograms of the series xtk for k = 1, ..., 8. As was evident from Figure 1.6, a strong response to the brush stimulus occurred in areas of the cortex. To estimate the spectral density of xt, we used (7.149) with L = 5 and {h0 = 3/9, h±1 = 2/9, h±2 = 1/9}; this is a Daniell kernel with m = 1 passed twice. Calling the estimated spectrum f̂xx(j/128), for j = 0, 1, ..., 64, we can obtain the estimated spectrum of the first principal component series yt1 by calculating the largest eigenvalue, λ̂1(j/128), of f̂xx(j/128) for each j = 0, 1, ..., 64. The result, λ̂1(j/128), is shown in Figure 7.18. As expected, there is a large peak at the stimulus frequency 4/128, wherein λ̂1(4/128) = 2. The total power at the stimulus frequency is tr(f̂xx(4/128)) = 2.05, so the proportion of the power at frequency 4/128 attributed to the first principal component series is about 2/2.05, or roughly 98%. Because the first principal component explains nearly all of the total power at the stimulus frequency, there is no need to explore the other principal component series at this frequency.

Fig. 7.18. The estimated spectral density, λ̂1(j/128), of the first principal component series in Example 7.14.

Table 7.5. Magnitudes of the PC Vector at the Stimulus Frequency

Location         1    2    3    4    5    6     7    8
|ê1(4/128)|    .64  .36  .36  .22  .32  .05*  .13  .39

*Zero is in an approximate 99% confidence region for this component.

The estimated first principal component series at frequency 4/128 is given by yt1(4/128) = ê1(4/128)*xt, and the components of ê1(4/128) can give insight as to which locations of the brain are responding to the brush stimulus. Table 7.5 shows the magnitudes of ê1(4/128). In addition, an approximate 99% confidence interval was obtained for each component using (7.152). As expected, the analysis indicates that location 6 is not contributing to the power at this frequency, but surprisingly, the analysis suggests location 5 (thalamus 1) is responding to the stimulus.

The R code for this example is as follows.

n = 128; Per = abs(mvfft(fmri1[,-1]))^2/n
par(mfrow=c(2,4), mar=c(3,2,2,1), mgp = c(1.6,.6,0), oma=c(0,1,0,0))
for (i in 1:8) plot(0:20, Per[1:21,i], type="l", ylim=c(0,8),
  main=colnames(fmri1)[i+1], xlab="Cycles", ylab="", xaxp=c(0,20,5))
mtext("Periodogram", side=2, line=-.3, outer=TRUE, adj=c(.2,.8))
fxx = mvspec(fmri1[,-1], kernel("daniell", c(1,1)), taper=.5)$fxx
l.val = rep(NA,64)
for (k in 1:64) {
  u = eigen(fxx[,,k], symmetric=TRUE, only.values = TRUE)
  l.val[k] = u$values[1]}  # largest e-value
dev.new()
plot(l.val, type="l", xaxp=c(0,64,8), xlab="Cycles (Frequency x 128)",
  ylab="First Principal Component")
axis(1, seq(4,60,by=8), labels=FALSE)
# at freq 4/128
u = eigen(fxx[,,4], symmetric=TRUE)
lam = u$values; evec = u$vectors
lam[1]/sum(lam)  # % of variance explained
sig.e1 = matrix(0,8,8)
for (l in 2:5){  # last 3 evs are 0
  sig.e1 = sig.e1 + lam[l]*evec[,l]%*%Conj(t(evec[,l]))/(lam[1]-lam[l])^2}
sig.e1 = Re(sig.e1)*lam[1]*sum(kernel("daniell", c(1,1))$coef^2)
p.val = round(pchisq(2*abs(evec[,1])^2/diag(sig.e1), 2, lower.tail=FALSE), 3)
cbind(colnames(fmri1)[-1], abs(evec[,1]), p.val)  # table values


Factor Analysis

Classical factor analysis is similar to classical principal component analysis. Suppose x is a mean-zero, p × 1, random vector with variance–covariance matrix Σxx. The factor model proposes that x is dependent on a few unobserved common factors, z1, ..., zq, plus error. In this model, one hopes that q will be much smaller than p. The factor model is given by

x = Bz + ε,   (7.153)

where B is a p × q matrix of factor loadings, and z = (z1, ..., zq)′ is a random q × 1 vector of factors such that E(z) = 0 and E(zz′) = Iq, the q × q identity matrix. The p × 1 unobserved error vector ε is assumed to be independent of the factors, with zero mean and diagonal variance–covariance matrix D = diag{δ1², ..., δp²}. Note, (7.153) differs from the multivariate regression model in §5.7 because the factors, z, are unobserved. Equivalently, the factor model, (7.153), can be written in terms of the covariance structure of x,

Σxx = BB′ + D;   (7.154)

i.e., the variance–covariance matrix of x is the sum of a symmetric, nonnegative-definite rank q ≤ p matrix and a nonnegative-definite diagonal matrix. If q = p, then Σxx can be reproduced exactly as BB′, using the fact that Σxx = λ1 e1 e1′ + · · · + λp ep ep′, where (λi, ei) are the eigenvalue–eigenvector pairs of Σxx. As previously indicated, however, we hope q will be much smaller than p. Unfortunately, most covariance matrices cannot be factored as (7.154) when q is much smaller than p.

To motivate factor analysis, suppose the components of x can be grouped into meaningful groups. Within each group, the components are highly correlated, but the correlation between variables that are not in the same group is small. A group is supposedly formed by a single construct, represented as an unobservable factor, responsible for the high correlations within a group. For example, a person competing in a decathlon performs p = 10 athletic events, and we may represent the outcome of the decathlon as a 10 × 1 vector of scores. The events in a decathlon involve running, jumping, or throwing, and it is conceivable the 10 × 1 vector of scores might be able to be factored into q = 4 factors, (1) arm strength, (2) leg strength, (3) running speed, and (4) running endurance. The model (7.153) specifies that cov(x, z) = B, or cov(xi, zj) = bij, where bij is the ij-th component of the factor loading matrix B, for i = 1, ..., p and j = 1, ..., q. Thus, the elements of B are used to identify which hypothetical factors the components of x belong to, or load on.

At this point, some ambiguity is still associated with the factor model. Let Q be a q × q orthogonal matrix; that is, Q′Q = QQ′ = Iq. Let B* = BQ and z* = Q′z, so (7.153) can be written as

x = Bz + ε = BQQ′z + ε = B*z* + ε.   (7.155)

The model in terms of B* and z* fulfills all of the factor model requirements, for example, cov(z*) = Q′cov(z)Q = Q′Q = Iq, so

Σxx = B* cov(z*) B*′ + D = BQQ′B′ + D = BB′ + D.   (7.156)

Hence, on the basis of observations on x, we cannot distinguish between the loadings B and the rotated loadings B* = BQ. Typically, Q is chosen so the matrix B is easy to interpret, and this is the basis of what is called factor rotation.

Given a sample x1, ..., xn, a number of methods are used to estimate the parameters of the factor model, and we discuss two of them here. The first method is the principal component method. Let Sxx denote the sample variance–covariance matrix, and let (λ̂i, êi) be the eigenvalue–eigenvector pairs of Sxx. The p × q matrix of estimated factor loadings is found by setting

B̂ = [ λ̂1^{1/2} ê1 | λ̂2^{1/2} ê2 | · · · | λ̂q^{1/2} êq ].   (7.157)

The argument here is that if q factors exist, then

Sxx ≈ λ̂1 ê1 ê1′ + · · · + λ̂q êq êq′ = B̂B̂′,   (7.158)

because the remaining eigenvalues, λ̂_{q+1}, ..., λ̂p, will be negligible. The estimated diagonal matrix of error variances is then obtained by setting D̂ = diag{δ̂1², ..., δ̂p²}, where δ̂j² is the j-th diagonal element of Sxx − B̂B̂′.
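For concreteness (again an illustration with placeholder data, not from the text), the principal component method (7.157)-(7.158) takes only a few lines:

# a sketch of the principal component method for the factor model,
# with placeholder data X and q = 2 assumed factors
set.seed(3)
X    = matrix(rnorm(200*6), 200, 6)
Sxx  = var(X); q = 2
ed   = eigen(Sxx, symmetric=TRUE)
Bhat = ed$vectors[,1:q] %*% diag(sqrt(ed$values[1:q]))   # loadings, as in (7.157)
Dhat = diag(diag(Sxx - Bhat %*% t(Bhat)))                # error variances
round(Sxx - (Bhat %*% t(Bhat) + Dhat), 2)                # residual check of (7.154)

Maximum likelihood estimation, discussed next, is available in R through factanal.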

The second method, which can give answers that are considerably different from the principal component method, is maximum likelihood. Upon the further assumption that in (7.153), z and ε are multivariate normal, the log likelihood of B and D, ignoring a constant, is

−2 ln L(B, D) = n ln |Σxx| + ∑_{j=1}^{n} xj′ Σxx^{-1} xj.   (7.159)

The likelihood depends on B and D through (7.154), Σxx = BB′ + D. As discussed in (7.155)-(7.156), the likelihood is not well defined because B can be rotated. Typically, restricting BD^{-1}B′ to be a diagonal matrix is a computationally convenient uniqueness condition. The actual maximization of the likelihood is accomplished using numerical methods.

One obvious method of performing maximum likelihood for the Gaussian factor model is the EM algorithm. For example, suppose the factor vector z is known. Then, the factor model is simply the multivariate regression model given in §5.7; that is, write X′ = [x1, x2, ..., xn] and Z′ = [z1, z2, ..., zn], and note that X is n × p and Z is n × q. Then, the MLE of B is

B̂ = X′Z(Z′Z)^{-1} = ( n^{-1} ∑_{j=1}^{n} xj zj′ )( n^{-1} ∑_{j=1}^{n} zj zj′ )^{-1}  def=  Cxz Czz^{-1}   (7.160)

and the MLE of D is

D̂ = diag{ n^{-1} ∑_{j=1}^{n} (xj − B̂zj)(xj − B̂zj)′ };   (7.161)

that is, only the diagonal elements of the right-hand side of (7.161) are used. The bracketed quantity in (7.161) reduces to

Cxx − Cxz Czz^{-1} Cxz′,   (7.162)

where Cxx = n^{-1} ∑_{j=1}^{n} xj xj′.

Based on the derivation of the EM algorithm for the state-space model, (4.66)-(4.75), we conclude that, to employ the EM algorithm here, given the current parameter estimates, in Cxz, we replace xj zj′ by xj ẑj′, where ẑj = E(zj | xj), and in Czz, we replace zj zj′ by Pz + ẑj ẑj′, where Pz = var(zj | xj). Using the fact that the (p + q) × 1 vector (xj′, zj′)′ is multivariate normal with mean zero and variance–covariance matrix given by

[ BB′ + D   B  ]
[   B′      Iq ],   (7.163)

we have

ẑj ≡ E(zj | xj) = B′(BB′ + D)^{-1} xj   (7.164)

and

Pz ≡ var(zj | xj) = Iq − B′(BB′ + D)^{-1} B.   (7.165)
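The E- and M-steps above translate almost line for line into R; the following rough sketch (illustrative only, with placeholder data, crude starting values, and a fixed number of iterations rather than a convergence check) shows one way to organize the recursions (7.160)-(7.165).

# a rough sketch of EM for the Gaussian factor model, (7.160)-(7.165);
# X is placeholder data (n x p) and q factors are assumed
set.seed(4)
n = 200; p = 6; q = 2
X   = matrix(rnorm(n*p), n, p)
Cxx = crossprod(X)/n                      # Cxx
B   = matrix(.5, p, q); D = diag(p)       # crude starting values
for (iter in 1:50){
  W    = t(B) %*% solve(B %*% t(B) + D)   # B'(BB'+D)^{-1}
  Pz   = diag(q) - W %*% B                # (7.165)
  Zhat = X %*% t(W)                       # rows are zhat_j' from (7.164)
  Cxz  = crossprod(X, Zhat)/n             # Cxz with x_j zhat_j'
  Czz  = Pz + crossprod(Zhat)/n           # Czz with Pz + zhat_j zhat_j'
  B    = Cxz %*% solve(Czz)               # M-step, (7.160)
  D    = diag(diag(Cxx - Cxz %*% solve(Czz, t(Cxz))))   # M-step, (7.161)-(7.162)
}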

For time series, suppose xt is a stationary p × 1 process with p × p spectral matrix fxx(ω). Analogous to the classical model displayed in (7.154), we may postulate that at a given frequency of interest, ω, the spectral matrix of xt satisfies

fxx(ω) = B(ω)B(ω)* + D(ω),   (7.166)

where B(ω) is a complex-valued p × q matrix with rank(B(ω)) = q ≤ p and D(ω) is a real, nonnegative-definite, diagonal matrix. Typically, we expect q will be much smaller than p.

As an example of a model that gives rise to (7.166), let xt = (xt1, ..., xtp)′, and suppose

xtj = cj s_{t−τj} + εtj,   j = 1, ..., p,   (7.167)

where cj ≥ 0 are individual amplitudes and st is a common unobserved signal (factor) with spectral density fss(ω). The values τj are the individual phase shifts. Assume st is independent of εt = (εt1, ..., εtp)′ and the spectral matrix of εt, Dεε(ω), is diagonal. The DFT of xtj is given by

Xj(ω) = n^{-1/2} ∑_{t=1}^{n} xtj exp(−2πitω)

and, in terms of the model (7.167),

Xj(ω) = aj(ω) Xs(ω) + Xεj(ω),   (7.168)

where aj(ω) = cj exp(−2πiτjω), and Xs(ω) and Xεj(ω) are the respective DFTs of the signal st and the noise εtj. Stacking the individual elements of (7.168), we obtain a complex version of the classical factor model with one factor,

[ X1(ω) ]     [ a1(ω) ]            [ Xε1(ω) ]
[   ⋮    ]  =  [   ⋮    ] Xs(ω)  +  [   ⋮     ]
[ Xp(ω) ]     [ ap(ω) ]            [ Xεp(ω) ],

or, more succinctly,

X(ω) = a(ω) Xs(ω) + Xε(ω).   (7.169)

From (7.169), we can identify the spectral components of the model; that is,

fxx(ω) = b(ω)b(ω)* + Dεε(ω),   (7.170)

where b(ω) is a p × 1 complex-valued vector with b(ω)b(ω)* = a(ω) fss(ω) a(ω)*. Model (7.170) could be considered the one-factor model for time series. This model can be extended to more than one factor by adding other independent signals into the original model (7.167). More details regarding this and related models can be found in Stoffer (1999).

Example 7.15 Single Factor Analysis of the fMRI Data

The fMRI data analyzed in Example 7.14 is well suited for a single factor analysis using the model (7.167), or, equivalently, the complex-valued, single factor model (7.169). In terms of (7.167), we can think of the signal st as representing the brush stimulus signal. As before, the frequency of interest is ω = 4/128, which corresponds to a period of 32 time points, or 64 seconds.

A simple way to estimate the components b(ω) and Dεε(ω), as specified in (7.170), is to use the principal components method. Let f̂xx(ω) denote the estimate of the spectral density of xt = (xt1, ..., xt8)′ obtained in Example 7.14. Then, analogous to (7.157) and (7.158), we set

b̂(ω) = √λ̂1(ω) ê1(ω),

where (λ̂1(ω), ê1(ω)) is the first eigenvalue–eigenvector pair of f̂xx(ω). The diagonal elements of D̂εε(ω) are obtained from the diagonal elements of f̂xx(ω) − b̂(ω)b̂(ω)*. The appropriateness of the model can be assessed by checking that the elements of the residual matrix, f̂xx(ω) − [b̂(ω)b̂(ω)* + D̂εε(ω)], are negligible in magnitude.

Concentrating on the stimulus frequency, recall λ̂1(4/128) = 2. The magnitudes of ê1(4/128) are displayed in Table 7.5, indicating all locations load on the stimulus factor except for location 6, and location 7 could be considered borderline. The diagonal elements of f̂xx(ω) − b̂(ω)b̂(ω)* yield

D̂εε(4/128) = 0.001 × diag{1.36, 2.04, 6.22, 11.30, 0.73, 13.26, 6.93, 5.88}.

The magnitudes of the elements of the residual matrix at ω = 4/128 are

0.001 ×
  0.00  1.73  3.88  3.61  0.88  2.04  1.60  2.81
  2.41  0.00  1.17  3.77  1.49  5.58  3.68  4.21
  8.49  5.34  0.00  2.94  7.58 10.91  8.36 10.64
 12.65 11.84  6.12  0.00 12.56 14.64 13.34 16.10
  0.32  0.29  2.10  2.01  0.00  1.18  2.01  1.18
 10.34 16.69 17.09 15.94 13.49  0.00  5.78 14.74
  5.71  8.51  8.94 10.18  7.56  0.97  0.00  8.66
  6.25  8.00 10.31 10.69  5.95  8.69  7.64  0.00,

indicating the model fit is good. Assuming the results of the previous example are available, use the following R code.

bhat = sqrt(lam[1])*evec[,1]
Dhat = Re(diag(fxx[,,4] - bhat%*%Conj(t(bhat))))
res = Mod(fxx[,,4] - Dhat - bhat%*%Conj(t(bhat)))

A number of authors have considered factor analysis in the spectral domain, for example, Priestley et al. (1974), Priestley and Subba Rao (1975), Geweke (1977), and Geweke and Singleton (1981), to mention a few. An obvious extension of the simple model (7.167) is the factor model

xt = ∑_{j=−∞}^{∞} Λj s_{t−j} + εt,   (7.171)

where {Λj} is a real-valued p × q filter, st is a q × 1 stationary, unobserved signal with independent components, and εt is white noise. We assume the signal and noise processes are independent, st has the q × q real, diagonal spectral matrix fss(ω) = diag{fs1(ω), ..., fsq(ω)}, and εt has a real, diagonal, p × p spectral matrix given by Dεε(ω) = diag{fε1(ω), ..., fεp(ω)}. If, in addition, ∑ ||Λj|| < ∞, the spectral matrix of xt can be written as

fxx(ω) = Λ(ω) fss(ω) Λ(ω)* + Dεε(ω) = B(ω)B(ω)* + Dεε(ω),   (7.172)

where

Λ(ω) = ∑_{t=−∞}^{∞} Λt exp(−2πitω)   (7.173)

and B(ω) = Λ(ω) fss^{1/2}(ω). Thus, by (7.172), the model (7.171) is seen to satisfy the basic requirement of the spectral domain factor analysis model; that is, the p × p spectral density matrix of the process of interest, fxx(ω), is the sum of a rank q ≤ p matrix, B(ω)B(ω)*, and a real, diagonal matrix, Dεε(ω). For the purpose of identifiability we set fss(ω) = Iq for all ω, in which case B(ω) = Λ(ω). As in the classical case [see (7.156)], the model is specified only up to rotations; for details, see Bloomfield and Davis (1994).

Parameter estimation for the model (7.171), or equivalently (7.172), can be accomplished using the principal component method. Let f̂xx(ω) be an estimate of fxx(ω), and let (λ̂j(ω), êj(ω)), for j = 1, ..., p, be the eigenvalue–eigenvector pairs, in the usual order, of f̂xx(ω). Then, as in the classical case, the p × q matrix B is estimated by

B̂(ω) = [ λ̂1(ω)^{1/2} ê1(ω) | λ̂2(ω)^{1/2} ê2(ω) | · · · | λ̂q(ω)^{1/2} êq(ω) ].   (7.174)

The estimated diagonal spectral density matrix of errors is then obtained by setting D̂εε(ω) = diag{f̂ε1(ω), ..., f̂εp(ω)}, where f̂εj(ω) is the j-th diagonal element of f̂xx(ω) − B̂(ω)B̂(ω)*.

Alternatively, we can estimate the parameters by approximate likelihood methods. As in (7.169), let X(ωj) denote the DFT of the data x1, ..., xn at frequency ωj = j/n. Similarly, let Xs(ωj) and Xε(ωj) be the DFTs of the signal and of the noise processes, respectively. Then, under certain conditions (see Pawitan and Shumway, 1989), for ℓ = 0, ±1, ..., ±m,

X(ωj + ℓ/n) = Λ(ωj) Xs(ωj + ℓ/n) + Xε(ωj + ℓ/n) + o_as(n^{−α}),   (7.175)

where Λ(ωj) is given by (7.173) and o_as(n^{−α}) → 0 almost surely for some 0 ≤ α < 1/2 as n → ∞. In (7.175), the X(ωj + ℓ/n) are the DFTs of the data at the L odd frequencies {ωj + ℓ/n; ℓ = 0, ±1, ..., ±m} surrounding the central frequency of interest ωj = j/n.

Under appropriate conditions, {X(ωj + ℓ/n); ℓ = 0, ±1, ..., ±m} in (7.175) are approximately (n → ∞) independent, complex Gaussian random vectors with variance–covariance matrix fxx(ωj). The approximate likelihood is given by

−2 ln L(B(ωj), Dεε(ωj)) = n ln|fxx(ωj)| + ∑_{ℓ=−m}^{m} X*(ωj + ℓ/n) fxx^{-1}(ωj) X(ωj + ℓ/n),   (7.176)

with the constraint fxx(ωj) = B(ωj)B(ωj)* + Dεε(ωj). As in the classical case, we can use various numerical methods to maximize L(B(ωj), Dεε(ωj)) at every frequency, ωj, of interest. For example, the EM algorithm discussed for the classical case, (7.160)-(7.165), can easily be extended to this case.

Assuming fss(ω) = Iq, the estimate of B(ωj) is also the estimate of Λ(ωj). Calling this estimate Λ̂(ωj), the time domain filter can be estimated by

Λ̂ᴹt = M^{-1} ∑_{j=0}^{M−1} Λ̂(ωj) exp(2πijt/n),   (7.177)

for some 0 < M ≤ n, which is the discrete and finite version of the inversion formula given by

Λt = ∫_{−1/2}^{1/2} Λ(ω) exp(2πiωt) dω.   (7.178)

Note that we have used this approximation earlier in Chapter 4, (4.130), for estimating the time response of a frequency response function defined over a finite number of frequencies.

Fig. 7.19. The seasonally adjusted, quarterly growth rate (as percentages) of five macroeconomic series, unemployment, GNP, consumption, government investment, and private investment in the United States between 1948 and 1988, n = 160 values.
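As a small illustration of the finite inversion (7.177) (a sketch under assumed conventions, not code from the text), suppose Lam.f is a p × q × M complex array holding Λ̂(ωj) at the frequencies ωj = j/n, j = 0, ..., M − 1; the time domain filter at lag t can then be computed as follows.

# a sketch of the finite inversion formula (7.177);
# Lam.f is an assumed p x q x M complex array of Lambda-hat(omega_j)
Lambda.t = function(Lam.f, t, n){
  M   = dim(Lam.f)[3]
  out = matrix(0+0i, dim(Lam.f)[1], dim(Lam.f)[2])
  for (j in 0:(M-1)) out = out + Lam.f[,,j+1]*exp(2i*pi*j*t/n)
  Re(out)/M   # the filter is real-valued, so keep the real part
}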

Example 7.16 Government Spending, Private Investment, and Unemployment

Figure 7.19 shows the seasonally adjusted, quarterly growth rate (as percentages) of five macroeconomic series, unemployment, GNP, consumption, government investment, and private investment in the United States between 1948 and 1988, n = 160 values. These data are analyzed in the time domain by Young and Pedregal (1998), who were investigating how government spending and private capital investment influenced the rate of unemployment.

Fig. 7.20. The individual estimated spectra (scaled by 1000) of each series shown in Figure 7.19 in terms of the number of cycles in 160 quarters.

Spectral estimation was performed on the detrended, standardized, and tapered growth rate values; see the R code at the end of this example for details. Figure 7.20 shows the individual estimated spectra of each series. We focus on three interesting frequencies. First, we note the lack of spectral power near the annual cycle (ω = 1, or one cycle every four quarters), indicating the data have been seasonally adjusted. In addition, because of the seasonal adjustment, some spectral power appears near the seasonal frequency; this is a distortion apparently caused by the method of seasonally adjusting the data. Next, we note the spectral power near ω = .25, or one cycle every four years, in unemployment, GNP, consumption, and, to a lesser degree, in private investment. Finally, spectral power appears near ω = .125, or one cycle every eight years, in government investment, and perhaps to lesser degrees in unemployment, GNP, and consumption.

Figure 7.21 shows the coherences among the various series. At the frequencies of interest, ω = .125 and .25, pairwise, GNP, Unemployment, Consumption, and Private Investment (except for Unemployment and Private Investment) are coherent. Government Investment is either not coherent or minimally coherent with the other series.

Fig. 7.21. The squared coherencies between the various series displayed in Figure 7.19.

Figure 7.22 shows λ̂1(ω) and λ̂2(ω), the first and second eigenvalues of the estimated spectral matrix f̂xx(ω). These eigenvalues suggest the first factor is identified by the frequency of one cycle every four years, whereas the second factor is identified by the frequency of one cycle every eight years. The moduli of the corresponding eigenvectors at the frequencies of interest, ê1(10/160) and ê2(5/160), are shown in Table 7.6. These values confirm Unemployment, GNP, Consumption, and Private Investment load on the first factor, and Government Investment loads on the second factor. The remainder of the details involving the factor analysis of these data is left as an exercise.

Table 7.6. Magnitudes of the Eigenvectors in Example 7.16

                  Unemp   GNP   Cons   G. Inv.   P. Inv.
|ê1(10/160)|      0.53   0.50   0.51    0.06      0.44
|ê2(5/160)|       0.19   0.14   0.23    0.93      0.16

Fig. 7.22. The first, λ̂1(ω), and second, λ̂2(ω), eigenvalues of the estimated spectral matrix f̂xx(ω). The vertical dashed lines are shown at ω = .25 and .125, respectively.

The following code was used to perform the analysis in R.

gr = diff(log(ts(econ5, start=1948, frequency=4)))  # growth rate
plot(100*gr, main="Growth Rates (%)")
# scale each series to have variance 1
gr = ts(apply(gr, 2, scale), freq=4)  # scaling strips ts attributes
L = c(7,7)  # degree of smoothing
gr.spec = mvspec(gr, spans=L, demean=FALSE, detrend=FALSE, taper=.25)
dev.new()
plot(kernel("modified.daniell", L))  # view the kernel - not shown
dev.new()
plot(gr.spec, log="no", col=1, main="Individual Spectra", lty=1:5, lwd=2)
legend("topright", colnames(econ5), lty=1:5, lwd=2)
dev.new()
plot.spec.coherency(gr.spec, ci=NA, main="Squared Coherencies")
# PCs
n.freq = length(gr.spec$freq)
lam = matrix(0, n.freq, 5)
for (k in 1:n.freq) lam[k,] = eigen(gr.spec$fxx[,,k], symmetric=TRUE,
  only.values=TRUE)$values
dev.new()
par(mfrow=c(2,1), mar=c(4,2,2,1), mgp=c(1.6,.6,0))
plot(gr.spec$freq, lam[,1], type="l", ylab="", xlab="Frequency",
  main="First Eigenvalue")
abline(v=.25, lty=2)
plot(gr.spec$freq, lam[,2], type="l", ylab="", xlab="Frequency",
  main="Second Eigenvalue")
abline(v=.125, lty=2)
e.vec1 = eigen(gr.spec$fxx[,,10], symmetric=TRUE)$vectors[,1]
e.vec2 = eigen(gr.spec$fxx[,,5], symmetric=TRUE)$vectors[,2]
round(Mod(e.vec1), 2); round(Mod(e.vec2), 3)

7.9 The Spectral Envelope

The concept of the spectral envelope for the spectral analysis and scaling of categorical time series was first introduced in Stoffer et al. (1993). Since then, the idea has been extended in various directions (not only restricted to categorical time series), and we will explore these problems as well. First, we give a brief introduction to the concept of scaling time series.

The spectral envelope was motivated by collaborations with researchers who collected categorical-valued time series with an interest in the cyclic behavior of the data. For example, Table 7.7 shows the per-minute sleep state of an infant taken from a study on the effects of prenatal exposure to alcohol. Details can be found in Stoffer et al. (1988), but, briefly, an electroencephalographic (EEG) sleep recording of approximately two hours is obtained on a full-term infant 24 to 36 hours after birth, and the recording is scored by a pediatric neurologist for sleep state. Sleep state is categorized, per minute, into one of six possible states: qt: quiet sleep - trace alternant, qh: quiet sleep - high voltage, tr: transitional sleep, al: active sleep - low voltage, ah: active sleep - high voltage, and aw: awake. This particular infant was never awake during the study.

Table 7.7. Infant EEG Sleep-States (per minute)
(read down and across)

ah qt qt al tr qt al ah
ah qt qt ah tr qt al ah
ah qt tr ah tr qt al ah
ah qt al ah qh qt al ah
ah qt al ah qh qt al ah
ah tr al ah qt qt al ah
ah qt al ah qt qt al ah
ah qt al ah qt qt al ah
tr qt tr tr qt qt al tr
ah qt ah tr qt tr al
tr qt al ah qt al al
ah qt al ah qt al al
ah qt al ah qt al al
qh qt al ah qt al ah

It is not difficult to notice a pattern in the data if we concentrate on active vs. quiet sleep (that is, focus on the first letter). But it would be difficult to try to assess patterns in a longer sequence, or if more categories were present, without some graphical aid. One simple method would be to scale the data, that is, assign numerical values to the categories and then draw a time plot of the scaled series. Because the states have an order, one obvious scaling is

1 = qt 2 = qh 3 = tr 4 = al 5 = ah 6 = aw, (7.179)

and Figure 7.23 shows the time plot using this scaling. Another interesting scaling might be to combine the quiet states and the active states:

1 = qt 1 = qh 2 = tr 3 = al 3 = ah 4 = aw. (7.180)

The time plot using (7.180) would be similar to Figure 7.23 as far as the cyclic (in and out of quiet sleep) behavior of this infant's sleep pattern is concerned. Figure 7.23 also shows the periodogram of the sleep data using the scaling in (7.179). A large peak exists at the frequency corresponding to one cycle every 60 minutes. As we might imagine, the general appearance of the periodogram using the scaling (7.180) (not shown) is similar to Figure 7.23. Most of us would feel comfortable with this analysis even though we made an arbitrary and ad hoc choice about the particular scaling. It is evident from the data (without any scaling) that if the interest is in infant sleep cycling, this particular sleep study indicates an infant cycles between active and quiet sleep at a rate of about one cycle per hour.
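To make the scaling idea concrete, the following small R sketch (ours, not part of the original analysis) applies the scaling in (7.179) to a short, hypothetical character vector of per-minute states like those in Table 7.7 and then plots the scaled series and its periodogram; the sleep data themselves are not assumed to be available as an R object here.

sleep = c("ah","ah","tr","qt","qt","qh","tr","al","ah","ah")  # hypothetical stand-in data
scales = c(qt=1, qh=2, tr=3, al=4, ah=5, aw=6)                # the scaling in (7.179)
x = scales[sleep]                                             # scaled series
par(mfrow=c(2,1))
plot.ts(x, ylab="Sleep-State")                                # compare Figure 7.23 (top)
spec.pgram(x, taper=0, log="no")                              # raw periodogram (bottom)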

The intuition used in the previous example is lost when we consider a long DNA sequence. Briefly, a DNA strand can be viewed as a long string of linked nucleotides. Each nucleotide is composed of a nitrogenous base, a five-carbon sugar, and a phosphate group. There are four different bases, and they can be grouped by size; the pyrimidines, thymine (T) and cytosine (C), and the purines, adenine (A) and guanine (G). The nucleotides are linked together by a backbone of alternating sugar and phosphate groups with the 5′ carbon of one sugar linked to the 3′ carbon of the next, giving the string direction. DNA molecules occur naturally as a double helix composed of polynucleotide strands with the bases facing inwards. The two strands are complementary, so


Fig. 7.23. [Top] Time plot of the EEG sleep state data in Table 7.7 using the scaling in (7.179). [Bottom] Periodogram of the EEG sleep state data based on the scaling in (7.179). The peak corresponds to a frequency of approximately one cycle every 60 minutes.

it is sufficient to represent a DNA molecule by a sequence of bases on a single strand. Thus, a strand of DNA can be represented as a sequence of letters, termed base pairs (bp), from the finite alphabet {A, C, G, T}. The order of the nucleotides contains the genetic information specific to the organism. Expression of information stored in these molecules is a complex multistage process. One important task is to translate the information stored in the protein-coding sequences (CDS) of the DNA. A common problem in analyzing long DNA sequence data is in identifying CDS dispersed throughout the sequence and separated by regions of noncoding (which makes up most of the DNA). Table 7.8 shows part of the Epstein–Barr virus (EBV) DNA sequence. The entire EBV DNA sequence consists of approximately 172,000 bp.


Table 7.8. Part of the Epstein–Barr Virus DNA Sequence

(read across and down)

AGAATTCGTC TTGCTCTATT CACCCTTACT TTTCTTCTTG CCCGTTCTCT TTCTTAGTAT

GAATCCAGTA TGCCTGCCTG TAATTGTTGC GCCCTACCTC TTTTGGCTGG CGGCTATTGC

CGCCTCGTGT TTCACGGCCT CAGTTAGTAC CGTTGTGACC GCCACCGGCT TGGCCCTCTC

ACTTCTACTC TTGGCAGCAG TGGCCAGCTC ATATGCCGCT GCACAAAGGA AACTGCTGAC

ACCGGTGACA GTGCTTACTG CGGTTGTCAC TTGTGAGTAC ACACGCACCA TTTACAATGC

ATGATGTTCG TGAGATTGAT CTGTCTCTAA CAGTTCACTT CCTCTGCTTT TCTCCTCAGT

CTTTGCAATT TGCCTAACAT GGAGGATTGA GGACCCACCT TTTAATTCTC TTCTGTTTGC

ATTGCTGGCC GCAGCTGGCG GACTACAAGG CATTTACGGT TAGTGTGCCT CTGTTATGAA

ATGCAGGTTT GACTTCATAT GTATGCCTTG GCATGACGTC AACTTTACTT TTATTTCAGT

TCTGGTGATG CTTGTGCTCC TGATACTAGC GTACAGAAGG AGATGGCGCC GTTTGACTGT

TTGTGGCGGC ATCATGTTTT TGGCATGTGT ACTTGTCCTC ATCGTCGACG CTGTTTTGCA

GCTGAGTCCC CTCCTTGGAG CTGTAACTGT GGTTTCCATG ACGCTGCTGC TACTGGCTTT

CGTCCTCTGG CTCTCTTCGC CAGGGGGCCT AGGTACTCTT GGTGCAGCCC TTTTAACATT

GGCAGCAGGT AAGCCACACG TGTGACATTG CTTGCCTTTT TGCCACATGT TTTCTGGACA

CAGGACTAAC CATGCCATCT CTGATTATAG CTCTGGCACT GCTAGCGTCA CTGATTTTGG

GCACACTTAA CTTGACTACA ATGTTCCTTC TCATGCTCCT ATGGACACTT GGTAAGTTTT

CCCTTCCTTT AACTCATTAC TTGTTCTTTT GTAATCGCAG CTCTAACTTG GCATCTCTTT

TACAGTGGTT CTCCTGATTT GCTCTTCGTG CTCTTCATGT CCACTGAGCA AGATCCTTCT

We could try scaling according to the purine–pyrimidine alphabet, that is, A = G = 0 and C = T = 1, but this is not necessarily of interest for every CDS of EBV. Numerous possible alphabets of interest exist. For example, we might focus on the strong–weak hydrogen-bonding alphabet C = G = 0 and A = T = 1. Although model calculations as well as experimental data strongly agree that some kind of periodic signal exists in certain DNA sequences, a large disagreement about the exact type of periodicity exists. In addition, a disagreement exists about which nucleotide alphabets are involved in the signals.

If we consider the naive approach of arbitrarily assigning numerical values (scales) to the categories and then proceeding with a spectral analysis, the result will depend on the particular assignment of numerical values. For example, consider the artificial sequence ACGTACGTACGT. . . . Then, setting A = G = 0 and C = T = 1 yields the numerical sequence 010101010101. . . , or one cycle every two base pairs. Another interesting scaling is A = 1, C = 2, G = 3, and T = 4, which results in the sequence 123412341234. . . , or one cycle every four bp. In this example, both scalings (that is, {A, C, G, T} = {0, 1, 0, 1} and {A, C, G, T} = {1, 2, 3, 4}) of the nucleotides are interesting and bring out different properties of the sequence. Hence, we do not want to focus on only one scaling. Instead, the focus should be on finding all possible scalings that bring out all of the interesting features in the data. Rather than choose values arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic feature that exists in a categorical time series of virtually any


length in a quick and automated fashion. In addition, the technique can help in determining whether a sequence is merely a random assignment of categories.
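A brief R sketch (ours) of the artificial sequence makes the point numerically: the two scalings of ACGTACGT. . . put their power at one cycle every two bp and one cycle every four bp, respectively.

seqn = rep(c("A","C","G","T"), 25)     # 100 bp of the artificial sequence
x1 = c(A=0, C=1, G=0, T=1)[seqn]       # scaling {A,C,G,T} = {0,1,0,1}
x2 = c(A=1, C=2, G=3, T=4)[seqn]       # scaling {A,C,G,T} = {1,2,3,4}
par(mfrow=c(2,1))
spec.pgram(x1, taper=0, log="no")      # all power at frequency 1/2
spec.pgram(x2, taper=0, log="no")      # dominant peak at 1/4 (smaller harmonic at 1/2)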

The Spectral Envelope for Categorical Time Series

As a general description, the spectral envelope is a frequency-based principal components technique applied to a multivariate time series. First, we will focus on the basic concept and its use in the analysis of categorical time series. Technical details can be found in Stoffer et al. (1993).

Briefly, in establishing the spectral envelope for categorical time series, the basic question of how to efficiently discover periodic components in categorical time series was addressed. This was accomplished via nonparametric spectral analysis as follows. Let x_t, t = 0, ±1, ±2, . . . , be a categorical-valued time series with finite state-space C = {c_1, c_2, . . . , c_k}. Assume x_t is stationary and p_j = Pr{x_t = c_j} > 0 for j = 1, 2, . . . , k. For β = (β_1, β_2, . . . , β_k)′ ∈ R^k, denote by x_t(β) the real-valued stationary time series corresponding to the scaling that assigns the category c_j the numerical value β_j, j = 1, 2, . . . , k. The spectral density of x_t(β) will be denoted by f_xx(ω; β). The goal is to find scalings β so the spectral density is in some sense interesting, and to summarize the spectral information by what is called the spectral envelope.

In particular, β is chosen to maximize the power at each frequency ω of interest, relative to the total power σ²(β) = var{x_t(β)}. That is, we choose β(ω), at each ω of interest, so that

λ(ω) = max_β { f_xx(ω; β) / σ²(β) },    (7.181)

over all β not proportional to 1_k, the k × 1 vector of ones. Note that λ(ω) is not defined if β = a1_k for a ∈ R because such a scaling corresponds to assigning each category the same value a; in this case, f_xx(ω; β) ≡ 0 and σ²(β) = 0. The optimality criterion λ(ω) possesses the desirable property of being invariant under location and scale changes of β.

As in most scaling problems for categorical data, it is useful to represent the categories in terms of the unit vectors u_1, u_2, . . . , u_k, where u_j represents the k × 1 vector with a one in the j-th row and zeros elsewhere. We then define a k-dimensional stationary time series y_t by y_t = u_j when x_t = c_j. The time series x_t(β) can be obtained from the y_t series by the relationship x_t(β) = β′y_t. Assume the vector process y_t has a continuous spectral density denoted by f_yy(ω). For each ω, f_yy(ω) is, of course, a k × k complex-valued Hermitian matrix. The relationship x_t(β) = β′y_t implies f_xx(ω; β) = β′f_yy(ω)β = β′f^re_yy(ω)β, where f^re_yy(ω) denotes the real part² of f_yy(ω). The imaginary part disappears from the expression because it is skew-symmetric, that is, f^im_yy(ω)′ = −f^im_yy(ω). The optimality criterion can thus be expressed as


λ(ω) = max_β { β′f^re_yy(ω)β / β′V β },    (7.182)

where V is the variance–covariance matrix of y_t. The resulting scaling β(ω) is called the optimal scaling.

² In this section, it is more convenient to write complex values in the form z = z^re + iz^im, which represents a change from the notation used previously.

The y_t process is a multivariate point process, and any particular component of y_t is the individual point process for the corresponding state (for example, the first component of y_t indicates whether the process is in state c_1 at time t). For any fixed t, y_t represents a single observation from a simple multinomial sampling scheme. It readily follows that V = D − pp′, where p = (p_1, . . . , p_k)′ and D is the k × k diagonal matrix D = diag{p_1, . . . , p_k}. Because, by assumption, p_j > 0 for j = 1, 2, . . . , k, it follows that rank(V) = k − 1 with the null space of V being spanned by 1_k. For any k × (k − 1) full rank matrix Q whose columns are linearly independent of 1_k, Q′V Q is a (k − 1) × (k − 1) positive-definite symmetric matrix.
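These facts about V are easy to check numerically; the probabilities below are made up for illustration.

p = c(.5, .3, .2)            # hypothetical p_j for k = 3 categories
V = diag(p) - p %*% t(p)     # V = D - p p'
qr(V)$rank                   # rank is k - 1 = 2
c(V %*% rep(1, 3))           # V 1_k = 0, so 1_k spans the null space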

With the matrix Q as previously defined, define λ(ω) to be the largest eigenvalue of the determinantal equation

|Q′f^re_yy(ω)Q − λ(ω)Q′V Q| = 0,

and let b(ω) ∈ R^{k−1} be any corresponding eigenvector, that is,

Q′f^re_yy(ω)Q b(ω) = λ(ω)Q′V Q b(ω).

The eigenvalue λ(ω) ≥ 0 does not depend on the choice of Q. Although the eigenvector b(ω) depends on the particular choice of Q, the equivalence class of scalings associated with β(ω) = Qb(ω) does not depend on Q. A convenient choice of Q is Q = [I_{k−1} | 0]′, where I_{k−1} is the (k − 1) × (k − 1) identity matrix and 0 is the (k − 1) × 1 vector of zeros. For this choice, Q′f^re_yy(ω)Q and Q′V Q are the upper (k − 1) × (k − 1) blocks of f^re_yy(ω) and V, respectively. This choice corresponds to setting the last component of β(ω) to zero.
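The following sketch (toy inputs, not data from this chapter) shows one way (7.182) might be solved numerically with this choice of Q: extract the upper (k − 1) × (k − 1) blocks and solve the resulting generalized eigenproblem directly.

k = 3
p = c(.5, .3, .2); V = diag(p) - p %*% t(p)   # multinomial covariance
freal = diag(c(.40, .25, .15))                # hypothetical stand-in for Re{f_yy(omega)}
fQ = freal[1:(k-1), 1:(k-1)]                  # Q' f^re Q for Q = [I_{k-1} | 0]'
VQ = V[1:(k-1), 1:(k-1)]                      # Q' V Q
eg = eigen(solve(VQ) %*% fQ)                  # solves f^re_Q b = lambda V_Q b
lam = max(Re(eg$values))                      # spectral envelope at this frequency
b = Re(eg$vectors[, which.max(Re(eg$values))])
beta = c(b, 0)                                # scaling with the last category set to zero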

The value λ(ω) itself has a useful interpretation; specifically, λ(ω)dω represents the largest proportion of the total power that can be attributed to the frequencies (ω, ω + dω) for any particular scaled process x_t(β), with the maximum being achieved by the scaling β(ω). Because of its central role, λ(ω) is defined to be the spectral envelope of a stationary categorical time series.

The name spectral envelope is appropriate since λ(ω) envelopes the standardized spectrum of any scaled process. That is, given any β normalized so that x_t(β) has total power one, f_xx(ω; β) ≤ λ(ω) with equality if and only if β is proportional to β(ω).

Given observations x_t, for t = 1, . . . , n, on a categorical time series, we form the multinomial point process y_t, for t = 1, . . . , n. Then, the theory for estimating the spectral density of a multivariate, real-valued time series can be applied to estimating f_yy(ω), the k × k spectral density of y_t. Given an estimate f̂_yy(ω) of f_yy(ω), estimates λ̂(ω) and β̂(ω) of the spectral envelope, λ(ω), and the corresponding scalings, β(ω), can then be obtained. Details on estimation and inference for the sample spectral envelope and the optimal scalings can be found in Stoffer et al. (1993), but the main result of that paper is as follows: If f̂_yy(ω) is a consistent spectral estimator and if for each j = 1, . . . , J, the largest root of f^re_yy(ω_j) is distinct, then

{ η_n[λ̂(ω_j) − λ(ω_j)]/λ(ω_j), η_n[β̂(ω_j) − β(ω_j)]; j = 1, . . . , J }    (7.183)

converges (n → ∞) jointly in distribution to independent zero-mean normal distributions, the first of which is standard normal; the asymptotic covariance structure of β̂(ω_j) is discussed in Stoffer et al. (1993). Result (7.183) is similar to (7.150), but in this case, β(ω) and β̂(ω) are real. The term η_n is the same as in (7.150), and its value depends on the type of estimator being used. Based on these results, asymptotic normal confidence intervals and tests for λ(ω) can be readily constructed. Similarly, for β(ω), asymptotic confidence ellipsoids and chi-square tests can be constructed; details can be found in Stoffer et al. (1993, Theorems 3.1–3.3).

Peak searching for the smoothed spectral envelope estimate can be aided using the following approximations. Using a first-order Taylor expansion, we have

log λ̂(ω) ≈ log λ(ω) + [λ̂(ω) − λ(ω)]/λ(ω),    (7.184)

so η_n[log λ̂(ω) − log λ(ω)] is approximately standard normal. It follows that E[log λ̂(ω)] ≈ log λ(ω) and var[log λ̂(ω)] ≈ η_n^{−2}. If no signal is present in a sequence of length n, we expect λ(j/n) ≈ 2/n for 1 < j < n/2, and hence approximately (1 − α) × 100% of the time, log λ̂(ω) will be less than log(2/n) + (z_α/η_n), where z_α is the (1 − α) upper tail cutoff of the standard normal distribution. Exponentiating, the α critical value for λ̂(ω) becomes (2/n) exp(z_α/η_n). Useful values of z_α are z_.001 = 3.09, z_.0001 = 3.71, and z_.00001 = 4.26, and from our experience, thresholding at these levels works well.
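For instance, with hypothetical values of n and 1/η_n (in the examples below, 1/η_n is computed from the smoothing kernel as sqrt(sum(xspec$kernel[-m:m]^2))), the thresholds can be obtained as follows.

n = 4000; etainv = .25                         # assumed n and 1/eta_n
alpha = c(.001, .0001, .00001)
thresh = (2/n)*exp(qnorm(1-alpha)*etainv)      # alpha-level critical values for the envelope
round(100*thresh, 3)                           # in percent, as plotted in the figures below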

Example 7.17 Spectral Analysis of DNA Sequences

We give explicit instructions for the calculations involved in estimating the spectral envelope of a DNA sequence, x_t, for t = 1, . . . , n, using the nucleotide alphabet; a small numerical sketch of the first two steps follows the list.

(i) In this example, we hold the scale for T fixed at zero. In this case, we form the 3 × 1 data vectors y_t:

y_t = (1, 0, 0)′ if x_t = A;  y_t = (0, 1, 0)′ if x_t = C;
y_t = (0, 0, 1)′ if x_t = G;  y_t = (0, 0, 0)′ if x_t = T.

The scaling vector is β = (β_1, β_2, β_3)′, and the scaled process is x_t(β) = β′y_t.


(ii) Calculate the DFT of the data

Y(j/n) = n^{−1/2} Σ_{t=1}^{n} y_t exp(−2πitj/n).

Note Y(j/n) is a 3 × 1 complex-valued vector. Calculate the periodogram, I(j/n) = Y(j/n)Y*(j/n), for j = 1, . . . , [n/2], and retain only the real part, say, I^re(j/n).

(iii) Smooth the I^re(j/n) to obtain an estimate of f^re_yy(j/n). For example, using (7.149) with L = 3 and triangular weighting, we would calculate

f̂^re_yy(j/n) = ¼ I^re((j−1)/n) + ½ I^re(j/n) + ¼ I^re((j+1)/n).

(iv) Calculate the 3 × 3 sample variance–covariance matrix,

S_yy = n^{−1} Σ_{t=1}^{n} (y_t − ȳ)(y_t − ȳ)′,

where ȳ = n^{−1} Σ_{t=1}^{n} y_t is the sample mean of the data.

(v) For each ω_j = j/n, j = 0, 1, . . . , [n/2], determine the largest eigenvalue and the corresponding eigenvector of the matrix 2n^{−1} S_yy^{−1/2} f̂^re_yy(ω_j) S_yy^{−1/2}. Note, S_yy^{1/2} is the unique square root matrix of S_yy.

(vi) The sample spectral envelope λ̂(ω_j) is the eigenvalue obtained in the previous step. If b̂(ω_j) denotes the eigenvector obtained in the previous step, the optimal sample scaling is β̂(ω_j) = S_yy^{−1/2} b̂(ω_j); this will result in three values, the value corresponding to the fourth category, T, being held fixed at zero.
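The following toy sketch (our own, with a made-up eight-bp sequence) carries out steps (i) and (ii) in base R; the smoothing and eigenvalue steps are then done exactly as in the R code of Example 7.18.

xseq = c("A","C","G","T","A","C","G","T")   # hypothetical 8-bp sequence
n = length(xseq)
Y = outer(xseq, c("A","C","G"), "==")*1     # step (i): n x 3 indicator matrix (rows are y_t')
D = mvfft(Y)/sqrt(n)                        # step (ii): DFTs; row j+1 corresponds to frequency j/n
Ire = Re(outer(D[3,], Conj(D[3,])))         # real part of the 3 x 3 periodogram at j/n = 2/8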

Example 7.18 Dynamic Analysis of the Gene Labeled BNRF1 of the Epstein–Barr Virus

In this example, we focus on a dynamic (or sliding-window) analysis of the gene labeled BNRF1 (bp 1736–5689) of Epstein–Barr. Figure 7.24 shows the spectral envelope estimate of the entire coding sequence (3954 bp long). The figure also shows a strong signal at frequency 1/3; the corresponding optimal scaling was A = .10, C = .61, G = .78, T = 0, which indicates the signal is in the strong–weak bonding alphabet, S = {C, G} and W = {A, T}.

Figure 7.25 shows the result of computing the spectral envelope over three nonoverlapping 1000-bp windows and one window of 954 bp, across the CDS, namely, the first, second, third, and fourth quarters of BNRF1. An approximate 0.0001 significance threshold is .69%. The first three quarters contain the signal at the frequency 1/3 (Figure 7.25a–c); the corresponding sample optimal scalings for the first three windows were:



Fig. 7.24. Smoothed sample spectral envelope of the BNRF1 gene from the Epstein–Barr virus.

(a) A = .01, C = .71, G = .71, T = 0; (b) A = .08, C = .71, G = .70, T = 0; (c) A = .20, C = .58, G = .79, T = 0. The first two windows are consistent with the overall analysis. The third section, however, shows some minor departure from the strong–weak bonding alphabet. The most interesting outcome is that the fourth window shows that no signal is present. This leads to the conjecture that the fourth quarter of BNRF1 of Epstein–Barr is actually noncoding.

The R code for this example is as follows.

u = factor(bnrf1ebv)           # first, input the data as factors and then
x = model.matrix(~u-1)[,1:3]   # make an indicator matrix
# x = x[1:1000,]               # select subsequence if desired
Var = var(x)                   # var-cov matrix
xspec = mvspec(x, spans=c(7,7))
fxxr = Re(xspec$fxx)           # fxxr is real(fxx)
# compute Q = Var^-1/2
ev = eigen(Var)
Q = ev$vectors%*%diag(1/sqrt(ev$values))%*%t(ev$vectors)
# compute spec envelope and scale vectors
num = xspec$n.used             # sample size used for FFT
nfreq = length(xspec$freq)     # number of freqs used
specenv = matrix(0,nfreq,1)    # initialize the spec envelope
beta = matrix(0,nfreq,3)       # initialize the scale vectors
for (k in 1:nfreq){
  ev = eigen(2*Q%*%fxxr[,,k]%*%Q/num, symmetric=TRUE)
  specenv[k] = ev$values[1]    # spec env at freq k/n is max evalue
  b = Q%*%ev$vectors[,1]       # beta at freq k/n


Fig. 7.25. Smoothed sample spectral envelope of the BNRF1 gene from the Epstein–Barr virus: (a) first 1000 bp, (b) second 1000 bp, (c) third 1000 bp, and (d) last 954 bp.

  beta[k,] = b/sqrt(sum(b^2)) }  # helps to normalize beta
# output and graphics
frequency = xspec$freq
plot(frequency, 100*specenv, type="l", ylab="Spectral Envelope (%)")
# add significance threshold to plot
m = xspec$kernel$m
etainv = sqrt(sum(xspec$kernel[-m:m]^2))
thresh = 100*(2/num)*exp(qnorm(.9999)*etainv)*rep(1,nfreq)
lines(frequency, thresh, lty="dashed", col="blue")
# details
output = cbind(frequency, specenv, beta)
colnames(output) = c("freq","specenv", "A", "C", "G")
round(output,3)


The Spectral Envelope for Real-Valued Time Series

The concept of the spectral envelope for categorical time series was extended to real-valued time series, {x_t; t = 0, ±1, ±2, . . .}, in McDougall et al. (1997). The process x_t can be vector-valued, but here we will concentrate on the univariate case. Further details can be found in McDougall et al. (1997). The concept is similar to projection pursuit (Friedman and Stuetzle, 1981). Let G denote a k-dimensional vector space of continuous real-valued transformations with {g_1, . . . , g_k} being a set of basis functions satisfying E[g_i(x_t)²] < ∞, i = 1, . . . , k. Analogous to the categorical time series case, define the scaled time series with respect to the set G to be the real-valued process

x_t(β) = β′y_t = β_1 g_1(x_t) + · · · + β_k g_k(x_t)

obtained from the vector process

y_t = (g_1(x_t), . . . , g_k(x_t))′,

where β = (β_1, . . . , β_k)′ ∈ R^k. If the vector process y_t is assumed to have a continuous spectral density, say, f_yy(ω), then x_t(β) will have a continuous spectral density f_xx(ω; β) for all β ≠ 0. Noting that f_xx(ω; β) = β′f_yy(ω)β = β′f^re_yy(ω)β, and σ²(β) = var[x_t(β)] = β′V β, where V = var(y_t) is assumed to be positive definite, the optimality criterion

λ(ω) = sup_{β ≠ 0} { β′f^re_yy(ω)β / β′V β },    (7.185)

is well defined and represents the largest proportion of the total power that can be attributed to the frequency ω for any particular scaled process x_t(β). This interpretation of λ(ω) is consistent with the notion of the spectral envelope introduced in the previous section and provides the following working definition: The spectral envelope of a time series with respect to the space G is defined to be λ(ω).

The solution to this problem, as in the categorical case, is attained by finding the largest scalar λ(ω) such that

f^re_yy(ω)β(ω) = λ(ω)V β(ω)    (7.186)

for β(ω) ≠ 0. That is, λ(ω) is the largest eigenvalue of f^re_yy(ω) in the metric of V, and the optimal scaling, β(ω), is the corresponding eigenvector.

If x_t is a categorical time series taking values in the finite state-space S = {c_1, c_2, . . . , c_k}, where c_j represents a particular category, an appropriate choice for G is the set of indicator functions g_j(x_t) = I(x_t = c_j). Hence, this is a natural generalization of the categorical case. In the categorical case, G does not consist of linearly independent g's, but it was easy to overcome this problem by reducing the dimension by one.


Fig. 7.26. Spectral envelope with respect to G = {x, |x|, x²} of the residuals from an MA(2) fit to the U.S. GNP growth rate data.

In the vector-valued case, x_t = (x_{1t}, . . . , x_{pt})′, we consider G to be the class of transformations from R^p into R such that the spectral density of g(x_t) exists. One class of transformations of interest is linear combinations of x_t. In Tiao et al. (1993), for example, linear transformations of this type are used in a time domain approach to investigate contemporaneous relationships among the components of multivariate time series. Estimation and inference for the real-valued case are analogous to the methods described in the previous section for the categorical case. We focus on two examples here; numerous other examples can be found in McDougall et al. (1997).

Example 7.19 Residual Analysis

A relevant situation may be when x_t is the residual process obtained from some modeling procedure. If the fitted model is appropriate, the residuals should exhibit properties similar to an iid sequence. Departures of the data from the fitted model may suggest model misspecification, non-Gaussian data, or the existence of a nonlinear structure, and the spectral envelope would provide a simple diagnostic tool to aid in a residual analysis.

The series considered here is the quarterly U.S. real GNP, which was analyzed in Chapter 3, Examples (3.38) and (3.39). Recall an MA(2) model was fit to the growth rate, and the residuals from this fit are plotted in Figure 3.17. As discussed in Example (3.39), the residuals from the model fit appear to be uncorrelated; there appear to be one or two outliers, but their magnitudes are not that extreme. In addition, the standard residual analyses showed no obvious structure among the residuals.

Although the MA(2) model appears to be appropriate, Tiao and Tsay (1994) investigated the possibility of nonlinearities in GNP growth rate. Their overall conclusion was that there is subtle nonlinear behavior in the data because the economy behaves differently during expansion periods than during recession periods.

The spectral envelope, used as a diagnostic tool on the residuals, clearly indicates the MA(2) model is not adequate, and that further analysis is warranted. Here, the generating set G = {x, |x|, x²}, which seems natural for a residual analysis, was used to estimate the spectral envelope for the residuals from the MA(2) fit, and the result is plotted in Figure 7.26. Clearly, the residuals are not iid, and considerable power is present at the low frequencies. The presence of spectral power at very low frequencies in detrended economic series has been frequently reported and is typically associated with long-range dependence. In fact, our choice of G was partly influenced by the work of Ding et al. (1993) who applied transformations of the form |x_t|^d, for d ∈ (0, 3], to the S&P 500 stock market series. The estimated optimal transformation at the first nonzero frequency, ω = .018, was β̂(.018) = (1, 22, −478)′, which leads to the transformation

g(x) = x + 22|x| − 478x².    (7.187)

This transformation is plotted in Figure 7.27. The transformation given in (7.187) is basically the absolute value (with some slight curvature and asymmetry) for most of the residual values, but the effect of extreme-valued residuals (outliers) is dampened.

The following R code, which is nearly identical to the previous example, was used in this example.

u = arima(diff(log(gnp)), order=c(0,0,2))$resid  # residuals
x = cbind(u, abs(u), u^2)                        # transformation set
Var = var(x)
xspec = mvspec(x, spans=c(5,3), taper=.1)
fxxr = Re(xspec$fxx); ev = eigen(Var)
Q = ev$vectors%*%diag(1/sqrt(ev$values))%*%t(ev$vectors)
num = xspec$n.used; nfreq = length(xspec$freq)
specenv = matrix(0, nfreq, 1); beta = matrix(0,nfreq,3)
for (k in 1:nfreq){
  ev = eigen(2*Q%*%fxxr[,,k]%*%Q/num)
  specenv[k] = ev$values[1]
  b = Q%*%ev$vectors[,1]
  beta[k,] = b/b[1] }
# output and graphics
frequency = xspec$freq
plot(frequency, 100*specenv, type="l", ylab="Spectral Envelope (%)")
output = cbind(frequency, specenv, beta)
colnames(output) = c("freq","specenv","x", "|x|", "x^2")


Fig. 7.27. Estimated optimal transformation, (7.187), for the GNP residuals at ω = .018.

round(output,4)   # results (not shown)
# plot transformation
b = output[1, 3:5]; g = function(x) b[1]*x + b[2]*abs(x) + b[3]*x^2
curve(g, -.04, .04)

Example 7.20 Optimal Transformations

In this example, we consider a contrived data set, in which we know the optimal transformation, say, g_0, and we determine whether the technology can find the transformation when g_0 is not in G. The data, x_t, are generated by the nonlinear model

xt = exp{3 sin(2πtω0) + εt}, t = 1, . . . , 500, (7.188)

where ω_0 = .1 and ε_t is white Gaussian noise with a variance of 16. This example is adapted from Breiman and Friedman (1985), where the ACE algorithm is introduced. The optimal transformation in this case is g_0(x_t) = ln(x_t), wherein the data are generated from a sinusoid plus noise. Of the 500 generated data, about 96% were less than 4000. Occasionally, the data values were extremely large (the data exceeded 100,000 five times). The spectral estimate of the data is shown in Figure 7.28 and provides no evidence of any dominant frequency, including ω_0. In contrast, the sample spectral envelope (Figure 7.29) computed with respect to G = {x, √x, ∛x} has no difficulty in isolating ω_0.


Fig. 7.28. Periodogram, in decibels, of the data generated from (7.188) after tapering by a cosine bell.

Figure 7.30 compares the estimated optimal transformation with respect to G with the log transformation for values less than 4000. The estimated transformation at ω_0 is given by

ĝ(x) = 4.5 + .0001x − .1187√x + .6887 ∛x;    (7.189)

that is, β̂(ω_0) = (.0001, −.1187, .6887)′ after rescaling so (7.189) can be compared directly with y = ln(x).

Finally, it is worth mentioning the result obtained when the rather inappropriate basis, {x, x², x³}, was used. Surprisingly, the spectral envelope in this case looks similar to that of Figure 7.29. Also, the resulting estimated optimal transformation at ω_0 is close to the log transformation for large values of the data. Details may be found in McDougall et al. (1997).

The R code for this example is as follows.

set.seed(90210)
u = exp(3*sin(2*pi*1:500*.1) + rnorm(500,0,4))   # the data
spec.pgram(u, spans=c(5,3), taper=.5, log="dB")
dev.new()
x = cbind(u, sqrt(u), u^(1/3))                   # transformation set
Var = var(x)
xspec = mvspec(x, spans=c(5,3), taper=.5)
fxxr = Re(xspec$fxx)


Fig. 7.29. Spectral envelope with respect to G = {x, √x, ∛x} of data generated from (7.188).

ev = eigen(Var)
Q = ev$vectors%*%diag(1/sqrt(ev$values))%*%t(ev$vectors)
num = xspec$n.used
nfreq = length(xspec$freq)
specenv = matrix(0,nfreq,1)
beta = matrix(0,nfreq,3)
for (k in 1:nfreq){
  ev = eigen(2*Q%*%fxxr[,,k]%*%Q/num)
  specenv[k] = ev$values[1]
  b = Q%*%ev$vectors[,1]
  beta[k,] = b/sign(b[1]) }
# Output and Graphics
frequency = xspec$freq
plot(frequency, 100*specenv, type="l", ylab="Spectral Envelope (%)")
m = xspec$kernel$m
etainv = sqrt(sum(xspec$kernel[-m:m]^2))
thresh = 100*(2/num)*exp(qnorm(.999)*etainv)*rep(1,nfreq)
lines(frequency, thresh, lty="dashed", col="blue")
output = cbind(frequency, specenv, beta)
colnames(output) = c("freq","specenv","x", "sqrt(x)", "x^(1/3)")
round(output,4)
# Plot Transform
dev.new()
b = output[50,3:5]


Fig. 7.30. The estimated optimal transformation at ω_0 as given in (7.189) (solid line) and the log transformation, g(x) = ln(x) (dashed line).

g = function(x) 4.5 + b[1]*x + b[2]*sqrt(x) + b[3]*x^(1/3)
curve(g, 1, 4000)
lines(log(1:4000), lty=2)

Problems

Section 7.2

7.1 Consider the complex Gaussian distribution for the random variable X = X_c − iX_s, as defined in (7.1)–(7.3), where the argument ω_k has been suppressed. Now, the 2p × 1 real random variable Z = (X′_c, X′_s)′ has a multivariate normal distribution with density

p(Z) = (2π)^{−p} |Σ|^{−1/2} exp{ −½ (Z − µ)′Σ^{−1}(Z − µ) },

where µ = (M′_c, M′_s)′ is the mean vector. Prove

|Σ| = (1/2)^{2p} |C − iQ|²,

using the result that the eigenvectors and eigenvalues of Σ occur in pairs, i.e., (v′_c, v′_s)′ and (v′_s, −v′_c)′, where v_c − iv_s denotes the eigenvector of f_xx. Show that

½ (Z − µ)′Σ^{−1}(Z − µ) = (X − M)* f^{−1} (X − M)

so p(X) = p(Z) and we can identify the density of the complex multivariate normal variable X with that of the real multivariate normal Z.

7.2 Prove f̂ in (7.6) maximizes the log likelihood (7.5) by minimizing the negative of the log likelihood

L ln|f| + L tr{f̂ f^{−1}}

in the form

L Σ_i (λ_i − ln λ_i − 1) + Lp + L ln|f̂|,

where the λ_i values correspond to the eigenvalues in a simultaneous diagonalization of the matrices f̂ and f; i.e., there exists a matrix P such that P*fP = I and P*f̂P = diag(λ_1, . . . , λ_p) = Λ. Note, λ_i − ln λ_i − 1 ≥ 0 with equality if and only if λ_i = 1, implying Λ = I maximizes the log likelihood and f = f̂ is the maximizing value.

Section 7.3

7.3 Verify (7.18) and (7.19) for the mean-squared prediction error MSE in (7.11). Use the orthogonality principle, which implies

MSE = E[ (y_t − Σ_{r=−∞}^{∞} β′_r x_{t−r}) y_t ]

and gives a set of equations involving the autocovariance functions. Then, use the spectral representations and Fourier transform results to get the final result. Next, consider the predicted series

ŷ_t = Σ_{r=−∞}^{∞} β′_r x_{t−r},

where β_r satisfies (7.13). Show the ordinary coherence between y_t and ŷ_t is exactly the multiple coherence (7.20).

7.4 Consider the complex regression model (7.28) in the form

Y = XB + V,

where Y = (Y_1, Y_2, . . . , Y_L)′ denotes the observed DFTs after they have been re-indexed and X = (X_1, X_2, . . . , X_L)′ is a matrix containing the reindexed input vectors. The model is a complex regression model with Y = Y_c − iY_s, X = X_c − iX_s, B = B_c − iB_s, and V = V_c − iV_s denoting the representation in terms of the usual cosine and sine transforms. Show the partitioned real regression model involving the 2L × 1 vector of cosine and sine transforms, say,

( Y_c )   ( X_c  −X_s ) ( B_c )   ( V_c )
(     ) = (           ) (     ) + (     ),
( Y_s )   ( X_s   X_c ) ( B_s )   ( V_s )

is isomorphic to the complex regression model in the sense that the real and imaginary parts of the complex model appear as components of the vectors in the real regression model. Use the usual regression theory to verify (7.27) holds. For example, writing the real regression model as

y = xb + v,

the isomorphism would imply

L(f_yy − f*_xy f^{−1}_xx f_xy) = Y*Y − Y*X(X*X)^{−1}X*Y = y′y − y′x(x′x)^{−1}x′y.

Section 7.4

7.5 Consider estimating the function

ψ_t = Σ_{r=−∞}^{∞} a′_r β_{t−r}

by a linear filter estimator of the form

ψ̂_t = Σ_{r=−∞}^{∞} a′_r β̂_{t−r},

where β̂_t is defined by (7.42). Show a sufficient condition for ψ̂_t to be an unbiased estimator; i.e., E ψ̂_t = ψ_t, is

H(ω)Z(ω) = I

for all ω. Similarly, show any other unbiased estimator satisfying the above condition has minimum variance (see Shumway and Dean, 1968), so the estimator given is a best linear unbiased (BLUE) estimator.

7.6 Consider a linear model with mean value function µ_t and a signal α_t delayed by an amount τ_j on each sensor, i.e.,

y_{jt} = µ_t + α_{t−τ_j} + v_{jt}.

Show the estimators (7.42) for the mean and the signal are the Fourier transforms of

M(ω) = [Y·(ω) − φ(ω)B_w(ω)] / [1 − |φ(ω)|²]

and

A(ω) = [B_w(ω) − φ(ω)Y·(ω)] / [1 − |φ(ω)|²],

where

φ(ω) = (1/N) Σ_{j=1}^{N} e^{2πiωτ_j}

and B_w(ω) is defined in (7.64).

Section 7.5

7.7 Consider the estimator (7.67) as applied in the context of the random coefficient model (7.65). Prove the filter coefficients for the minimum mean square estimator can be determined from (7.68) and the mean square covariance is given by (7.71).

7.8 For the random coefficient model, verify the expected mean square of the regression power component is

E[SSR(ω_k)] = E[Y*(ω_k)Z(ω_k)S_z^{−1}(ω_k)Z*(ω_k)Y(ω_k)] = L f_β(ω_k) tr{S_z(ω_k)} + Lq f_v(ω_k).

Recall, the underlying frequency domain model is

Y(ω_k) = Z(ω_k)B(ω_k) + V(ω_k),

where B(ω_k) has spectrum f_β(ω_k)I_q and V(ω_k) has spectrum f_v(ω_k)I_N and the two processes are uncorrelated.

Section 7.6

7.9 Suppose we have I = 2 groups and the models

y_{1jt} = µ_t + α_{1t} + v_{1jt}

for the j = 1, . . . , N observations in group 1 and

y_{2jt} = µ_t + α_{2t} + v_{2jt}

for the j = 1, . . . , N observations in group 2, with α_{1t} + α_{2t} = 0. Suppose we want to test equality of the two group means; i.e.,

y_{ijt} = µ_t + v_{ijt},  i = 1, 2.

Derive the residual and error power components corresponding to (7.83) and (7.84) for this particular case. Verify the forms of the linear compounds involving the mean given in (7.90) and (7.91), using (7.88) and (7.89). Show the ratio of the two smoothed spectra in (7.103) has the indicated F-distribution when f_1(ω) = f_2(ω). When the spectra are not equal, show the variable is proportional to an F-distribution, where the proportionality constant depends on the ratio of the spectra.


Section 7.7

7.10 The problem of detecting a signal in noise can be considered using the model

x_t = s_t + w_t,  t = 1, . . . , n,

for p_1(x) when a signal is present and the model

x_t = w_t,  t = 1, . . . , n,

for p_2(x) when no signal is present. Under multivariate normality, we might specialize even further by assuming the vector w = (w_1, . . . , w_n)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ = σ²_w I_n, corresponding to white noise. Assuming the signal vector s = (s_1, . . . , s_n)′ is fixed and known, show the discriminant function (7.112) becomes the matched filter

(1/σ²_w) Σ_{t=1}^{n} s_t x_t − ½ (S/N) + ln(π_1/π_2),

where

(S/N) = Σ_{t=1}^{n} s_t² / σ²_w

denotes the signal-to-noise ratio. Give the decision criterion if the prior probabilities are assumed to be the same. Express the false alarm and missed signal probabilities in terms of the normal cdf and the signal-to-noise ratio.

7.11 Assume the same additive signal plus noise representations as in the previous problem, except, the signal is now a random process with a zero mean and covariance matrix σ²_s I. Derive the comparable version of (7.115) as a quadratic detector, and characterize its performance under both hypotheses in terms of constant multiples of the chi-squared distribution.

Section 7.8

7.12 Perform principal component analyses on the stimulus conditions (i) awake-heat and (ii) awake-shock, and compare your results to the results of Example 7.14. Use the data in fmri and average across subjects.

7.13 For this problem, consider the first three earthquake series (EQ1, EQ2, EQ3) listed in eqexp.

(a) Estimate and compare the spectral density of the P component and then of the S component for each individual earthquake.

(b) Estimate and compare the squared coherency between the P and S components of each individual earthquake. Comment on the strength of the coherence.

(c) Let x_{ti} be the P component of earthquake i = 1, 2, 3, and let x_t = (x_{t1}, x_{t2}, x_{t3})′ be the 3 × 1 vector of P components. Estimate the spectral density, λ_1(ω), of the first principal component series of x_t. Compare this to the corresponding spectra calculated in (a).

(d) Analogous to part (c), let y_t denote the 3 × 1 vector series of S components of the first three earthquakes. Repeat the analysis of part (c) on y_t.

7.14 In the factor analysis model (7.154), let p = 3, q = 1, and

Σ_xx = [  1   .4   .9
         .4    1   .7
         .9   .7    1 ].

Show there is a unique choice for B and D, but δ_3² < 0, so the choice is not valid.

7.15 Extend the EM algorithm for classical factor analysis, (7.160)–(7.165), to the time series case of maximizing ln L(B(ω_j), D_εε(ω_j)) in (7.176). Then, for the data used in Example 7.16, find the approximate maximum likelihood estimates of B(ω_j) and D_εε(ω_j), and, consequently, Λ_t.

Section 7.9

7.16 Verify, as stated in (7.181), the imaginary part of a k × k spectral matrix, f^im(ω), is skew symmetric, and then show β′f^im_yy(ω)β = 0 for a real k × 1 vector β.

7.17 Repeat the analysis of Example 7.18 on BNRF1 of herpesvirus saimiri (the data file is bnrf1hvs), and compare the results with the results obtained for Epstein–Barr.

7.18 For the NYSE returns, say, r_t, analyzed in Chapter 5, Example 5.5:

(a) Estimate the spectrum of the r_t. Does the spectral estimate appear to support the hypothesis that the returns are white?

(b) Examine the possibility of spectral power near the zero frequency for a transformation of the returns, say, g(r_t), using the spectral envelope with Example 7.19 as your guide. Compare the optimal transformation near or at the zero frequency with the usual transformation y_t = r_t².


Appendix A
Large Sample Theory

A.1 Convergence Modes

The study of the optimality properties of various estimators (such as the sample autocorrelation function) depends, in part, on being able to assess the large-sample behavior of these estimators. We summarize briefly here the kinds of convergence useful in this setting, namely, mean square convergence, convergence in probability, and convergence in distribution.

We consider first a particular class of random variables that plays an important role in the study of second-order time series, namely, the class of random variables belonging to the space L², satisfying E|x|² < ∞. In proving certain properties of the class L² we will often use, for random variables x, y ∈ L², the Cauchy–Schwarz inequality,

|E(xy)|² ≤ E(|x|²)E(|y|²),    (A.1)

and the Tchebycheff inequality,

P{|x| ≥ a} ≤ E(|x|²)/a²,    (A.2)

for a > 0.

Next, we investigate the properties of mean square convergence of random variables in L².

Definition A.1 A sequence of L² random variables {x_n} is said to converge in mean square to a random variable x ∈ L², denoted by

x_n →^{ms} x,    (A.3)

if and only if

E|x_n − x|² → 0    (A.4)

as n → ∞.


Example A.1 Mean Square Convergence of the Sample Mean

Consider the white noise sequence w_t and the signal plus noise series

x_t = µ + w_t.

Then, because

E|x̄_n − µ|² = σ²_w/n → 0

as n → ∞, where x̄_n = n^{−1} Σ_{t=1}^{n} x_t is the sample mean, we have x̄_n →^{ms} µ.
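A quick simulation (ours, not part of the example) illustrates the rate: the empirical mean squared error of the sample mean tracks σ²_w/n.

set.seed(101)
mu = 2; sigw = 1
for (n in c(10, 100, 1000)) {
  xbar = replicate(2000, mean(mu + rnorm(n, 0, sigw)))
  cat("n =", n, " empirical MSE =", round(mean((xbar - mu)^2), 4),
      "  sigma_w^2/n =", sigw^2/n, "\n")
}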

We summarize some of the properties of mean square convergence as follows. If x_n →^{ms} x and y_n →^{ms} y, then, as n → ∞,

E(x_n) → E(x);    (A.5)
E(|x_n|²) → E(|x|²);    (A.6)
E(x_n y_n) → E(xy).    (A.7)

We also note the L² completeness theorem known as the Riesz–Fisher Theorem.

Theorem A.1 Let {x_n} be a sequence in L². Then, there exists an x in L² such that x_n →^{ms} x if and only if

E|x_n − x_m|² → 0    (A.8)

for m, n → ∞.

Often the condition of Theorem A.1 is easier to verify to establish that a mean square limit x exists without knowing what it is. Sequences that satisfy (A.8) are said to be Cauchy sequences in L² and (A.8) is also known as the Cauchy criterion for L².

Example A.2 Time Invariant Linear Filter

As an important example of the use of the Riesz–Fisher Theorem and the properties of mean square convergent series given in (A.5)–(A.7), a time-invariant linear filter is defined as a convolution of the form

y_t = Σ_{j=−∞}^{∞} a_j x_{t−j}    (A.9)

for each t = 0, ±1, ±2, . . . , where x_t is a weakly stationary input series with mean µ_x and autocovariance function γ_x(h), and a_j, for j = 0, ±1, ±2, . . . , are constants satisfying

Σ_{j=−∞}^{∞} |a_j| < ∞.    (A.10)


The output series y_t defines a filtering or smoothing of the input series that changes the character of the time series in a predictable way. We need to know the conditions under which the outputs y_t in (A.9) and the linear process (1.29) exist.

Considering the sequence

y_t^n = Σ_{j=−n}^{n} a_j x_{t−j},    (A.11)

n = 1, 2, . . . , we need to show first that y_t^n has a mean square limit. By Theorem A.1, it is enough to show that

E|y_t^n − y_t^m|² → 0

as m, n → ∞. For n > m > 0,

E|y_t^n − y_t^m|² = E| Σ_{m<|j|≤n} a_j x_{t−j} |²
  = Σ_{m<|j|≤n} Σ_{m<|k|≤n} a_j a_k E(x_{t−j} x_{t−k})
  ≤ Σ_{m<|j|≤n} Σ_{m<|k|≤n} |a_j||a_k| |E(x_{t−j} x_{t−k})|
  ≤ Σ_{m<|j|≤n} Σ_{m<|k|≤n} |a_j||a_k| (E|x_{t−j}|²)^{1/2} (E|x_{t−k}|²)^{1/2}
  = γ_x(0) ( Σ_{m<|j|≤n} |a_j| )²  →  0

as m, n → ∞, because γ_x(0) is a constant and {a_j} is absolutely summable (the second inequality follows from the Cauchy–Schwarz inequality).

Although we know that the sequence {y_t^n} given by (A.11) converges in mean square, we have not established its mean square limit. It should be obvious, however, that y_t^n →^{ms} y_t as n → ∞, where y_t is given by (A.9).¹

Finally, we may use (A.5) and (A.7) to establish the mean, µ_y, and autocovariance function, γ_y(h), of y_t. In particular we have,

µ_y = µ_x Σ_{j=−∞}^{∞} a_j,    (A.12)

¹ If S denotes the mean square limit of y_t^n, then using Fatou's Lemma, E|S − y_t|² = E lim inf_{n→∞} |S − y_t^n|² ≤ lim inf_{n→∞} E|S − y_t^n|² = 0, which establishes that y_t is the mean square limit of y_t^n.


and

γ_y(h) = E[ Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} a_j(x_{t+h−j} − µ_x) a_k(x_{t−k} − µ_x) ]
       = Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} a_j γ_x(h − j + k) a_k.    (A.13)
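As a concrete illustration (our own sketch, with simulated input), a five-point moving average of a stationary AR(1) series is a filter of the form (A.9) with absolutely summable weights, and the sample mean of the output agrees with (A.12).

set.seed(102)
x = arima.sim(list(ar = .8), n = 512) + 5   # weakly stationary input with mu_x = 5
a = rep(1/5, 5)                             # weights a_j with sum |a_j| < infinity
y = stats::filter(x, a, sides = 2)          # y_t = sum_j a_j x_{t-j}
c(mean(y, na.rm = TRUE), 5*sum(a))          # compare with (A.12): mu_y = mu_x * sum(a_j)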

A second important kind of convergence is convergence in probability.

Definition A.2 The sequence {x_n}, for n = 1, 2, . . . , converges in probability to a random variable x, denoted by

x_n →^{p} x,    (A.14)

if and only if

P{|x_n − x| > ε} → 0    (A.15)

for all ε > 0, as n → ∞.

An immediate consequence of the Tchebycheff inequality, (A.2), is that

P{|x_n − x| ≥ ε} ≤ E(|x_n − x|²)/ε²,

so convergence in mean square implies convergence in probability, i.e.,

x_n →^{ms} x  ⇒  x_n →^{p} x.    (A.16)

This result implies, for example, that the filter (A.9) exists as a limit in probability because it converges in mean square [it is also easily established that (A.9) exists with probability one]. We mention, at this point, the useful Weak Law of Large Numbers which states that, for an independent identically distributed sequence x_n of random variables with mean µ, we have

x̄_n →^{p} µ    (A.17)

as n → ∞, where x̄_n = n^{−1} Σ_{t=1}^{n} x_t is the usual sample mean.

We also will make use of the following concepts.

Definition A.3 For order in probability we write

x_n = o_p(a_n)    (A.18)

if and only if

x_n/a_n →^{p} 0.    (A.19)

The term boundedness in probability, written x_n = O_p(a_n), means that for every ε > 0, there exists a δ(ε) > 0 such that

P{ |x_n/a_n| > δ(ε) } ≤ ε    (A.20)

for all n.


Under this convention, e.g., the notation for x_n →^{p} x becomes x_n − x = o_p(1). The definitions can be compared with their nonrandom counterparts, namely, for a fixed sequence, x_n = o(1) if x_n → 0 and x_n = O(1) if x_n, for n = 1, 2, . . . , is bounded. Some handy properties of o_p(·) and O_p(·) are as follows.

(i) If x_n = o_p(a_n) and y_n = o_p(b_n), then x_n y_n = o_p(a_n b_n) and x_n + y_n = o_p(max(a_n, b_n)).
(ii) If x_n = o_p(a_n) and y_n = O_p(b_n), then x_n y_n = o_p(a_n b_n).
(iii) Statement (i) is true if O_p(·) replaces o_p(·).

Example A.3 Convergence and Order in Probability for the Sample Mean

For the sample mean, x̄_n, of iid random variables with mean µ and variance σ², by the Tchebycheff inequality,

P{|x̄_n − µ| > ε} ≤ E[(x̄_n − µ)²]/ε² = σ²/(nε²) → 0,

as n → ∞. It follows that x̄_n →^{p} µ, or x̄_n − µ = o_p(1). To find the rate, it follows that, for δ(ε) > 0,

P{ √n |x̄_n − µ| > δ(ε) } ≤ (σ²/n)/(δ²(ε)/n) = σ²/δ²(ε)

by Tchebycheff's inequality, so taking ε = σ²/δ²(ε) shows that δ(ε) = σ/√ε does the job and

x̄_n − µ = O_p(n^{−1/2}).
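A short simulation (ours) of the rate: the standard deviation of √n(x̄_n − µ) stays near σ as n grows, so the normalized quantity is bounded in probability.

set.seed(103)
for (n in c(100, 2500, 10000)) {
  z = replicate(1000, sqrt(n)*(mean(rnorm(n, mean = 1, sd = 2)) - 1))
  cat("n =", n, "  sd of sqrt(n)(xbar - mu) =", round(sd(z), 2), "\n")  # stays near sigma = 2
}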

For k × 1 random vectors x_n, convergence in probability, written x_n →^{p} x or x_n − x = o_p(1), is defined as element-by-element convergence in probability, or equivalently, as convergence in terms of the Euclidean distance

‖x_n − x‖ →^{p} 0,    (A.21)

where ‖a‖² = Σ_j a_j² for any vector a. In this context, we note the result that if x_n →^{p} x and g(x_n) is a continuous mapping,

g(x_n) →^{p} g(x).    (A.22)

Furthermore, if x_n − a = O_p(δ_n) with δ_n → 0 and g(·) is a function with first derivatives continuous in a neighborhood of a = (a_1, a_2, . . . , a_k)′, we have the Taylor series expansion in probability

g(x_n) = g(a) + ( ∂g(x)/∂x |_{x=a} )′ (x_n − a) + O_p(δ_n),    (A.23)

where

∂g(x)/∂x |_{x=a} = ( ∂g(x)/∂x_1 |_{x=a}, . . . , ∂g(x)/∂x_k |_{x=a} )′

denotes the vector of partial derivatives with respect to x_1, x_2, . . . , x_k, evaluated at a. This result remains true if O_p(δ_n) is replaced everywhere by o_p(δ_n).

Example A.4 Expansion for the Logarithm of the Sample Mean

With the same conditions as Example A.3, consider g(x̄_n) = log x̄_n, which has a derivative at µ, for µ > 0. Then, because x̄_n − µ = O_p(n^{−1/2}) from Example A.3, the conditions for the Taylor expansion in probability, (A.23), are satisfied and we have

log x̄_n = log µ + µ^{−1}(x̄_n − µ) + O_p(n^{−1/2}).
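A quick numerical check (ours) of the expansion for a single simulated sample:

set.seed(104)
mu = 3; n = 500
xbar = mean(rexp(n, rate = 1/mu))                        # iid positive data with mean mu
c(exact = log(xbar), taylor = log(mu) + (xbar - mu)/mu)  # the two agree closely for large n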

The large sample distributions of the sample mean and sample autocorrelation functions defined earlier can be developed using the notion of convergence in distribution.

Definition A.4 A sequence of k × 1 random vectors {x_n} is said to converge in distribution, written

x_n →^{d} x,    (A.24)

if and only if

F_n(x) → F(x)    (A.25)

at the continuity points of distribution function F(·).

Example A.5 Convergence in Distribution

Consider a sequence {x_n} of iid normal random variables with mean zero and variance 1/n. Now, using the standard normal cdf, say Φ(z) = (1/√(2π)) ∫_{−∞}^{z} exp{−u²/2} du, we have F_n(z) = Φ(√n z), so

F_n(z) → 0 for z < 0;   F_n(z) → 1/2 for z = 0;   F_n(z) → 1 for z > 0,

and we may take

F(z) = 0 for z < 0,   F(z) = 1 for z ≥ 0,

because the point where the two functions differ is not a continuity point of F(z).

The distribution function relates uniquely to the characteristic function through the Fourier transform, defined as a function with vector argument λ = (λ_1, λ_2, . . . , λ_k)′, say

φ(λ) = E(exp{iλ′x}) = ∫ exp{iλ′x} dF(x).    (A.26)

Hence, for a sequence {x_n} we may characterize convergence in distribution of F_n(·) in terms of convergence of the sequence of characteristic functions φ_n(·), i.e.,

φ_n(λ) → φ(λ)  ⇔  F_n(x) →^{d} F(x),    (A.27)

where ⇔ means that the implication goes both directions. In this connection, we have

Proposition A.1 The Cramér–Wold device. Let {x_n} be a sequence of k × 1 random vectors. Then, for every c = (c_1, c_2, . . . , c_k)′ ∈ R^k,

c′x_n →^{d} c′x  ⇔  x_n →^{d} x.    (A.28)

Proposition A.1 can be useful because sometimes it is easier to show the convergence in distribution of c′x_n than of x_n directly.

Convergence in probability implies convergence in distribution, namely,

x_n →^{p} x  ⇒  x_n →^{d} x,    (A.29)

but the converse is only true when x_n →^{d} c, where c is a constant vector. If x_n →^{d} x and y_n →^{d} c are two sequences of random vectors and c is a constant vector,

x_n + y_n →^{d} x + c   and   y′_n x_n →^{d} c′x.    (A.30)

For a continuous mapping h(x),

x_n →^{d} x  ⇒  h(x_n) →^{d} h(x).    (A.31)

A number of results in time series depend on making a series of approximations to prove convergence in distribution. For example, we have that if x_n →^{d} x and x_n can be approximated by the sequence y_n in the sense that

y_n − x_n = o_p(1),    (A.32)

then y_n →^{d} x, so the approximating sequence y_n has the same limiting distribution as x. We present the following Basic Approximation Theorem (BAT) that will be used later to derive asymptotic distributions for the sample mean and ACF.

Theorem A.2 [Basic Approximation Theorem (BAT)] Let x_n for n = 1, 2, . . . , and y_{mn} for m = 1, 2, . . . , be random k × 1 vectors such that

(i) y_{mn} →^{d} y_m as n → ∞ for each m;
(ii) y_m →^{d} y as m → ∞;
(iii) lim_{m→∞} lim sup_{n→∞} P{|x_n − y_{mn}| > ε} = 0 for every ε > 0.

Then, x_n →^{d} y.

As a practical matter, the BAT condition (iii) is implied by the Tchebycheff inequality if

(iii′)  E{|x_n − y_{mn}|²} → 0    (A.33)

as m, n → ∞, and (iii′) is often much easier to establish than (iii).

The theorem allows approximation of the underlying sequence in two steps, through the intermediary sequence y_{mn}, depending on two arguments. In the time series case, n is generally the sample length and m is generally the number of terms in an approximation to the linear process of the form (A.11).

Proof. The proof of the theorem is a simple exercise in using the characteristic functions and appealing to (A.27). We need to show

|φ_{x_n} − φ_y| → 0,

where we use the shorthand notation φ ≡ φ(λ) for ease. First,

|φ_{x_n} − φ_y| ≤ |φ_{x_n} − φ_{y_{mn}}| + |φ_{y_{mn}} − φ_{y_m}| + |φ_{y_m} − φ_y|.    (A.34)

By condition (ii) and (A.27), the last term converges to zero, and by condition (i) and (A.27), the second term converges to zero, so we only need consider the first term in (A.34). Now, write

|φ_{x_n} − φ_{y_{mn}}| = |E(e^{iλ′x_n} − e^{iλ′y_{mn}})|
  ≤ E| e^{iλ′x_n} (1 − e^{iλ′(y_{mn}−x_n)}) |
  = E| 1 − e^{iλ′(y_{mn}−x_n)} |
  = E{ |1 − e^{iλ′(y_{mn}−x_n)}| I{|y_{mn} − x_n| < δ} } + E{ |1 − e^{iλ′(y_{mn}−x_n)}| I{|y_{mn} − x_n| ≥ δ} },

where δ > 0 and I{A} denotes the indicator function of the set A. Then, given λ and ε > 0, choose δ(ε) > 0 such that

|1 − e^{iλ′(y_{mn}−x_n)}| < ε

if |y_{mn} − x_n| < δ, so the first term is less than ε, an arbitrarily small constant. For the second term, note that

|1 − e^{iλ′(y_{mn}−x_n)}| ≤ 2

and we have

E{ |1 − e^{iλ′(y_{mn}−x_n)}| I{|y_{mn} − x_n| ≥ δ} } ≤ 2P{ |y_{mn} − x_n| ≥ δ },

which converges to zero as n → ∞ by property (iii). ∎


A.2 Central Limit Theorems

We will generally be concerned with the large-sample properties of estimators that turn out to be normally distributed as $n \to \infty$.

Definition A.5 A sequence of random variables $\{x_n\}$ is said to be asymptotically normal with mean $\mu_n$ and variance $\sigma_n^2$ if, as $n \to \infty$,
$$
\sigma_n^{-1}(x_n - \mu_n) \stackrel{d}{\to} z,
$$
where $z$ has the standard normal distribution. We shall abbreviate this as
$$
x_n \sim AN(\mu_n, \sigma_n^2), \tag{A.35}
$$
where $\sim$ will denote "is distributed as."

We state the important Central Limit Theorem, as follows.

Theorem A.3 Let $x_1, \ldots, x_n$ be independent and identically distributed with mean $\mu$ and variance $\sigma^2$. If $\bar{x}_n = (x_1 + \cdots + x_n)/n$ denotes the sample mean, then
$$
\bar{x}_n \sim AN(\mu, \sigma^2/n). \tag{A.36}
$$

Often, we will be concerned with a sequence of $k \times 1$ vectors $\{\boldsymbol{x}_n\}$. The following result is motivated by the Cramér–Wold device, Proposition A.1.

Proposition A.2. A sequence of random vectors is asymptotically normal, i.e.,
$$
\boldsymbol{x}_n \sim AN(\boldsymbol{\mu}_n, \Sigma_n), \tag{A.37}
$$
if and only if
$$
\boldsymbol{c}'\boldsymbol{x}_n \sim AN(\boldsymbol{c}'\boldsymbol{\mu}_n, \boldsymbol{c}'\Sigma_n\boldsymbol{c}) \tag{A.38}
$$
for all $\boldsymbol{c} \in \mathbb{R}^k$ and $\Sigma_n$ is positive definite.

In order to begin to consider what happens for dependent data in the limiting case, it is necessary to define, first of all, a particular kind of dependence known as $M$-dependence. We say that a time series $x_t$ is $M$-dependent if the set of values $x_s,\ s \le t$, is independent of the set of values $x_s,\ s \ge t + M + 1$, so time points separated by more than $M$ units are independent. A central limit theorem for such dependent processes, used in conjunction with the Basic Approximation Theorem, will allow us to develop large-sample distributional results for the sample mean $\bar{x}$ and the sample ACF $\hat\rho_x(h)$ in the stationary case.

In the arguments that follow, we often make use of the formula for the variance of $\bar{x}_n$ in the stationary case, namely,
$$
\operatorname{var}\bar{x}_n = n^{-1}\sum_{u=-(n-1)}^{(n-1)}\Big(1 - \frac{|u|}{n}\Big)\gamma(u), \tag{A.39}
$$


which was established in (1.33) on page 28. We shall also use the fact that, for
$$
\sum_{u=-\infty}^{\infty}|\gamma(u)| < \infty,
$$
we would have, by dominated convergence,²
$$
n\operatorname{var}\bar{x}_n \to \sum_{u=-\infty}^{\infty}\gamma(u), \tag{A.40}
$$
because $|(1 - |u|/n)\gamma(u)| \le |\gamma(u)|$ and $(1 - |u|/n)\gamma(u) \to \gamma(u)$. We may now state the $M$-Dependent Central Limit Theorem as follows.

Theorem A.4 If $x_t$ is a strictly stationary $M$-dependent sequence of random variables with mean zero and autocovariance function $\gamma(\cdot)$ and if
$$
V_M = \sum_{u=-M}^{M}\gamma(u), \tag{A.41}
$$
where $V_M \ne 0$, then
$$
\bar{x}_n \sim AN(0, V_M/n). \tag{A.42}
$$

Proof. To prove the theorem, using Theorem A.2, the Basic Approximation Theorem, we may construct a sequence of variables $y_{mn}$ approximating
$$
n^{1/2}\bar{x}_n = n^{-1/2}\sum_{t=1}^{n}x_t
$$
in the dependent case and then simply verify conditions (i), (ii), and (iii) of Theorem A.2. For $m > 2M$, we may first consider the approximation
$$
\begin{aligned}
y_{mn} &= n^{-1/2}\big[(x_1 + \cdots + x_{m-M}) + (x_{m+1} + \cdots + x_{2m-M})\\
&\qquad\quad + (x_{2m+1} + \cdots + x_{3m-M}) + \cdots + (x_{(r-1)m+1} + \cdots + x_{rm-M})\big]\\
&= n^{-1/2}(z_1 + z_2 + \cdots + z_r),
\end{aligned}
$$
where $r = [n/m]$, with $[n/m]$ denoting the greatest integer less than or equal to $n/m$. This approximation contains only part of $n^{1/2}\bar{x}_n$, but the random

² Dominated convergence technically relates to convergent sequences (with respect to a sigma-additive measure $\mu$) of measurable functions $f_n \to f$ bounded by an integrable function $g$, $\int g\,d\mu < \infty$. For such a sequence, $\int f_n\,d\mu \to \int f\,d\mu$. For the case in point, take $f_n(u) = (1 - |u|/n)\gamma(u)$ for $|u| < n$ and as zero for $|u| \ge n$. Take $\mu(u) = 1$, $u = \pm 1, \pm 2, \ldots$, to be counting measure.


variables $z_1, z_2, \ldots, z_r$ are independent because they are separated by more than $M$ time points, e.g., $m + 1 - (m - M) = M + 1$ points separate $z_1$ and $z_2$. Because of strict stationarity, $z_1, z_2, \ldots, z_r$ are identically distributed with zero means and variances
$$
S_{m-M} = \sum_{|u| \le M}(m - M - |u|)\gamma(u)
$$
by a computation similar to that producing (A.39). We now verify that the conditions of the Basic Approximation Theorem hold.

(i) Applying the Central Limit Theorem to the sum $y_{mn}$ gives
$$
y_{mn} = n^{-1/2}\sum_{i=1}^{r}z_i = (n/r)^{-1/2}\,r^{-1/2}\sum_{i=1}^{r}z_i.
$$
Because $(n/r)^{-1/2} \to m^{-1/2}$ and
$$
r^{-1/2}\sum_{i=1}^{r}z_i \stackrel{d}{\to} N(0, S_{m-M}),
$$
it follows from (A.30) that
$$
y_{mn} \stackrel{d}{\to} y_m \sim N(0, S_{m-M}/m)
$$

as $n \to \infty$, for a fixed $m$.

(ii) Note that as $m \to \infty$, $S_{m-M}/m \to V_M$ using dominated convergence, where $V_M$ is defined in (A.41). Hence, the characteristic function of $y_m$, say,
$$
\phi_m(\lambda) = \exp\Big\{-\frac{1}{2}\lambda^2\frac{S_{m-M}}{m}\Big\} \to \exp\Big\{-\frac{1}{2}\lambda^2 V_M\Big\},
$$
as $m \to \infty$, which is the characteristic function of a random variable $y \sim N(0, V_M)$, and the result follows because of (A.27).

(iii) To verify the last condition of the BAT theorem,
$$
\begin{aligned}
n^{1/2}\bar{x}_n - y_{mn} &= n^{-1/2}\big[(x_{m-M+1} + \cdots + x_m) + (x_{2m-M+1} + \cdots + x_{2m})\\
&\qquad\quad + \cdots + (x_{(r-1)m-M+1} + \cdots + x_{(r-1)m}) + (x_{rm-M+1} + \cdots + x_n)\big]\\
&= n^{-1/2}(w_1 + w_2 + \cdots + w_r),
\end{aligned}
$$
so the error is expressed as a scaled sum of iid variables with variance $S_M$ for the first $r - 1$ variables and


$$
\operatorname{var}(w_r) = \sum_{|u| \le m-M}\big(n - [n/m]m + M - |u|\big)\gamma(u) \le \sum_{|u| \le m-M}(m + M - |u|)\gamma(u).
$$
Hence,
$$
\operatorname{var}\big[n^{1/2}\bar{x} - y_{mn}\big] = n^{-1}\big[(r - 1)S_M + \operatorname{var}w_r\big],
$$
which converges to $m^{-1}S_M$ as $n \to \infty$. Because $m^{-1}S_M \to 0$ as $m \to \infty$, condition (iii) holds by the Tchebycheff inequality. ∎
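As a quick illustration of Theorem A.4 (this sketch is not part of the original text), an MA(1) series $x_t = w_t + \theta w_{t-1}$ is 1-dependent with $V_M = \gamma(0) + 2\gamma(1) = \sigma_w^2(1 + \theta)^2$. The R code below, with arbitrary parameter choices, compares $n\operatorname{var}(\bar{x}_n)$ with $V_M$ over repeated simulations.

```r
set.seed(1)
theta <- 0.6; n <- 500
VM <- (1 + theta)^2          # gamma(0) + 2*gamma(1) for an MA(1) with sigma_w^2 = 1
xbar <- replicate(2000, mean(arima.sim(model = list(ma = theta), n = n)))
c(empirical = n * var(xbar), theoretical = VM)   # both close to 2.56
qqnorm(sqrt(n) * xbar); qqline(sqrt(n) * xbar)   # sqrt(n)*xbar is approximately normal
```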

A.3 The Mean and Autocorrelation Functions

The background material in the previous two sections can be used to develop the asymptotic properties of the sample mean and ACF used to evaluate statistical significance. In particular, we are interested in verifying Property 1.1.

We begin with the distribution of the sample mean $\bar{x}_n$, noting that (A.40) suggests a form for the limiting variance. In all of the asymptotics, we will use the assumption that $x_t$ is a linear process, as defined in Definition 1.12, but with the added condition that $\{w_t\}$ is iid. That is, throughout this section we assume
$$
x_t = \mu_x + \sum_{j=-\infty}^{\infty}\psi_j w_{t-j}, \tag{A.43}
$$
where $w_t \sim \text{iid}(0, \sigma_w^2)$, and the coefficients satisfy
$$
\sum_{j=-\infty}^{\infty}|\psi_j| < \infty. \tag{A.44}
$$

Before proceeding further, we should note that the exact sampling distribution of $\bar{x}_n$ is available if the distribution of the underlying vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)'$ is multivariate normal. Then, $\bar{x}_n$ is just a linear combination of jointly normal variables that will have the normal distribution
$$
\bar{x}_n \sim N\Big(\mu_x,\ n^{-1}\sum_{|u|<n}\Big(1 - \frac{|u|}{n}\Big)\gamma_x(u)\Big), \tag{A.45}
$$
by (A.39). In the case where the $x_t$ are not jointly normally distributed, we have the following theorem.

Theorem A.5 If $x_t$ is a linear process of the form (A.43) and $\sum_j\psi_j \ne 0$, then
$$
\bar{x}_n \sim AN(\mu_x, n^{-1}V), \tag{A.46}
$$
where
$$
V = \sum_{h=-\infty}^{\infty}\gamma_x(h) = \sigma_w^2\Big(\sum_{j=-\infty}^{\infty}\psi_j\Big)^2 \tag{A.47}
$$
and $\gamma_x(\cdot)$ is the autocovariance function of $x_t$.

Proof. To prove the above, we can again use the Basic Approximation Theorem A.2 by first defining the strictly stationary $2m$-dependent linear process with finite limits
$$
x_t^m = \sum_{j=-m}^{m}\psi_j w_{t-j}
$$
as an approximation to $x_t$ to use in the approximating mean
$$
\bar{x}_{n,m} = n^{-1}\sum_{t=1}^{n}x_t^m.
$$
Then, take
$$
y_{mn} = n^{1/2}(\bar{x}_{n,m} - \mu_x)
$$
as an approximation to $n^{1/2}(\bar{x}_n - \mu_x)$.

(i) Applying Theorem A.4, we have
$$
y_{mn} \stackrel{d}{\to} y_m \sim N(0, V_m),
$$
as $n \to \infty$, where
$$
V_m = \sum_{h=-2m}^{2m}\gamma_x(h) = \sigma_w^2\Big(\sum_{j=-m}^{m}\psi_j\Big)^2.
$$
To verify the above, we note that for the general linear process with infinite limits, (1.30) implies that
$$
\sum_{h=-\infty}^{\infty}\gamma_x(h) = \sigma_w^2\sum_{h=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j = \sigma_w^2\Big(\sum_{j=-\infty}^{\infty}\psi_j\Big)^2,
$$
so taking the special case $\psi_j = 0$ for $|j| > m$, we obtain $V_m$.

(ii) Because $V_m \to V$ in (A.47) as $m \to \infty$, we may use the same characteristic function argument as under (ii) in the proof of Theorem A.4 to note that
$$
y_m \stackrel{d}{\to} y \sim N(0, V),
$$
where $V$ is given by (A.47).

(iii) Finally,


$$
\operatorname{var}\big[n^{1/2}(\bar{x}_n - \mu_x) - y_{mn}\big] = n\operatorname{var}\Big(n^{-1}\sum_{t=1}^{n}\sum_{|j|>m}\psi_j w_{t-j}\Big) = \sigma_w^2\Big(\sum_{|j|>m}\psi_j\Big)^2 \to 0
$$
as $m \to \infty$. ∎
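As a numerical check on Theorem A.5 (not in the original text), one can use a causal AR(1), for which $\sum_j\psi_j = 1/(1 - \phi)$ and hence $V = \sigma_w^2/(1 - \phi)^2$; the R sketch below uses arbitrary parameter values.

```r
set.seed(2)
phi <- 0.7; n <- 500
V <- 1 / (1 - phi)^2          # sigma_w^2 * (sum_j psi_j)^2 with sigma_w^2 = 1
xbar <- replicate(2000, mean(arima.sim(model = list(ar = phi), n = n)))
c(empirical = n * var(xbar), theoretical = V)
```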

In order to develop the sampling distribution of the sample autocovariance function, $\hat\gamma_x(h)$, and the sample autocorrelation function, $\hat\rho_x(h)$, we need to develop some idea as to the mean and variance of $\hat\gamma_x(h)$ under some reasonable assumptions. These computations for $\hat\gamma_x(h)$ are messy, and we consider the comparable quantity
$$
\tilde\gamma_x(h) = n^{-1}\sum_{t=1}^{n}(x_{t+h} - \mu_x)(x_t - \mu_x) \tag{A.48}
$$
as an approximation. By Problem 1.30,
$$
n^{1/2}\big[\tilde\gamma_x(h) - \hat\gamma_x(h)\big] = o_p(1),
$$
so that limiting distributional results proved for $n^{1/2}\tilde\gamma_x(h)$ will hold for $n^{1/2}\hat\gamma_x(h)$ by (A.32).

We begin by proving formulas for the variance and for the limiting variance of $\tilde\gamma_x(h)$ under the assumptions that $x_t$ is a linear process of the form (A.43), satisfying (A.44) with the white noise variates $w_t$ having variance $\sigma_w^2$ as before, but also required to have fourth moments satisfying
$$
E(w_t^4) = \eta\sigma_w^4 < \infty, \tag{A.49}
$$
where $\eta$ is some constant. We seek results comparable with (A.39) and (A.40) for $\tilde\gamma_x(h)$. To ease the notation, we will henceforth drop the subscript $x$ from the notation.

Using (A.48), $E[\tilde\gamma(h)] = \gamma(h)$. Under the above assumptions, we show now that, for $p, q = 0, 1, 2, \ldots$,
$$
\operatorname{cov}\big[\tilde\gamma(p), \tilde\gamma(q)\big] = n^{-1}\sum_{u=-(n-1)}^{(n-1)}\Big(1 - \frac{|u|}{n}\Big)V_u, \tag{A.50}
$$
where
$$
V_u = \gamma(u)\gamma(u + p - q) + \gamma(u + p)\gamma(u - q) + (\eta - 3)\sigma_w^4\sum_i\psi_{i+u+q}\psi_{i+u}\psi_{i+p}\psi_i. \tag{A.51}
$$


The absolute summability of the $\psi_j$ can then be shown to imply the absolute summability of the $V_u$.³ Thus, the dominated convergence theorem implies
$$
\begin{aligned}
n\operatorname{cov}\big[\tilde\gamma(p), \tilde\gamma(q)\big] &\to \sum_{u=-\infty}^{\infty}V_u\\
&= (\eta - 3)\gamma(p)\gamma(q) + \sum_{u=-\infty}^{\infty}\big[\gamma(u)\gamma(u + p - q) + \gamma(u + p)\gamma(u - q)\big].
\end{aligned} \tag{A.52}
$$

³ Note: $\sum_{j=-\infty}^{\infty}|a_j| < \infty$ and $\sum_{j=-\infty}^{\infty}|b_j| < \infty$ implies $\sum_{j=-\infty}^{\infty}|a_j b_j| < \infty$.

To verify (A.50) is somewhat tedious, so we only go partially through the calculations, leaving the repetitive details to the reader. First, rewrite (A.43) as
$$
x_t = \mu + \sum_{i=-\infty}^{\infty}\psi_{t-i}w_i,
$$
so that
$$
E[\tilde\gamma(p)\tilde\gamma(q)] = n^{-2}\sum_{s,t}\sum_{i,j,k,\ell}\psi_{s+p-i}\psi_{s-j}\psi_{t+q-k}\psi_{t-\ell}\,E(w_i w_j w_k w_\ell).
$$
Then, evaluate, using the easily verified properties of the $w_t$ series,
$$
E(w_i w_j w_k w_\ell) =
\begin{cases}
\eta\sigma_w^4 & \text{if } i = j = k = \ell,\\
\sigma_w^4 & \text{if } i = j \ne k = \ell,\\
0 & \text{if } i \ne j,\ i \ne k \text{ and } i \ne \ell.
\end{cases}
$$
To apply the rules, we break the sum over the subscripts $i, j, k, \ell$ into four terms, namely,
$$
\sum_{i,j,k,\ell} = \sum_{i=j=k=\ell} + \sum_{i=j\ne k=\ell} + \sum_{i=k\ne j=\ell} + \sum_{i=\ell\ne j=k} = S_1 + S_2 + S_3 + S_4.
$$
Now,
$$
S_1 = \eta\sigma_w^4\sum_i\psi_{s+p-i}\psi_{s-i}\psi_{t+q-i}\psi_{t-i} = \eta\sigma_w^4\sum_i\psi_{i+s-t+p}\psi_{i+s-t}\psi_{i+q}\psi_i,
$$
where we have let $i' = t - i$ to get the final form. For the second term,
$$
\begin{aligned}
S_2 &= \sum_{i=j\ne k=\ell}\psi_{s+p-i}\psi_{s-j}\psi_{t+q-k}\psi_{t-\ell}\,E(w_i w_j w_k w_\ell)\\
&= \sum_{i\ne k}\psi_{s+p-i}\psi_{s-i}\psi_{t+q-k}\psi_{t-k}\,E(w_i^2)E(w_k^2).
\end{aligned}
$$
Then, using the fact that


$$
\sum_{i\ne k} = \sum_{i,k} - \sum_{i=k},
$$
we have
$$
\begin{aligned}
S_2 &= \sigma_w^4\sum_{i,k}\psi_{s+p-i}\psi_{s-i}\psi_{t+q-k}\psi_{t-k} - \sigma_w^4\sum_i\psi_{s+p-i}\psi_{s-i}\psi_{t+q-i}\psi_{t-i}\\
&= \gamma(p)\gamma(q) - \sigma_w^4\sum_i\psi_{i+s-t+p}\psi_{i+s-t}\psi_{i+q}\psi_i,
\end{aligned}
$$

letting $i' = s - i$, $k' = t - k$ in the first term and $i' = s - i$ in the second term. Repeating the argument for $S_3$ and $S_4$ and substituting into the covariance expression yields
$$
\begin{aligned}
E[\tilde\gamma(p)\tilde\gamma(q)] = n^{-2}\sum_{s,t}\Big[&\gamma(p)\gamma(q) + \gamma(s-t)\gamma(s-t+p-q) + \gamma(s-t+p)\gamma(s-t-q)\\
&+ (\eta - 3)\sigma_w^4\sum_i\psi_{i+s-t+p}\psi_{i+s-t}\psi_{i+q}\psi_i\Big].
\end{aligned}
$$
Then, letting $u = s - t$ and subtracting $E[\tilde\gamma(p)]E[\tilde\gamma(q)] = \gamma(p)\gamma(q)$ from the summation leads to the result (A.51). Summing (A.51) over $u$ and applying dominated convergence leads to (A.52).

The above results for the variances and covariances of the approximating statistics $\tilde\gamma(\cdot)$ enable proving the following central limit theorem for the autocovariance functions $\hat\gamma(\cdot)$.

Theorem A.6 If $x_t$ is a stationary linear process of the form (A.43) satisfying the fourth moment condition (A.49), then, for fixed $K$,
$$
\begin{pmatrix}\hat\gamma(0)\\ \hat\gamma(1)\\ \vdots\\ \hat\gamma(K)\end{pmatrix}
\sim AN\left(\begin{pmatrix}\gamma(0)\\ \gamma(1)\\ \vdots\\ \gamma(K)\end{pmatrix},\ n^{-1}V\right),
$$
where $V$ is the matrix with elements given by
$$
v_{pq} = (\eta - 3)\gamma(p)\gamma(q) + \sum_{u=-\infty}^{\infty}\big[\gamma(u)\gamma(u - p + q) + \gamma(u + q)\gamma(u - p)\big]. \tag{A.53}
$$

Proof. It suffices to show the result for the approximate autocovariance (A.48), $\tilde\gamma(\cdot)$, by the remark given below it (see also Problem 1.30). First, define the strictly stationary $(2m + K)$-dependent $(K + 1) \times 1$ vector


$$
\boldsymbol{y}_t^m = \begin{pmatrix}(x_t^m - \mu)^2\\ (x_{t+1}^m - \mu)(x_t^m - \mu)\\ \vdots\\ (x_{t+K}^m - \mu)(x_t^m - \mu)\end{pmatrix},
$$
where
$$
x_t^m = \mu + \sum_{j=-m}^{m}\psi_j w_{t-j}
$$
is the usual approximation. The sample mean of the above vector is
$$
\bar{\boldsymbol{y}}_{mn} = n^{-1}\sum_{t=1}^{n}\boldsymbol{y}_t^m = \begin{pmatrix}\tilde\gamma_{mn}(0)\\ \tilde\gamma_{mn}(1)\\ \vdots\\ \tilde\gamma_{mn}(K)\end{pmatrix},
$$
where
$$
\tilde\gamma_{mn}(h) = n^{-1}\sum_{t=1}^{n}(x_{t+h}^m - \mu)(x_t^m - \mu)
$$
denotes the sample autocovariance of the approximating series. Also,
$$
E\boldsymbol{y}_t^m = \begin{pmatrix}\gamma_m(0)\\ \gamma_m(1)\\ \vdots\\ \gamma_m(K)\end{pmatrix},
$$
where $\gamma_m(h)$ is the theoretical covariance function of the series $x_t^m$. Then, consider the vector
$$
\boldsymbol{y}_{mn} = n^{1/2}\big[\bar{\boldsymbol{y}}_{mn} - E(\bar{\boldsymbol{y}}_{mn})\big]
$$
as an approximation to
$$
\boldsymbol{y}_n = n^{1/2}\left[\begin{pmatrix}\tilde\gamma(0)\\ \tilde\gamma(1)\\ \vdots\\ \tilde\gamma(K)\end{pmatrix} - \begin{pmatrix}\gamma(0)\\ \gamma(1)\\ \vdots\\ \gamma(K)\end{pmatrix}\right],
$$
where $E(\bar{\boldsymbol{y}}_{mn})$ is the same as $E(\boldsymbol{y}_t^m)$ given above. The elements of the vector approximation $\boldsymbol{y}_{mn}$ are clearly $n^{1/2}(\tilde\gamma_{mn}(h) - \gamma_m(h))$. Note that the elements of $\boldsymbol{y}_n$ are based on the linear process $x_t$, whereas the elements of $\boldsymbol{y}_{mn}$ are based on the $m$-dependent linear process $x_t^m$. To obtain a limiting distribution for $\boldsymbol{y}_n$, we apply the Basic Approximation Theorem A.2 using $\boldsymbol{y}_{mn}$ as our approximation. We now verify (i), (ii), and (iii) of Theorem A.2.


(i) First, let $\boldsymbol{c}$ be a $(K+1)\times 1$ vector of constants, and apply the central limit theorem to the $(2m + K)$-dependent series $\boldsymbol{c}'\boldsymbol{y}_{mn}$ using the Cramér–Wold device (A.28). We obtain
$$
\boldsymbol{c}'\boldsymbol{y}_{mn} = n^{1/2}\boldsymbol{c}'\big[\bar{\boldsymbol{y}}_{mn} - E(\bar{\boldsymbol{y}}_{mn})\big] \stackrel{d}{\to} \boldsymbol{c}'\boldsymbol{y}_m \sim N(0, \boldsymbol{c}'V_m\boldsymbol{c}),
$$
as $n \to \infty$, where $V_m$ is a matrix containing the finite analogs of the elements $v_{pq}$ defined in (A.53).

(ii) Note that, since $V_m \to V$ as $m \to \infty$, it follows that
$$
\boldsymbol{c}'\boldsymbol{y}_m \stackrel{d}{\to} \boldsymbol{c}'\boldsymbol{y} \sim N(0, \boldsymbol{c}'V\boldsymbol{c}),
$$
so, by the Cramér–Wold device, the limiting $(K+1)\times 1$ multivariate normal variable is $N(\boldsymbol{0}, V)$.

(iii) For this condition, we can focus on the element-by-element components of
$$
P\big\{|\boldsymbol{y}_n - \boldsymbol{y}_{mn}| > \epsilon\big\}.
$$
For example, using the Tchebycheff inequality, the $h$-th element of the probability statement can be bounded by
$$
n\epsilon^{-2}\operatorname{var}\big(\tilde\gamma(h) - \tilde\gamma_{mn}(h)\big) = \epsilon^{-2}\big\{n\operatorname{var}\tilde\gamma(h) + n\operatorname{var}\tilde\gamma_{mn}(h) - 2n\operatorname{cov}[\tilde\gamma(h), \tilde\gamma_{mn}(h)]\big\}.
$$
Using the results that led to (A.52), we see that the preceding expression approaches
$$
(v_{hh} + v_{hh} - 2v_{hh})/\epsilon^2 = 0,
$$
as $m, n \to \infty$. ∎

To obtain a result comparable to Theorem A.6 for the autocorrelation function ACF, we note the following theorem.

Theorem A.7 If $x_t$ is a stationary linear process of the form (1.29) satisfying the fourth moment condition (A.49), then for fixed $K$,
$$
\begin{pmatrix}\hat\rho(1)\\ \vdots\\ \hat\rho(K)\end{pmatrix}
\sim AN\left(\begin{pmatrix}\rho(1)\\ \vdots\\ \rho(K)\end{pmatrix},\ n^{-1}W\right),
$$
where $W$ is the matrix with elements given by
$$
\begin{aligned}
w_{pq} &= \sum_{u=-\infty}^{\infty}\big[\rho(u+p)\rho(u+q) + \rho(u-p)\rho(u+q) + 2\rho(p)\rho(q)\rho^2(u)\\
&\qquad\qquad - 2\rho(p)\rho(u)\rho(u+q) - 2\rho(q)\rho(u)\rho(u+p)\big]\\
&= \sum_{u=1}^{\infty}\big[\rho(u+p) + \rho(u-p) - 2\rho(p)\rho(u)\big]\big[\rho(u+q) + \rho(u-q) - 2\rho(q)\rho(u)\big],
\end{aligned} \tag{A.54}
$$
where the last form is more convenient.

Proof. To prove the theorem, we use the delta method⁴ for the limiting distribution of a function of the form
$$
\boldsymbol{g}(x_0, x_1, \ldots, x_K) = (x_1/x_0, \ldots, x_K/x_0)',
$$
where $x_h = \hat\gamma(h)$, for $h = 0, 1, \ldots, K$.

⁴ The delta method states that if a $k$-dimensional vector sequence $\boldsymbol{x}_n \sim AN(\boldsymbol{\mu}, a_n^2\Sigma)$, with $a_n \to 0$, and $\boldsymbol{g}(\boldsymbol{x})$ is an $r \times 1$ continuously differentiable vector function of $\boldsymbol{x}$, then $\boldsymbol{g}(\boldsymbol{x}_n) \sim AN(\boldsymbol{g}(\boldsymbol{\mu}), a_n^2 D\Sigma D')$, where $D$ is the $r \times k$ matrix with elements $d_{ij} = \partial g_i(\boldsymbol{x})/\partial x_j\big|_{\boldsymbol{\mu}}$.

Hence, using the delta method and Theorem A.6,
$$
\boldsymbol{g}\big(\hat\gamma(0), \hat\gamma(1), \ldots, \hat\gamma(K)\big) = (\hat\rho(1), \ldots, \hat\rho(K))'
$$
is asymptotically normal with mean vector $(\rho(1), \ldots, \rho(K))'$ and covariance matrix
$$
n^{-1}W = n^{-1}DVD',
$$
where $V$ is defined by (A.53) and $D$ is the $K \times (K + 1)$ matrix of partial derivatives
$$
D = \frac{1}{x_0^2}\begin{pmatrix}
-x_1 & x_0 & 0 & \cdots & 0\\
-x_2 & 0 & x_0 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
-x_K & 0 & 0 & \cdots & x_0
\end{pmatrix}.
$$
Substituting $\gamma(h)$ for $x_h$, we note that $D$ can be written as the patterned matrix
$$
D = \frac{1}{\gamma(0)}\big(-\boldsymbol{\rho}\ \ I_K\big),
$$
where $\boldsymbol{\rho} = (\rho(1), \rho(2), \ldots, \rho(K))'$ is the $K \times 1$ vector of autocorrelations and $I_K$ is the $K \times K$ identity matrix. Then, it follows from writing the matrix $V$ in the partitioned form
$$
V = \begin{pmatrix}v_{00} & \boldsymbol{v}_1'\\ \boldsymbol{v}_1 & V_{22}\end{pmatrix}
$$
that
$$
W = \gamma^{-2}(0)\big[v_{00}\boldsymbol{\rho}\boldsymbol{\rho}' - \boldsymbol{\rho}\boldsymbol{v}_1' - \boldsymbol{v}_1\boldsymbol{\rho}' + V_{22}\big],
$$
where $\boldsymbol{v}_1 = (v_{10}, v_{20}, \ldots, v_{K0})'$ and $V_{22} = \{v_{pq};\ p, q = 1, \ldots, K\}$. Hence,


$$
\begin{aligned}
w_{pq} &= \gamma^{-2}(0)\big[v_{pq} - \rho(p)v_{0q} - \rho(q)v_{p0} + \rho(p)\rho(q)v_{00}\big]\\
&= \sum_{u=-\infty}^{\infty}\big[\rho(u)\rho(u - p + q) + \rho(u - p)\rho(u + q) + 2\rho(p)\rho(q)\rho^2(u)\\
&\qquad\qquad - 2\rho(p)\rho(u)\rho(u + q) - 2\rho(q)\rho(u)\rho(u - p)\big].
\end{aligned}
$$
Interchanging the summations, we get the $w_{pq}$ specified in the statement of the theorem, finishing the proof. ∎

Specializing the theorem to the case of interest in this chapter, we note that if $\{x_t\}$ is iid with finite fourth moment, then $w_{pq} = 1$ for $p = q$ and is zero otherwise. In this case, for $h = 1, \ldots, K$, the $\hat\rho(h)$ are asymptotically independent and jointly normal with
$$
\hat\rho(h) \sim AN(0, n^{-1}). \tag{A.55}
$$
This justifies the use of (1.36) and the discussion below it as a method for testing whether a series is white noise.
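The practical content of (A.55) is that, for an iid series, roughly 95% of the sample autocorrelations should fall within $\pm 1.96/\sqrt{n}$. A small R check along these lines (not part of the original text; the sample size and lag range are arbitrary):

```r
set.seed(3)
n <- 200
w <- rnorm(n)                                    # iid noise
r <- acf(w, lag.max = 20, plot = FALSE)$acf[-1]  # rho-hat(1), ..., rho-hat(20)
c(sd(r), 1/sqrt(n))                              # spread of rho-hat(h) is close to n^(-1/2)
mean(abs(r) > 1.96/sqrt(n))                      # roughly 5% exceed the usual bounds
```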

For the cross-correlation, it has been noted that the same kind of approximation holds, and we quote the following theorem for the bivariate case, which can be proved using similar arguments (see Brockwell and Davis, 1991, p. 410).

Theorem A.8 If
$$
x_t = \sum_{j=-\infty}^{\infty}\alpha_j w_{t-j,1} \quad\text{and}\quad y_t = \sum_{j=-\infty}^{\infty}\beta_j w_{t-j,2}
$$
are two linear processes with absolutely summable coefficients and the two white noise sequences are iid and independent of each other with variances $\sigma_1^2$ and $\sigma_2^2$, then for $h \ge 0$,
$$
\hat\rho_{xy}(h) \sim AN\Big(\rho_{xy}(h),\ n^{-1}\sum_j\rho_x(j)\rho_y(j)\Big) \tag{A.56}
$$
and the joint distribution of $(\hat\rho_{xy}(h), \hat\rho_{xy}(k))'$ is asymptotically normal with mean vector zero and
$$
\operatorname{cov}\big(\hat\rho_{xy}(h), \hat\rho_{xy}(k)\big) = n^{-1}\sum_j\rho_x(j)\rho_y(j + k - h). \tag{A.57}
$$

Again, specializing to the case of interest in this chapter, as long as at least one of the two series is white (iid) noise, we obtain
$$
\hat\rho_{xy}(h) \sim AN(0, n^{-1}), \tag{A.58}
$$
which justifies Property 1.2.
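Similarly, (A.58) says that the sample cross-correlations between a white noise series and an independent series fluctuate within $\pm 1.96/\sqrt{n}$. A brief R illustration (hypothetical simulated series, not from the text):

```r
set.seed(4)
n <- 200
x <- rnorm(n)                                    # white (iid) noise
y <- arima.sim(model = list(ar = 0.8), n = n)    # an independent AR(1) series
r <- ccf(x, y, lag.max = 15, plot = FALSE)$acf
mean(abs(r) > 1.96/sqrt(n))                      # about 5% of lags exceed the bounds
```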


Appendix B
Time Domain Theory

B.1 Hilbert Spaces and the Projection Theorem

Most of the material on mean square estimation and regression can be embedded in a more general setting involving an inner product space that is also complete (that is, satisfies the Cauchy condition). Two examples of inner products are $E(xy^*)$, where the elements are random variables, and $\sum x_i y_i^*$, where the elements are sequences. These examples include the possibility of complex elements, in which case $*$ denotes conjugation. We denote an inner product, in general, by the notation $\langle x, y\rangle$. Now, define an inner product space by its properties, namely,

(i) $\langle x, y\rangle = \langle y, x\rangle^*$
(ii) $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$
(iii) $\langle \alpha x, y\rangle = \alpha\langle x, y\rangle$
(iv) $\langle x, x\rangle = \|x\|^2 \ge 0$
(v) $\langle x, x\rangle = 0$ iff $x = 0$.

We introduced the notation $\|\cdot\|$ for the norm or distance in property (iv). The norm satisfies the triangle inequality
$$
\|x + y\| \le \|x\| + \|y\| \tag{B.1}
$$
and the Cauchy–Schwarz inequality
$$
|\langle x, y\rangle|^2 \le \|x\|^2\|y\|^2, \tag{B.2}
$$
which we have seen before for random variables in (A.35). Now, a Hilbert space, $\mathcal{H}$, is defined as an inner product space with the Cauchy property. In other words, $\mathcal{H}$ is a complete inner product space. This means that every Cauchy sequence converges in norm; that is, $x_n \to x \in \mathcal{H}$ if and only if $\|x_n - x_m\| \to 0$ as $m, n \to \infty$. This is just the $L^2$ completeness Theorem A.1 for random variables.


For a broad overview of Hilbert space techniques that are useful in statistical inference and in probability, see Small and McLeish (1994). Also, Brockwell and Davis (1991, Chapter 2) is a nice summary of Hilbert space techniques that are useful in time series analysis. In our discussions, we mainly use the projection theorem (Theorem B.1) and the associated orthogonality principle as a means for solving various kinds of linear estimation problems.

Theorem B.1 (Projection Theorem) Let $\mathcal{M}$ be a closed subspace of the Hilbert space $\mathcal{H}$ and let $y$ be an element in $\mathcal{H}$. Then, $y$ can be uniquely represented as
$$
y = \hat{y} + z, \tag{B.3}
$$
where $\hat{y}$ belongs to $\mathcal{M}$ and $z$ is orthogonal to $\mathcal{M}$; that is, $\langle z, w\rangle = 0$ for all $w$ in $\mathcal{M}$. Furthermore, the point $\hat{y}$ is closest to $y$ in the sense that, for any $w$ in $\mathcal{M}$, $\|y - w\| \ge \|y - \hat{y}\|$, where equality holds if and only if $w = \hat{y}$.

We note that (B.3) and the statement following it yield the orthogonality property
$$
\langle y - \hat{y}, w\rangle = 0 \tag{B.4}
$$
for any $w$ belonging to $\mathcal{M}$, which can sometimes be used easily to find an expression for the projection. The norm of the error can be written as
$$
\|y - \hat{y}\|^2 = \langle y - \hat{y}, y - \hat{y}\rangle = \langle y - \hat{y}, y\rangle - \langle y - \hat{y}, \hat{y}\rangle = \langle y - \hat{y}, y\rangle \tag{B.5}
$$
because of orthogonality.

Using the notation of Theorem B.1, we call the mapping $P_{\mathcal{M}}y = \hat{y}$, for $y \in \mathcal{H}$, the projection mapping of $\mathcal{H}$ onto $\mathcal{M}$. In addition, the closed span of a finite set $\{x_1, \ldots, x_n\}$ of elements in a Hilbert space, $\mathcal{H}$, is defined to be the set of all linear combinations $w = a_1 x_1 + \cdots + a_n x_n$, where $a_1, \ldots, a_n$ are scalars. This subspace of $\mathcal{H}$ is denoted by $\mathcal{M} = \overline{\operatorname{sp}}\{x_1, \ldots, x_n\}$. By the projection theorem, the projection of $y \in \mathcal{H}$ onto $\mathcal{M}$ is unique and given by
$$
P_{\mathcal{M}}y = a_1 x_1 + \cdots + a_n x_n,
$$
where $\{a_1, \ldots, a_n\}$ are found using the orthogonality principle
$$
\langle y - P_{\mathcal{M}}y,\ x_j\rangle = 0, \qquad j = 1, \ldots, n.
$$
Evidently, $\{a_1, \ldots, a_n\}$ can be obtained by solving
$$
\sum_{i=1}^{n}a_i\langle x_i, x_j\rangle = \langle y, x_j\rangle, \qquad j = 1, \ldots, n. \tag{B.6}
$$
When the elements of $\mathcal{H}$ are vectors, this problem is the linear regression problem.


Example B.1 Linear Regression Analysis

For the regression model introduced in §2.2, we want to find the regression coefficients $\beta_i$ that minimize the residual sum of squares. Consider the vectors $\boldsymbol{y} = (y_1, \ldots, y_n)'$ and $\boldsymbol{z}_i = (z_{1i}, \ldots, z_{ni})'$, for $i = 1, \ldots, q$, and the inner product
$$
\langle \boldsymbol{z}_i, \boldsymbol{y}\rangle = \sum_{t=1}^{n}z_{ti}y_t = \boldsymbol{z}_i'\boldsymbol{y}.
$$
We solve the problem of finding a projection of the observed $\boldsymbol{y}$ on the linear space spanned by $\beta_1\boldsymbol{z}_1 + \cdots + \beta_q\boldsymbol{z}_q$, that is, linear combinations of the $\boldsymbol{z}_i$. The orthogonality principle gives
$$
\Big\langle \boldsymbol{y} - \sum_{i=1}^{q}\beta_i\boldsymbol{z}_i,\ \boldsymbol{z}_j\Big\rangle = 0
$$
for $j = 1, \ldots, q$. Writing the orthogonality condition, as in (B.6), in vector form gives
$$
\boldsymbol{y}'\boldsymbol{z}_j = \sum_{i=1}^{q}\beta_i\boldsymbol{z}_i'\boldsymbol{z}_j, \qquad j = 1, \ldots, q, \tag{B.7}
$$
which can be written in the usual matrix form by letting $Z = (\boldsymbol{z}_1, \ldots, \boldsymbol{z}_q)$, which is assumed to be full rank. That is, (B.7) can be written as
$$
\boldsymbol{y}'Z = \boldsymbol{\beta}'(Z'Z), \tag{B.8}
$$
where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_q)'$. Transposing both sides of (B.8) provides the solution for the coefficients,
$$
\boldsymbol{\beta} = (Z'Z)^{-1}Z'\boldsymbol{y}.
$$
The mean-square error in this case would be
$$
\Big\|\boldsymbol{y} - \sum_{i=1}^{q}\beta_i\boldsymbol{z}_i\Big\|^2 = \Big\langle \boldsymbol{y} - \sum_{i=1}^{q}\beta_i\boldsymbol{z}_i,\ \boldsymbol{y}\Big\rangle = \langle \boldsymbol{y}, \boldsymbol{y}\rangle - \sum_{i=1}^{q}\beta_i\langle \boldsymbol{z}_i, \boldsymbol{y}\rangle = \boldsymbol{y}'\boldsymbol{y} - \boldsymbol{\beta}'Z'\boldsymbol{y},
$$
which is in agreement with §2.2.
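The normal equations (B.7)–(B.8) obtained from the orthogonality principle are exactly the usual least squares equations; a minimal R sketch (simulated data, not from the text) verifies that solving them reproduces lm():

```r
set.seed(5)
n <- 50
Z <- cbind(1, rnorm(n), rnorm(n))            # n x q design matrix, assumed full rank
y <- drop(Z %*% c(2, -1, 0.5) + rnorm(n))
beta <- solve(t(Z) %*% Z, t(Z) %*% y)        # solves the normal equations (B.7)-(B.8)
cbind(beta, coef(lm(y ~ Z - 1)))             # agrees with least squares via lm()
sum(y^2) - drop(t(beta) %*% t(Z) %*% y)      # mean-square error  y'y - beta'Z'y
```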

The extra generality in the above approach hardly seems necessary in the finite dimensional case, where differentiation works perfectly well. It is convenient, however, in many cases to regard the elements of $\mathcal{H}$ as infinite dimensional, so that the orthogonality principle becomes of use. For example, the projection of the process $\{x_t;\ t = 0, \pm 1, \pm 2, \ldots\}$ on the linear manifold spanned by all filtered convolutions of the form
$$
\hat{x}_t = \sum_{k=-\infty}^{\infty}a_k x_{t-k}
$$
would be in this form.

There are some useful results, which we state without proof, pertaining to projection mappings.


Theorem B.2 Under established notation and conditions:

(i) $P_{\mathcal{M}}(ax + by) = aP_{\mathcal{M}}x + bP_{\mathcal{M}}y$, for $x, y \in \mathcal{H}$, where $a$ and $b$ are scalars.
(ii) If $\|y_n - y\| \to 0$, then $P_{\mathcal{M}}y_n \to P_{\mathcal{M}}y$, as $n \to \infty$.
(iii) $w \in \mathcal{M}$ if and only if $P_{\mathcal{M}}w = w$. Consequently, a projection mapping can be characterized by the property that $P_{\mathcal{M}}^2 = P_{\mathcal{M}}$, in the sense that, for any $y \in \mathcal{H}$, $P_{\mathcal{M}}(P_{\mathcal{M}}y) = P_{\mathcal{M}}y$.
(iv) Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be closed subspaces of $\mathcal{H}$. Then, $\mathcal{M}_1 \subseteq \mathcal{M}_2$ if and only if $P_{\mathcal{M}_1}(P_{\mathcal{M}_2}y) = P_{\mathcal{M}_1}y$ for all $y \in \mathcal{H}$.
(v) Let $\mathcal{M}$ be a closed subspace of $\mathcal{H}$ and let $\mathcal{M}^{\perp}$ denote the orthogonal complement of $\mathcal{M}$. Then, $\mathcal{M}^{\perp}$ is also a closed subspace of $\mathcal{H}$, and for any $y \in \mathcal{H}$, $y = P_{\mathcal{M}}y + P_{\mathcal{M}^{\perp}}y$.

Part (iii) of Theorem B.2 leads to the well-known result, often used in linear models, that a square matrix $M$ is a projection matrix if and only if it is symmetric and idempotent (that is, $M^2 = M$). For example, using the notation of Example B.1 for linear regression, the projection of $\boldsymbol{y}$ onto $\operatorname{sp}\{\boldsymbol{z}_1, \ldots, \boldsymbol{z}_q\}$, the space generated by the columns of $Z$, is $P_Z(\boldsymbol{y}) = Z\boldsymbol{\beta} = Z(Z'Z)^{-1}Z'\boldsymbol{y}$. The matrix $M = Z(Z'Z)^{-1}Z'$ is an $n \times n$, symmetric and idempotent matrix of rank $q$ (which is the dimension of the space that $M$ projects $\boldsymbol{y}$ onto). Parts (iv) and (v) of Theorem B.2 are useful for establishing recursive solutions for estimation and prediction.

By imposing extra structure, conditional expectation can be defined as a projection mapping for random variables in $L^2$ with the equivalence relation that, for $x, y \in L^2$, $x = y$ if $\Pr(x = y) = 1$. In particular, for $y \in L^2$, if $\mathcal{M}$ is a closed subspace of $L^2$ containing 1, the conditional expectation of $y$ given $\mathcal{M}$ is defined to be the projection of $y$ onto $\mathcal{M}$, namely, $E_{\mathcal{M}}y = P_{\mathcal{M}}y$. This means that conditional expectation, $E_{\mathcal{M}}$, must satisfy the orthogonality principle of the Projection Theorem and that the results of Theorem B.2 remain valid (the most widely used tool in this case is item (iv) of the theorem). If we let $\mathcal{M}(x)$ denote the closed subspace of all random variables in $L^2$ that can be written as a (measurable) function of $x$, then we may define, for $x, y \in L^2$, the conditional expectation of $y$ given $x$ as $E(y|x) = E_{\mathcal{M}(x)}y$. This idea may be generalized in an obvious way to define the conditional expectation of $y$ given $\boldsymbol{x} = (x_1, \ldots, x_n)$; that is, $E(y|\boldsymbol{x}) = E_{\mathcal{M}(\boldsymbol{x})}y$. Of particular interest to us is the following result, which states that, in the Gaussian case, conditional expectation and linear prediction are equivalent.

Theorem B.3 Under established notation and conditions, if $(y, x_1, \ldots, x_n)$ is multivariate normal, then
$$
E(y \mid x_1, \ldots, x_n) = P_{\overline{\operatorname{sp}}\{1, x_1, \ldots, x_n\}}\,y.
$$

Proof. First, by the projection theorem, the conditional expectation of $y$ given $\boldsymbol{x} = \{x_1, \ldots, x_n\}$ is the unique element $E_{\mathcal{M}(\boldsymbol{x})}y$ that satisfies the orthogonality principle,
$$
E\big\{\big(y - E_{\mathcal{M}(\boldsymbol{x})}y\big)w\big\} = 0 \quad\text{for all } w \in \mathcal{M}(\boldsymbol{x}).
$$


We will show that $\hat{y} = P_{\overline{\operatorname{sp}}\{1, x_1, \ldots, x_n\}}\,y$ is that element. In fact, by the projection theorem, $\hat{y}$ satisfies
$$
\langle y - \hat{y}, x_i\rangle = 0 \quad\text{for } i = 0, 1, \ldots, n,
$$
where we have set $x_0 = 1$. But $\langle y - \hat{y}, x_i\rangle = \operatorname{cov}(y - \hat{y}, x_i) = 0$, implying that $y - \hat{y}$ and $(x_1, \ldots, x_n)$ are independent because the vector $(y - \hat{y}, x_1, \ldots, x_n)'$ is multivariate normal. Thus, if $w \in \mathcal{M}(\boldsymbol{x})$, then $w$ and $y - \hat{y}$ are independent and, hence, $\langle y - \hat{y}, w\rangle = E\{(y - \hat{y})w\} = E(y - \hat{y})E(w) = 0$, recalling that $0 = \langle y - \hat{y}, 1\rangle = E(y - \hat{y})$. ∎

In the Gaussian case, conditional expectation has an explicit form. Let $\boldsymbol{y} = (y_1, \ldots, y_m)'$, $\boldsymbol{x} = (x_1, \ldots, x_n)'$, and suppose the $(m + n) \times 1$ vector $(\boldsymbol{y}', \boldsymbol{x}')'$ is normal:
$$
\begin{pmatrix}\boldsymbol{y}\\ \boldsymbol{x}\end{pmatrix} \sim N\left[\begin{pmatrix}\boldsymbol{\mu}_y\\ \boldsymbol{\mu}_x\end{pmatrix},\ \begin{pmatrix}\Sigma_{yy} & \Sigma_{yx}\\ \Sigma_{xy} & \Sigma_{xx}\end{pmatrix}\right];
$$
then $\boldsymbol{y}|\boldsymbol{x}$ is normal with
$$
\boldsymbol{\mu}_{y|x} = \boldsymbol{\mu}_y + \Sigma_{yx}\Sigma_{xx}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_x) \tag{B.9}
$$
$$
\Sigma_{y|x} = \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}, \tag{B.10}
$$
where $\Sigma_{xx}$ is assumed to be nonsingular.
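Formulas (B.9) and (B.10) are simple to evaluate directly; the following R sketch (with made-up numbers for the partitioned mean and covariance) computes the conditional mean and covariance of $\boldsymbol{y}$ given an observed $\boldsymbol{x}$:

```r
mu_y <- 0; mu_x <- c(1, 2)                        # hypothetical partitioned mean
S_yy <- matrix(2, 1, 1)
S_yx <- matrix(c(0.8, 0.3), 1, 2)
S_xx <- matrix(c(1, 0.4, 0.4, 1), 2, 2)           # must be nonsingular
x_obs <- c(1.5, 1.0)
mu_y + S_yx %*% solve(S_xx, x_obs - mu_x)         # conditional mean (B.9)
S_yy - S_yx %*% solve(S_xx, t(S_yx))              # conditional covariance (B.10)
```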

B.2 Causal Conditions for ARMA Models

In this section, we prove Property 3.1 of §3.2 pertaining to the causality of ARMA models. The proof of Property 3.2, which pertains to invertibility of ARMA models, is similar.

Proof of Property 3.1. Suppose first that the roots of $\phi(z)$, say, $z_1, \ldots, z_p$, lie outside the unit circle. We write the roots in the following order, $1 < |z_1| \le |z_2| \le \cdots \le |z_p|$, noting that $z_1, \ldots, z_p$ are not necessarily unique, and put $|z_1| = 1 + \epsilon$, for some $\epsilon > 0$. Thus, $\phi(z) \ne 0$ as long as $|z| < |z_1| = 1 + \epsilon$ and, hence, $\phi^{-1}(z)$ exists and has a power series expansion,
$$
\frac{1}{\phi(z)} = \sum_{j=0}^{\infty}a_j z^j, \qquad |z| < 1 + \epsilon.
$$
Now, choose a value $\delta$ such that $0 < \delta < \epsilon$, and set $z = 1 + \delta$, which is inside the radius of convergence. It then follows that
$$
\phi^{-1}(1 + \delta) = \sum_{j=0}^{\infty}a_j(1 + \delta)^j < \infty. \tag{B.11}
$$


Thus, we can bound each of the terms in the sum in (B.11) by a constant, say, $|a_j(1 + \delta)^j| < c$, for $c > 0$. In turn, $|a_j| < c(1 + \delta)^{-j}$, from which it follows that
$$
\sum_{j=0}^{\infty}|a_j| < \infty. \tag{B.12}
$$
Hence, $\phi^{-1}(B)$ exists and we may apply it to both sides of the ARMA model, $\phi(B)x_t = \theta(B)w_t$, to obtain
$$
x_t = \phi^{-1}(B)\phi(B)x_t = \phi^{-1}(B)\theta(B)w_t.
$$
Thus, putting $\psi(B) = \phi^{-1}(B)\theta(B)$, we have
$$
x_t = \psi(B)w_t = \sum_{j=0}^{\infty}\psi_j w_{t-j},
$$
where the $\psi$-weights, which are absolutely summable, can be evaluated by $\psi(z) = \phi^{-1}(z)\theta(z)$, for $|z| \le 1$.

Now, suppose $x_t$ is a causal process; that is, it has the representation
$$
x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j}, \qquad \sum_{j=0}^{\infty}|\psi_j| < \infty.
$$
In this case, we write
$$
x_t = \psi(B)w_t,
$$
and premultiplying by $\phi(B)$ yields
$$
\phi(B)x_t = \phi(B)\psi(B)w_t. \tag{B.13}
$$
In addition to (B.13), the model is ARMA, and can be written as
$$
\phi(B)x_t = \theta(B)w_t. \tag{B.14}
$$
From (B.13) and (B.14), we see that
$$
\phi(B)\psi(B)w_t = \theta(B)w_t. \tag{B.15}
$$
Now, let
$$
a(z) = \phi(z)\psi(z) = \sum_{j=0}^{\infty}a_j z^j, \qquad |z| \le 1,
$$
and, hence, we can write (B.15) as
$$
\sum_{j=0}^{\infty}a_j w_{t-j} = \sum_{j=0}^{q}\theta_j w_{t-j}. \tag{B.16}
$$


Next, multiply both sides of (B.16) by $w_{t-h}$, for $h = 0, 1, 2, \ldots$, and take expectation. In doing this, we obtain
$$
a_h = \theta_h, \quad h = 0, 1, \ldots, q; \qquad a_h = 0, \quad h > q. \tag{B.17}
$$
From (B.17), we conclude that
$$
\phi(z)\psi(z) = a(z) = \theta(z), \qquad |z| \le 1. \tag{B.18}
$$
If there is a complex number in the unit circle, say $z_0$, for which $\phi(z_0) = 0$, then by (B.18), $\theta(z_0) = 0$. But, if there is such a $z_0$, then $\phi(z)$ and $\theta(z)$ have a common factor, which is not allowed. Thus, we may write $\psi(z) = \theta(z)/\phi(z)$. In addition, by hypothesis, we have that $|\psi(z)| < \infty$ for $|z| \le 1$, and hence
$$
|\psi(z)| = \left|\frac{\theta(z)}{\phi(z)}\right| < \infty, \qquad \text{for } |z| \le 1. \tag{B.19}
$$
Finally, (B.19) implies $\phi(z) \ne 0$ for $|z| \le 1$; that is, the roots of $\phi(z)$ lie outside the unit circle. ∎
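In practice, causality of a fitted AR polynomial can be checked by computing the roots of $\phi(z)$ numerically. A short R sketch (the AR(2) coefficients below are arbitrary) using the base functions polyroot and ARMAtoMA:

```r
phi <- c(1.5, -0.75)                       # AR(2): phi(z) = 1 - 1.5 z + 0.75 z^2
Mod(polyroot(c(1, -phi)))                  # both moduli exceed 1, so the model is causal
psi <- ARMAtoMA(ar = phi, lag.max = 50)    # psi-weights of the causal representation
sum(abs(psi))                              # absolutely summable (tail beyond lag 50 is negligible)
```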

B.3 Large Sample Distribution of the AR(p) Conditional Least Squares Estimators

In §3.6 we discussed the conditional least squares procedure for estimating the parameters $\phi_1, \phi_2, \ldots, \phi_p$ and $\sigma_w^2$ in the AR(p) model
$$
x_t = \sum_{k=1}^{p}\phi_k x_{t-k} + w_t,
$$
where we assume $\mu = 0$, for convenience. Write the model as
$$
x_t = \boldsymbol{\phi}'\boldsymbol{x}_{t-1} + w_t, \tag{B.20}
$$
where $\boldsymbol{x}_{t-1} = (x_{t-1}, x_{t-2}, \ldots, x_{t-p})'$ is a $p \times 1$ vector of lagged values, and $\boldsymbol{\phi} = (\phi_1, \phi_2, \ldots, \phi_p)'$ is the $p \times 1$ vector of regression coefficients. Assuming observations are available at $x_1, \ldots, x_n$, the conditional least squares procedure is to minimize
$$
S_c(\boldsymbol{\phi}) = \sum_{t=p+1}^{n}\big(x_t - \boldsymbol{\phi}'\boldsymbol{x}_{t-1}\big)^2
$$
with respect to $\boldsymbol{\phi}$. The solution is
$$
\hat{\boldsymbol{\phi}} = \Big(\sum_{t=p+1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}\sum_{t=p+1}^{n}\boldsymbol{x}_{t-1}x_t \tag{B.21}
$$


for the regression vector $\boldsymbol{\phi}$; the conditional least squares estimate of $\sigma_w^2$ is
$$
\hat\sigma_w^2 = \frac{1}{n - p}\sum_{t=p+1}^{n}\big(x_t - \hat{\boldsymbol{\phi}}'\boldsymbol{x}_{t-1}\big)^2. \tag{B.22}
$$

As pointed out following (3.115), Yule–Walker estimators and least squares estimators are approximately the same in that the estimators differ only by inclusion or exclusion of terms involving the endpoints of the data. Hence, it is easy to show the asymptotic equivalence of the two estimators; this is why, for AR(p) models, (3.103) and (3.131) are equivalent. Details on the asymptotic equivalence can be found in Brockwell and Davis (1991, Chapter 8).

Here, we use the same approach as in Appendix A, replacing the lower limits of the sums in (B.21) and (B.22) by one and noting the asymptotic equivalence of the estimators
$$
\hat{\boldsymbol{\phi}} = \Big(\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}x_t \tag{B.23}
$$
and
$$
\hat\sigma_w^2 = \frac{1}{n}\sum_{t=1}^{n}\big(x_t - \hat{\boldsymbol{\phi}}'\boldsymbol{x}_{t-1}\big)^2 \tag{B.24}
$$
to those two estimators. In (B.23) and (B.24), we are acting as if we are able to observe $x_{1-p}, \ldots, x_0$ in addition to $x_1, \ldots, x_n$. The asymptotic equivalence is then seen by arguing that for $n$ sufficiently large, it makes no difference whether or not we observe $x_{1-p}, \ldots, x_0$. In the case of (B.23) and (B.24), we obtain the following theorem.

Theorem B.4 Let $x_t$ be a causal AR(p) series with white (iid) noise $w_t$ satisfying $E(w_t^4) = \eta\sigma_w^4$. Then,
$$
\hat{\boldsymbol{\phi}} \sim AN\big(\boldsymbol{\phi},\ n^{-1}\sigma_w^2\Gamma_p^{-1}\big), \tag{B.25}
$$
where $\Gamma_p = \{\gamma(i - j)\}_{i,j=1}^{p}$ is the $p \times p$ autocovariance matrix of the vector $\boldsymbol{x}_{t-1}$. We also have, as $n \to \infty$,
$$
n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}' \stackrel{p}{\to} \Gamma_p \quad\text{and}\quad \hat\sigma_w^2 \stackrel{p}{\to} \sigma_w^2. \tag{B.26}
$$

Proof. First, (B.26) follows from the fact that $E(\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}') = \Gamma_p$, recalling that, from Theorem A.6, second-order sample moments converge in probability to their population moments for linear processes in which $w_t$ has a finite fourth moment. To show (B.25), we can write
$$
\hat{\boldsymbol{\phi}} = \Big(\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}(\boldsymbol{x}_{t-1}'\boldsymbol{\phi} + w_t)
$$


$$
= \boldsymbol{\phi} + \Big(\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}w_t,
$$
so that
$$
n^{1/2}(\hat{\boldsymbol{\phi}} - \boldsymbol{\phi}) = \Big(n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}n^{-1/2}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}w_t
= \Big(n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}n^{-1/2}\sum_{t=1}^{n}\boldsymbol{u}_t,
$$
where $\boldsymbol{u}_t = \boldsymbol{x}_{t-1}w_t$. We use the fact that $w_t$ and $\boldsymbol{x}_{t-1}$ are independent to write $E\boldsymbol{u}_t = E(\boldsymbol{x}_{t-1})E(w_t) = \boldsymbol{0}$, because the errors have zero means. Also,
$$
E\boldsymbol{u}_t\boldsymbol{u}_t' = E\,\boldsymbol{x}_{t-1}w_t w_t\boldsymbol{x}_{t-1}' = E\,\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\,Ew_t^2 = \sigma_w^2\Gamma_p.
$$
In addition, we have, for $h > 0$,
$$
E\boldsymbol{u}_{t+h}\boldsymbol{u}_t' = E\,\boldsymbol{x}_{t+h-1}w_{t+h}w_t\boldsymbol{x}_{t-1}' = E\,\boldsymbol{x}_{t+h-1}w_t\boldsymbol{x}_{t-1}'\,Ew_{t+h} = 0.
$$
A similar computation works for $h < 0$.

Next, consider the mean square convergent approximation
$$
x_t^m = \sum_{j=0}^{m}\psi_j w_{t-j}
$$
for $x_t$, and define the $(m + p)$-dependent process $\boldsymbol{u}_t^m = w_t(x_{t-1}^m, x_{t-2}^m, \ldots, x_{t-p}^m)'$. Note that we need only look at a central limit theorem for the sum
$$
y_{nm} = n^{-1/2}\sum_{t=1}^{n}\boldsymbol{\lambda}'\boldsymbol{u}_t^m,
$$
for arbitrary vectors $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)'$, where $y_{nm}$ is used as an approximation to
$$
S_n = n^{-1/2}\sum_{t=1}^{n}\boldsymbol{\lambda}'\boldsymbol{u}_t.
$$
First, apply the $m$-dependent central limit theorem to $y_{nm}$ as $n \to \infty$ for fixed $m$ to establish (i) of Theorem A.2. This result shows $y_{nm} \stackrel{d}{\to} y_m$, where $y_m$ is asymptotically normal with covariance $\boldsymbol{\lambda}'\Gamma_p^{(m)}\boldsymbol{\lambda}$, where $\Gamma_p^{(m)}$ is the covariance matrix of $\boldsymbol{u}_t^m$. Then, we have $\Gamma_p^{(m)} \to \Gamma_p$, so that $y_m$ converges in distribution to a normal random variable with mean zero and variance $\boldsymbol{\lambda}'\Gamma_p\boldsymbol{\lambda}$, and we have verified part (ii) of Theorem A.2. We verify part (iii) of Theorem A.2 by noting that
$$
E[(S_n - y_{nm})^2] = n^{-1}\sum_{t=1}^{n}\boldsymbol{\lambda}'E[(\boldsymbol{u}_t - \boldsymbol{u}_t^m)(\boldsymbol{u}_t - \boldsymbol{u}_t^m)']\boldsymbol{\lambda}
$$


clearly converges to zero as $n, m \to \infty$ because
$$
x_t - x_t^m = \sum_{j=m+1}^{\infty}\psi_j w_{t-j}
$$
form the components of $\boldsymbol{u}_t - \boldsymbol{u}_t^m$.

Now, the form for $\sqrt{n}(\hat{\boldsymbol{\phi}} - \boldsymbol{\phi})$ contains the premultiplying matrix
$$
\Big(n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1} \stackrel{p}{\to} \Gamma_p^{-1},
$$
because (A.22) can be applied to the function that defines the inverse of the matrix. Then, applying (A.30) shows that
$$
n^{1/2}\big(\hat{\boldsymbol{\phi}} - \boldsymbol{\phi}\big) \stackrel{d}{\to} N\big(\boldsymbol{0},\ \sigma_w^2\Gamma_p^{-1}\Gamma_p\Gamma_p^{-1}\big),
$$
so we may regard it as being multivariate normal with mean zero and covariance matrix $\sigma_w^2\Gamma_p^{-1}$.

To investigate $\hat\sigma_w^2$, note
$$
\begin{aligned}
\hat\sigma_w^2 &= n^{-1}\sum_{t=1}^{n}\big(x_t - \hat{\boldsymbol{\phi}}'\boldsymbol{x}_{t-1}\big)^2\\
&= n^{-1}\sum_{t=1}^{n}x_t^2 - n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}'x_t\Big(n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}'\Big)^{-1}n^{-1}\sum_{t=1}^{n}\boldsymbol{x}_{t-1}x_t\\
&\stackrel{p}{\to} \gamma(0) - \boldsymbol{\gamma}_p'\Gamma_p^{-1}\boldsymbol{\gamma}_p = \sigma_w^2,
\end{aligned}
$$
and we have that the sample estimator converges in probability to $\sigma_w^2$, which is written in the form of (3.66). ∎

The arguments above imply that, for sufficiently large $n$, we may consider the estimator $\hat{\boldsymbol{\phi}}$ in (B.21) as being approximately multivariate normal with mean $\boldsymbol{\phi}$ and variance–covariance matrix $\sigma_w^2\Gamma_p^{-1}/n$. Inferences about the parameter $\boldsymbol{\phi}$ are obtained by replacing $\sigma_w^2$ and $\Gamma_p$ by their estimates given by (B.22) and
$$
\hat\Gamma_p = n^{-1}\sum_{t=p+1}^{n}\boldsymbol{x}_{t-1}\boldsymbol{x}_{t-1}',
$$
respectively. In the case of a nonzero mean, the data $x_t$ are replaced by $x_t - \bar{x}$ in the estimates, and the results of Theorem A.2 remain valid.
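A compact R illustration of the conditional least squares estimator and its approximate standard errors from Theorem B.4 (simulated AR(2) data with arbitrary coefficients; not part of the original text):

```r
set.seed(6)
n <- 500; phi <- c(1.2, -0.5)
x <- arima.sim(model = list(ar = phi), n = n)
y <- x[3:n]
X <- cbind(x[2:(n-1)], x[1:(n-2)])               # lagged regressors x_{t-1}, x_{t-2}
phi_hat <- solve(t(X) %*% X, t(X) %*% y)         # conditional least squares (B.21)
s2_hat  <- sum((y - X %*% phi_hat)^2) / (n - 2)  # sigma_w^2 estimate as in (B.22)
se <- sqrt(s2_hat * diag(solve(t(X) %*% X)))     # from the sigma_w^2 * Gamma_p^{-1} / n form
cbind(phi_hat, se)
ar.ols(x, order.max = 2, aic = FALSE, demean = FALSE, intercept = FALSE)$ar  # agrees closely
```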


B.4 The Wold Decomposition

The ARMA approach to modeling time series is generally implied by the assumption that the dependence between adjacent values in time is best explained in terms of a regression of the current values on the past values. This assumption is partially justified, in theory, by the Wold decomposition.

In this section we assume that $\{x_t;\ t = 0, \pm 1, \pm 2, \ldots\}$ is a stationary, mean-zero process. Using the notation of §B.1, we define
$$
\mathcal{M}_n^x = \overline{\operatorname{sp}}\{x_t,\ -\infty < t \le n\}, \quad\text{with}\quad \mathcal{M}_{-\infty}^x = \bigcap_{n=-\infty}^{\infty}\mathcal{M}_n^x,
$$
and
$$
\sigma_x^2 = E\big(x_{n+1} - P_{\mathcal{M}_n^x}x_{n+1}\big)^2.
$$
We say that $x_t$ is a deterministic process if and only if $\sigma_x^2 = 0$. That is, a deterministic process is one in which its future is perfectly predictable from its past; a simple example is the process given in (4.1). We are now ready to present the decomposition.

Theorem B.5 (The Wold Decomposition) Under the conditions and notation of this section, if $\sigma_x^2 > 0$, then $x_t$ can be expressed as
$$
x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} + v_t,
$$
where

(i) $\sum_{j=0}^{\infty}\psi_j^2 < \infty$ ($\psi_0 = 1$)
(ii) $\{w_t\}$ is white noise with variance $\sigma_w^2$
(iii) $w_t \in \mathcal{M}_t^x$
(iv) $\operatorname{cov}(w_s, v_t) = 0$ for all $s, t = 0, \pm 1, \pm 2, \ldots$
(v) $v_t \in \mathcal{M}_{-\infty}^x$
(vi) $\{v_t\}$ is deterministic.

The proof of the decomposition follows from the theory of §B.1 by defining the unique sequences:
$$
\begin{aligned}
w_t &= x_t - P_{\mathcal{M}_{t-1}^x}x_t,\\
\psi_j &= \sigma_w^{-2}\langle x_t, w_{t-j}\rangle = \sigma_w^{-2}E(x_t w_{t-j}),\\
v_t &= x_t - \sum_{j=0}^{\infty}\psi_j w_{t-j}.
\end{aligned}
$$

Although every stationary process can be represented by the Wold decomposition, it does not mean that the decomposition is the best way to describe the process. In addition, there may be some dependence structure among the $\{w_t\}$; we are only guaranteed that the sequence is an uncorrelated sequence. The theorem, in its generality, falls short of our needs because we would prefer the noise process, $\{w_t\}$, to be white independent noise. But the decomposition does give us the confidence that we will not be completely off the mark by fitting ARMA models to time series data.


Appendix C
Spectral Domain Theory

C.1 Spectral Representation Theorem

In this section, we present a spectral representation for the process $x_t$ itself, which allows us to think of a stationary process as a random sum of sines and cosines as described in (4.3). In addition, we present results that justify representing the autocovariance function $\gamma_x(h)$ of the weakly stationary process $x_t$ in terms of a non-negative spectral density function. The spectral density function essentially measures the variance or power in a particular kind of periodic oscillation in the function. We denote this spectral density of variance function by $f(\omega)$, where the variance is measured as a function of the frequency of oscillation $\omega$, measured in cycles per unit time.

First, we consider developing a representation for the autocovariance function of a stationary, possibly complex, series $x_t$ with zero mean and autocovariance function $\gamma_x(h) = E(x_{t+h}x_t^*)$. We prove the representation for arbitrary non-negative definite functions $\gamma(h)$ and then simply note the autocovariance function is non-negative definite, because, for any set of complex constants, $a_t$, $t = 0, \pm 1, \pm 2, \ldots$, we may write, for any finite subset,
$$
E\left|\sum_{s=1}^{n}a_s^* x_s\right|^2 = \sum_{s=1}^{n}\sum_{t=1}^{n}a_s^*\gamma(s - t)a_t \ge 0.
$$
The representation is stated in terms of non-negative definite functions and a spectral distribution function $F(\omega)$ that is monotone nondecreasing and continuous from the right, taking the values $F(-1/2) = 0$ and $F(1/2) = \sigma^2 = \gamma_x(0)$ at $\omega = -1/2$ and $1/2$, respectively.

Theorem C.1 A function $\gamma(h)$, for $h = 0, \pm 1, \pm 2, \ldots$, is non-negative definite if and only if it can be expressed as
$$
\gamma(h) = \int_{-1/2}^{1/2}\exp\{2\pi i\omega h\}\,dF(\omega), \tag{C.1}
$$


where $F(\cdot)$ is nondecreasing. The function $F(\cdot)$ is right continuous, bounded in $[-1/2, 1/2]$, and uniquely determined by the conditions $F(-1/2) = 0$, $F(1/2) = \gamma(0)$.

Proof. If $\gamma(h)$ has the representation (C.1), then
$$
\sum_{s=1}^{n}\sum_{t=1}^{n}a_s^*\gamma(s - t)a_t = \int_{-1/2}^{1/2}\sum_{s=1}^{n}\sum_{t=1}^{n}a_s^* a_t e^{2\pi i\omega(s - t)}\,dF(\omega)
= \int_{-1/2}^{1/2}\left|\sum_{t=1}^{n}a_t e^{-2\pi i\omega t}\right|^2 dF(\omega) \ge 0,
$$
and $\gamma(h)$ is non-negative definite.

Conversely, suppose $\gamma(h)$ is a non-negative definite function. Define the non-negative function
$$
f_n(\omega) = n^{-1}\sum_{s=1}^{n}\sum_{t=1}^{n}e^{-2\pi i\omega s}\gamma(s - t)e^{2\pi i\omega t}
= n^{-1}\sum_{h=-(n-1)}^{(n-1)}(n - |h|)e^{-2\pi i\omega h}\gamma(h) \ge 0. \tag{C.2}
$$
Now, let $F_n(\omega)$ be the distribution function corresponding to $f_n(\omega)I_{(-1/2,1/2]}$, where $I(\cdot)$ denotes the indicator function of the interval in the subscript. Note that $F_n(\omega) = 0$ for $\omega \le -1/2$ and $F_n(\omega) = F_n(1/2)$ for $\omega \ge 1/2$. Then,
$$
\int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF_n(\omega) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f_n(\omega)\,d\omega
= \begin{cases}(1 - |h|/n)\gamma(h), & |h| < n,\\ 0, & \text{elsewhere.}\end{cases}
$$
We also have
$$
F_n(1/2) = \int_{-1/2}^{1/2}f_n(\omega)\,d\omega = \int_{-1/2}^{1/2}\sum_{|h|<n}(1 - |h|/n)\gamma(h)e^{-2\pi i\omega h}\,d\omega = \gamma(0).
$$
Now, by Helly's first convergence theorem (Bhat, 1985, p. 157), there exists a subsequence $F_{n_k}$ converging to $F$, and by the Helly–Bray Lemma (see Bhat, p. 157), this implies
$$
\int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF_{n_k}(\omega) \to \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)
$$


and, from the right-hand side of the earlier equation,
$$
(1 - |h|/n_k)\gamma(h) \to \gamma(h)
$$
as $n_k \to \infty$, and the required result follows. ∎

Next, we present the version of the Spectral Representation Theorem in terms of a mean-zero, stationary process, $x_t$. We refer the reader to Hannan (1970, §2.3) for details. This version allows us to think of a stationary process as being generated (approximately) by a random sum of sines and cosines such as described in (4.3).

Theorem C.2 If $x_t$ is a mean-zero stationary process, with spectral distribution $F(\omega)$ as given in Theorem C.1, then there exists a complex-valued stochastic process $Z(\omega)$, on the interval $\omega \in [-1/2, 1/2]$, having stationary uncorrelated increments, such that $x_t$ can be written as the stochastic integral
$$
x_t = \int_{-1/2}^{1/2}\exp(-2\pi it\omega)\,dZ(\omega),
$$
where, for $-1/2 \le \omega_1 \le \omega_2 \le 1/2$,
$$
\operatorname{var}\{Z(\omega_2) - Z(\omega_1)\} = F(\omega_2) - F(\omega_1).
$$

An uncorrelated increment process such as $Z(\omega)$ is a mean-zero, finite variance, continuous-time stochastic process for which events occurring in non-overlapping intervals are uncorrelated; for example, consider Brownian motion, defined in Definition 5.1. The integral in this representation is a stochastic integral. To understand its meaning, let $\omega_0, \omega_1, \ldots, \omega_n$ be a partition of the interval $[-1/2, 1/2]$. Define
$$
I_n = \sum_{j=1}^{n}\exp(-2\pi it\omega_j)[Z(\omega_j) - Z(\omega_{j-1})].
$$
Then, $I = \int_{-1/2}^{1/2}\exp(-2\pi it\omega)\,dZ(\omega)$ is defined to be the mean square limit of $I_n$ as $n \to \infty$, assuming it exists.

In general, the spectral distribution function can be a mixture of discrete and continuous distributions. The special case of greatest interest is the absolutely continuous case, namely, when $dF(\omega) = f(\omega)d\omega$, and the resulting function is the spectral density considered in §4.3. What made the proof of Theorem C.1 difficult was that, after we defined
$$
f_n(\omega) = \sum_{h=-(n-1)}^{(n-1)}\Big(1 - \frac{|h|}{n}\Big)\gamma(h)e^{-2\pi i\omega h}
$$
in (C.2), we could not simply allow $n \to \infty$ because $\gamma(h)$ may not be absolutely summable. If, however, $\gamma(h)$ is absolutely summable, we may define $f(\omega) = \lim_{n\to\infty}f_n(\omega)$, and we have the following result.


Theorem C.3 If $\gamma(h)$ is the autocovariance function of a stationary process, $x_t$, with
$$
\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty, \tag{C.3}
$$
then the spectral density of $x_t$ is given by
$$
f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}. \tag{C.4}
$$

We may extend the representation to the vector case $\boldsymbol{x}_t = (x_{t1}, \ldots, x_{tp})'$ by considering linear combinations of the form
$$
y_t = \sum_{j=1}^{p}a_j^* x_{tj},
$$
which will be stationary with autocovariance functions of the form
$$
\gamma_y(h) = \sum_{j=1}^{p}\sum_{k=1}^{p}a_j^*\gamma_{jk}(h)a_k,
$$
where $\gamma_{jk}(h)$ is the usual cross-covariance function between $x_{tj}$ and $x_{tk}$. To develop the spectral representation of $\gamma_{jk}(h)$ from the representations of the univariate series, consider the linear combinations
$$
y_{t1} = x_{tj} + x_{tk} \quad\text{and}\quad y_{t2} = x_{tj} + ix_{tk},
$$
which are both stationary series with respective covariance functions
$$
\gamma_1(h) = \gamma_{jj}(h) + \gamma_{jk}(h) + \gamma_{kj}(h) + \gamma_{kk}(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dG_1(\omega),
$$
$$
\gamma_2(h) = \gamma_{jj}(h) + i\gamma_{kj}(h) - i\gamma_{jk}(h) + \gamma_{kk}(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dG_2(\omega).
$$
Introducing the spectral representations for $\gamma_{jj}(h)$ and $\gamma_{kk}(h)$ yields
$$
\gamma_{jk}(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF_{jk}(\omega),
$$
with


$$
F_{jk}(\omega) = \frac{1}{2}\Big[G_1(\omega) + iG_2(\omega) - (1 + i)\big(F_{jj}(\omega) + F_{kk}(\omega)\big)\Big].
$$
Now, under the summability condition
$$
\sum_{h=-\infty}^{\infty}|\gamma_{jk}(h)| < \infty,
$$
we have the representation
$$
\gamma_{jk}(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f_{jk}(\omega)\,d\omega,
$$
where the cross-spectral density function has the inverse Fourier representation
$$
f_{jk}(\omega) = \sum_{h=-\infty}^{\infty}\gamma_{jk}(h)e^{-2\pi i\omega h}.
$$
The cross-covariance function satisfies $\gamma_{jk}(h) = \gamma_{kj}(-h)$, which implies $f_{jk}(\omega) = f_{kj}(-\omega)$ using the above representation.

Then, defining the autocovariance function of the general vector process $\boldsymbol{x}_t$ as the $p \times p$ matrix
$$
\Gamma(h) = E[(\boldsymbol{x}_{t+h} - \boldsymbol{\mu}_x)(\boldsymbol{x}_t - \boldsymbol{\mu}_x)'],
$$
and the $p \times p$ spectral matrix as $f(\omega) = \{f_{jk}(\omega);\ j, k = 1, \ldots, p\}$, we have the representation in matrix form, written as
$$
\Gamma(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\,d\omega, \tag{C.5}
$$
and the inverse result
$$
f(\omega) = \sum_{h=-\infty}^{\infty}\Gamma(h)e^{-2\pi i\omega h}, \tag{C.6}
$$
which appears as Property 4.6 in §4.6. Theorem C.2 can also be extended to the multivariate case.

C.2 Large Sample Distribution of the DFT and Smoothed Periodogram

We have previously introduced the DFT, for the stationary zero-mean process $x_t$, observed at $t = 1, \ldots, n$, as
$$
d(\omega) = n^{-1/2}\sum_{t=1}^{n}x_t e^{-2\pi i\omega t}, \tag{C.7}
$$


as the result of matching sines and cosines of frequency $\omega$ against the series $x_t$. We will suppose now that $x_t$ has an absolutely continuous spectrum $f(\omega)$ corresponding to the absolutely summable autocovariance function $\gamma(h)$. Our purpose in this section is to examine the statistical properties of the complex random variables $d(\omega_k)$, for $\omega_k = k/n$, $k = 0, 1, \ldots, n - 1$, in providing a basis for the estimation of $f(\omega)$. To develop the statistical properties, we examine the behavior of
$$
\begin{aligned}
S_n(\omega, \omega) &= E|d(\omega)|^2 = n^{-1}E\Big[\sum_{s=1}^{n}x_s e^{-2\pi i\omega s}\sum_{t=1}^{n}x_t e^{2\pi i\omega t}\Big]\\
&= n^{-1}\sum_{s=1}^{n}\sum_{t=1}^{n}e^{-2\pi i\omega s}e^{2\pi i\omega t}\gamma(s - t)\\
&= \sum_{h=-(n-1)}^{n-1}(1 - |h|/n)\gamma(h)e^{-2\pi i\omega h},
\end{aligned} \tag{C.8}
$$
where we have let $h = s - t$. Using dominated convergence,
$$
S_n(\omega, \omega) \to \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h} = f(\omega),
$$
as $n \to \infty$, making the large sample variance of the Fourier transform equal to the spectrum evaluated at $\omega$. We have already seen this result in Theorem C.3. For exact bounds it is also convenient to add an absolute summability assumption for the autocovariance function, namely,
$$
\theta = \sum_{h=-\infty}^{\infty}|h|\,|\gamma(h)| < \infty. \tag{C.9}
$$

Example C.1 Condition (C.9) Verified for an AR(1)

For a causal AR(1) series, $x_t = \phi x_{t-1} + w_t$, we may write $\theta$ in (C.9) as
$$
\theta = \frac{\sigma_w^2}{1 - \phi^2}\sum_{h=-\infty}^{\infty}|h|\,\phi^{|h|}.
$$
The condition that $\theta$ is finite is equivalent to summability of
$$
\sum_{h=1}^{\infty}h\phi^h = \phi\sum_{h=1}^{\infty}h\phi^{h-1} = \phi\frac{\partial}{\partial\phi}\sum_{h=1}^{\infty}\phi^h = \frac{\phi}{(1 - \phi)^2},
$$
and hence,
$$
\theta = \frac{2\sigma_w^2\,\phi}{(1 - \phi)^3(1 + \phi)} < \infty.
$$
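The closed form for $\theta$ is easy to check numerically; in the R fragment below (arbitrary $\phi$; not part of the original text), a long truncated version of the sum in (C.9) is compared with $2\sigma_w^2\phi/[(1-\phi)^3(1+\phi)]$:

```r
phi <- 0.6; sigma2_w <- 1
gam <- function(h) sigma2_w * phi^abs(h) / (1 - phi^2)   # AR(1) autocovariance function
h <- -5000:5000
c(truncated = sum(abs(h) * abs(gam(h))),
  closed    = 2 * sigma2_w * phi / ((1 - phi)^3 * (1 + phi)))   # both about 11.72
```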


To elaborate further, we derive two approximation lemmas.

Lemma C.1 For $S_n(\omega, \omega)$ as defined in (C.8) and $\theta$ in (C.9) finite, we have
$$
|S_n(\omega, \omega) - f(\omega)| \le \frac{\theta}{n} \tag{C.10}
$$
or
$$
S_n(\omega, \omega) = f(\omega) + O(n^{-1}). \tag{C.11}
$$

Proof. To prove the lemma, write
$$
\begin{aligned}
n|S_n(\omega, \omega) - f_x(\omega)| &= \Big|\sum_{|u|<n}(n - |u|)\gamma(u)e^{-2\pi i\omega u} - n\sum_{u=-\infty}^{\infty}\gamma(u)e^{-2\pi i\omega u}\Big|\\
&= \Big|{-n}\sum_{|u|\ge n}\gamma(u)e^{-2\pi i\omega u} - \sum_{|u|<n}|u|\gamma(u)e^{-2\pi i\omega u}\Big|\\
&\le \sum_{|u|\ge n}|u|\,|\gamma(u)| + \sum_{|u|<n}|u|\,|\gamma(u)| = \theta,
\end{aligned}
$$
which establishes the lemma. ∎

Lemma C.2 For $\omega_k = k/n$, $\omega_\ell = \ell/n$, $\omega_k - \omega_\ell \ne 0, \pm 1, \pm 2, \pm 3, \ldots$, and $\theta$ in (C.9), we have
$$
|S_n(\omega_k, \omega_\ell)| \le \frac{\theta}{n} = O(n^{-1}), \tag{C.12}
$$
where
$$
S_n(\omega_k, \omega_\ell) = E\{d(\omega_k)\overline{d(\omega_\ell)}\}. \tag{C.13}
$$

Proof. Write
$$
nS_n(\omega_k, \omega_\ell) = \sum_{u=-(n-1)}^{-1}\gamma(u)\sum_{v=-(u-1)}^{n}e^{-2\pi i(\omega_k-\omega_\ell)v}e^{-2\pi i\omega_k u}
+ \sum_{u=0}^{n-1}\gamma(u)\sum_{v=1}^{n-u}e^{-2\pi i(\omega_k-\omega_\ell)v}e^{-2\pi i\omega_k u}.
$$
Now, for the first term, with $u < 0$,
$$
\sum_{v=-(u-1)}^{n}e^{-2\pi i(\omega_k-\omega_\ell)v} = \Big(\sum_{v=1}^{n} - \sum_{v=1}^{-u}\Big)e^{-2\pi i(\omega_k-\omega_\ell)v} = 0 - \sum_{v=1}^{-u}e^{-2\pi i(\omega_k-\omega_\ell)v}.
$$


For the second term, with $u \ge 0$,
$$
\sum_{v=1}^{n-u}e^{-2\pi i(\omega_k-\omega_\ell)v} = \Big(\sum_{v=1}^{n} - \sum_{v=n-u+1}^{n}\Big)e^{-2\pi i(\omega_k-\omega_\ell)v} = 0 - \sum_{v=n-u+1}^{n}e^{-2\pi i(\omega_k-\omega_\ell)v}.
$$
Consequently,
$$
\begin{aligned}
n|S_n(\omega_k, \omega_\ell)| &= \Big|-\sum_{u=-(n-1)}^{-1}\gamma(u)\sum_{v=1}^{-u}e^{-2\pi i(\omega_k-\omega_\ell)v}e^{-2\pi i\omega_k u}
- \sum_{u=1}^{n-1}\gamma(u)\sum_{v=n-u+1}^{n}e^{-2\pi i(\omega_k-\omega_\ell)v}e^{-2\pi i\omega_k u}\Big|\\
&\le \sum_{u=-(n-1)}^{0}(-u)|\gamma(u)| + \sum_{u=1}^{n-1}u|\gamma(u)|
= \sum_{u=-(n-1)}^{(n-1)}|u|\,|\gamma(u)|.
\end{aligned}
$$
Hence, we have
$$
|S_n(\omega_k, \omega_\ell)| \le \frac{\theta}{n},
$$
and the asserted relations of the lemma follow. ∎

and the asserted relations of the lemma follow. ut

Because the DFTs are approximately uncorrelated, say, of order 1/n, whenthe frequencies are of the form ωk = k/n, we shall compute at those frequen-cies. The behavior of f(ω) at neighboring frequencies, however, will often beof interest and we shall use Lemma C.3 below to handle such cases.

Lemma C.3 For |ωk − ω| ≤ L/2n and θ in (C.9), we have

|f(ωk)− f(ω)| ≤ πθL

n(C.14)

orf(ωk)− f(ω) = O(L/n). (C.15)

Proof. Write the difference

|f(ωk)− f(ω)| =∣∣∣ ∞∑h=−∞

γ(h)(

e−2πiωkh − e−2πiωh)∣∣∣

≤∞∑

h=−∞

|γ(h)|∣∣e−πi(ωk−ω)h − eπi(ωk−ω)h

∣∣= 2

∞∑h=−∞

|γ(h)|∣∣sin[π(ωk − ω)h]

∣∣


$$
\le 2\pi|\omega_k - \omega|\sum_{h=-\infty}^{\infty}|h|\,|\gamma(h)| \le \frac{\pi\theta L}{n}
$$
because $|\sin x| \le |x|$. ∎

The main use of the properties described by Lemmas C.1 and C.2 is in identifying the covariance structure of the DFT, say,
$$
d(\omega_k) = n^{-1/2}\sum_{t=1}^{n}x_t e^{-2\pi i\omega_k t} = d_c(\omega_k) - i\,d_s(\omega_k),
$$
where
$$
d_c(\omega_k) = n^{-1/2}\sum_{t=1}^{n}x_t\cos(2\pi\omega_k t) \quad\text{and}\quad d_s(\omega_k) = n^{-1/2}\sum_{t=1}^{n}x_t\sin(2\pi\omega_k t)
$$
are the cosine and sine transforms, respectively, of the observed series, defined previously in (4.23) and (4.24). For example, assuming zero means for convenience, we will have
$$
\begin{aligned}
E[d_c(\omega_k)d_c(\omega_\ell)] &= \frac{1}{4}n^{-1}\sum_{s=1}^{n}\sum_{t=1}^{n}\gamma(s - t)\big(e^{2\pi i\omega_k s} + e^{-2\pi i\omega_k s}\big)\big(e^{2\pi i\omega_\ell t} + e^{-2\pi i\omega_\ell t}\big)\\
&= \frac{1}{4}\big[S_n(-\omega_k, \omega_\ell) + S_n(\omega_k, \omega_\ell) + S_n(\omega_\ell, \omega_k) + S_n(\omega_k, -\omega_\ell)\big].
\end{aligned}
$$
Lemmas C.1 and C.2 imply, for $k = \ell$,
$$
E[d_c(\omega_k)d_c(\omega_\ell)] = \frac{1}{4}\big[O(n^{-1}) + f(\omega_k) + O(n^{-1}) + f(\omega_k) + O(n^{-1}) + O(n^{-1})\big]
= \frac{1}{2}f(\omega_k) + O(n^{-1}). \tag{C.16}
$$
For $k \ne \ell$, all terms are $O(n^{-1})$. Hence, we have
$$
E[d_c(\omega_k)d_c(\omega_\ell)] = \begin{cases}\tfrac{1}{2}f(\omega_k) + O(n^{-1}), & k = \ell,\\ O(n^{-1}), & k \ne \ell.\end{cases} \tag{C.17}
$$

A similar argument gives


$$
E[d_s(\omega_k)d_s(\omega_\ell)] = \begin{cases}\tfrac{1}{2}f(\omega_k) + O(n^{-1}), & k = \ell,\\ O(n^{-1}), & k \ne \ell,\end{cases} \tag{C.18}
$$
and we also have $E[d_s(\omega_k)d_c(\omega_\ell)] = O(n^{-1})$ for all $k, \ell$. We may summarize the results of Lemmas C.1–C.3 as follows.

Theorem C.4 For a stationary mean zero process with autocovariance function satisfying (C.9) and frequencies $\omega_{k:n}$, such that $|\omega_{k:n} - \omega| < 1/n$, close to some target frequency $\omega$, the cosine and sine transforms (4.23) and (4.24) are approximately uncorrelated with variances equal to $(1/2)f(\omega)$, and the error in the approximation can be uniformly bounded by $\pi\theta L/n$.

Now, consider estimating the spectrum in a neighborhood of some target frequency $\omega$, using the periodogram estimator
$$
I(\omega_{k:n}) = |d(\omega_{k:n})|^2 = d_c^2(\omega_{k:n}) + d_s^2(\omega_{k:n}),
$$
where we take $|\omega_{k:n} - \omega| \le n^{-1}$ for each $n$. In case the series $x_t$ is Gaussian with zero mean,
$$
\begin{pmatrix}d_c(\omega_{k:n})\\ d_s(\omega_{k:n})\end{pmatrix} \stackrel{d}{\to} N\left\{\begin{pmatrix}0\\ 0\end{pmatrix},\ \frac{1}{2}\begin{pmatrix}f(\omega) & 0\\ 0 & f(\omega)\end{pmatrix}\right\},
$$
and we have that
$$
\frac{2I(\omega_{k:n})}{f(\omega)} \stackrel{d}{\to} \chi_2^2,
$$
where $\chi_\nu^2$ denotes a chi-squared random variable with $\nu$ degrees of freedom, as usual. Unfortunately, the distribution does not become more concentrated as $n \to \infty$, because the variance of the periodogram estimator does not go to zero.
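The $\chi_2^2$ behavior of the periodogram is easy to see by simulation; the R sketch below (Gaussian white noise, so $f(\omega) \equiv \sigma_w^2$; not part of the original text) computes the raw periodogram via the FFT and checks that $2I(\omega_j)/f(\omega_j)$ has roughly the mean and variance of a $\chi_2^2$ variable, regardless of the sample size.

```r
set.seed(7)
n <- 1024; sigma2 <- 4
x <- rnorm(n, sd = sqrt(sigma2))        # Gaussian white noise, f(omega) = sigma2
I <- Mod(fft(x))^2 / n                  # periodogram at frequencies j/n, j = 0, ..., n-1
j <- 1:(n/2 - 1)                        # drop omega = 0 and omega = 1/2
ratio <- 2 * I[j + 1] / sigma2          # should behave like chi-squared with 2 df
c(mean(ratio), var(ratio))              # chi^2_2 has mean 2 and variance 4
```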

We develop a fix for the deficiencies mentioned above by considering the average of the periodogram over a set of frequencies in the neighborhood of $\omega$. For example, we can always find a set of $L = 2m + 1$ frequencies of the form $\{\omega_{j:n} + k/n;\ k = 0, \pm 1, \pm 2, \ldots, \pm m\}$, for which
$$
f(\omega_{j:n} + k/n) = f(\omega) + O(Ln^{-1})
$$
by Lemma C.3. As $n$ increases, the values of the separate frequencies change.

Now, we can consider the smoothed periodogram estimator, $\hat{f}(\omega)$, given in (4.56); this case includes the averaged periodogram, $\bar{f}(\omega)$. First, we note that (C.9), $\theta = \sum_{h=-\infty}^{\infty}|h|\,|\gamma(h)| < \infty$, is a crucial condition in the estimation of spectra. In investigating local averages of the periodogram, we will require a condition on the rate of (C.9), namely
$$
n^{-1}\sum_{h=-n}^{n}|h|\,|\gamma(h)| = O(n^{-1/2}). \tag{C.19}
$$


One can show that a sufficient condition for (C.19) is that the time series is the linear process given by

x_t = Σ_{j=−∞}^{∞} ψ_j w_{t−j},    Σ_{j=0}^{∞} √j |ψ_j| < ∞,    (C.20)

where w_t ∼ iid(0, σ_w^2) and w_t has finite fourth moment, E(w_t^4) = ησ_w^4 < ∞. We leave it to the reader (see Problem 4.40 for more details) to show (C.20) implies (C.19). If w_t ∼ wn(0, σ_w^2), then (C.20) implies (C.19), but we will require the noise to be iid in the following lemma.

Lemma C.4 Suppose x_t is the linear process given by (C.20), and let I(ω_j) be the periodogram of the data {x_1, . . . , x_n}. Then

cov(I(ω_j), I(ω_k)) =  2f^2(ω_j) + o(1),   ω_j = ω_k = 0 or 1/2,
                       f^2(ω_j) + o(1),    ω_j = ω_k ≠ 0, 1/2,
                       O(n^{−1}),          ω_j ≠ ω_k.

The proof of Lemma C.4 is straightforward but tedious, and details may be found in Fuller (1976, Theorem 7.2.1) or in Brockwell and Davis (1991, Theorem 10.3.2). For demonstration purposes, we present the proof of the lemma for the pure white noise case; i.e., x_t = w_t, in which case f(ω) ≡ σ_w^2. By definition, the periodogram in this case is

I(ω_j) = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} w_s w_t e^{2πiω_j(t−s)},

where ω_j = j/n, and hence

E{I(ω_j)I(ω_k)} = n^{−2} Σ_{s=1}^{n} Σ_{t=1}^{n} Σ_{u=1}^{n} Σ_{v=1}^{n} E(w_s w_t w_u w_v) e^{2πiω_j(t−s)} e^{2πiω_k(u−v)}.

Now, when all the subscripts match, E(w_s w_t w_u w_v) = ησ_w^4; when the subscripts match in pairs (e.g., s = t ≠ u = v), E(w_s w_t w_u w_v) = σ_w^4; otherwise, E(w_s w_t w_u w_v) = 0. Thus,

E{I(ω_j)I(ω_k)} = n^{−1}(η − 3)σ_w^4 + σ_w^4 ( 1 + n^{−2}[ A(ω_j + ω_k) + A(ω_k − ω_j) ] ),

where

A(λ) = | Σ_{t=1}^{n} e^{2πiλt} |^2.

Noting that E I(ω_j) = n^{−1} Σ_{t=1}^{n} E(w_t^2) = σ_w^2, we have


cov{I(ω_j), I(ω_k)} = E{I(ω_j)I(ω_k)} − σ_w^4
    = n^{−1}(η − 3)σ_w^4 + n^{−2}σ_w^4 [ A(ω_j + ω_k) + A(ω_k − ω_j) ].

Thus we conclude that

var{I(ω_j)} = n^{−1}(η − 3)σ_w^4 + σ_w^4      for ω_j ≠ 0, 1/2,
var{I(ω_j)} = n^{−1}(η − 3)σ_w^4 + 2σ_w^4     for ω_j = 0, 1/2,
cov{I(ω_j), I(ω_k)} = n^{−1}(η − 3)σ_w^4      for ω_j ≠ ω_k,

which establishes the result in this case. We also note that if w_t is Gaussian, then η = 3 and the periodogram ordinates are independent. Using Lemma C.4, we may establish the following fundamental result.
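Before stating it, here is a quick numerical check (ours, not the text's) of the white noise case just derived: with Gaussian noise, η = 3, so the n^{−1}(η − 3) terms vanish and distinct periodogram ordinates should be essentially uncorrelated, each with variance σ_w^4. The sample size, noise variance, and frequencies below are arbitrary.

set.seed(90210)
n = 512; sigma2 = 2
per = replicate(4000, {
  w = rnorm(n, 0, sqrt(sigma2))          # Gaussian white noise, so eta = 3
  I = Mod(fft(w))^2/n                    # periodogram ordinates I(j/n)
  I[c(11, 26) + 1] })                    # I(10/n) and I(25/n)
round(c(var1 = var(per[1,]), var2 = var(per[2,]), cov = cov(per[1,], per[2,])), 2)
# both variances are near sigma2^2 = 4 and the covariance is near 0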

Theorem C.5 Suppose x_t is the linear process given by (C.20). Then, with f̂(ω) defined in (4.56) and corresponding conditions on the weights h_k, we have, as n → ∞,

(i) E( f̂(ω) ) → f(ω)

(ii) ( Σ_{k=−m}^{m} h_k^2 )^{−1} cov( f̂(ω), f̂(λ) ) → f^2(ω) for ω = λ ≠ 0, 1/2.

In (ii), replace f^2(ω) by 0 if ω ≠ λ and by 2f^2(ω) if ω = λ = 0 or 1/2.

Proof. (i): First, recall (4.28),

E[I(ω_{j:n})] = Σ_{h=−(n−1)}^{n−1} ( (n − |h|)/n ) γ(h) e^{−2πiω_{j:n}h}  =:  f_n(ω_{j:n}).

But since f_n(ω_{j:n}) → f(ω) uniformly, and |f(ω_{j:n}) − f(ω_{j:n} + k/n)| → 0 by the continuity of f, we have

E f̂(ω) = Σ_{k=−m}^{m} h_k E I(ω_{j:n} + k/n) = Σ_{k=−m}^{m} h_k f_n(ω_{j:n} + k/n)
    = Σ_{k=−m}^{m} h_k [ f(ω) + o(1) ] → f(ω),

because Σ_{k=−m}^{m} h_k = 1.

(ii): First, suppose we have ω_{j:n} → ω_1 and ω_{ℓ:n} → ω_2, and ω_1 ≠ ω_2. Then, for n large enough to separate the bands, using Lemma C.4, we have

| cov( f̂(ω_1), f̂(ω_2) ) | = | Σ_{|k|≤m} Σ_{|r|≤m} h_k h_r cov[ I(ω_{j:n} + k/n), I(ω_{ℓ:n} + r/n) ] |
    = | Σ_{|k|≤m} Σ_{|r|≤m} h_k h_r O(n^{−1}) |
    ≤ (c/n) ( Σ_{|k|≤m} h_k )^2        (where c is a constant)
    ≤ (cL/n) Σ_{|k|≤m} h_k^2,

which establishes (ii) for the case of different frequencies. The case of the same frequencies, i.e., ω = λ, is established in a similar manner to the above arguments. ∎
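As a numerical sketch of part (ii) (ours, not the text's): with equal weights h_k = 1/L we have Σ_k h_k^2 = 1/L, so the variance of the averaged periodogram should fall like f^2(ω)/L. The AR(1) model, target frequency, and values of L below are arbitrary choices.

set.seed(1066)
phi = .5; n = 2048; k0 = round(.10*n)            # target frequency omega near .10
f.om = 1/Mod(1 - phi*exp(-2i*pi*k0/n))^2         # true AR(1) spectrum at omega (sigma_w^2 = 1)
for (L in c(1, 9, 25)) {
  m = (L - 1)/2
  fhat = replicate(1000, {
    x = arima.sim(list(ar = phi), n = n)
    I = Mod(fft(x - mean(x)))^2/n
    mean(I[(k0 - m):(k0 + m) + 1]) })            # average of L neighboring ordinates
  cat("L =", L, " L*var(fhat)/f(omega)^2 =", round(L*var(fhat)/f.om^2, 2), "\n")
}
# the printed ratios stay near 1, i.e., var(fhat) is about f(omega)^2/L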

Theorem C.5 justifies the distributional properties used throughout §4.5 and Chapter 7. We may extend the results of this section to vector series of the form x_t = (x_{t1}, . . . , x_{tp})′, when the cross-spectrum is given by

f_{ij}(ω) = Σ_{h=−∞}^{∞} γ_{ij}(h) e^{−2πiωh} = c_{ij}(ω) − i q_{ij}(ω),    (C.21)

where

c_{ij}(ω) = Σ_{h=−∞}^{∞} γ_{ij}(h) cos(2πωh)    (C.22)

and

q_{ij}(ω) = Σ_{h=−∞}^{∞} γ_{ij}(h) sin(2πωh)    (C.23)

denote the cospectrum and quadspectrum, respectively. We denote the DFT of the series x_{tj} by

d_j(ω_k) = n^{−1/2} Σ_{t=1}^{n} x_{tj} e^{−2πiω_k t} = d_{cj}(ω_k) − i d_{sj}(ω_k),

where d_{cj} and d_{sj} are the cosine and sine transforms of x_{tj}, for j = 1, 2, . . . , p. We bound the covariance structure as before and summarize the results as follows.

Theorem C.6 The covariance structure of the multivariate cosine and sine transforms, subject to

θ_{ij} = Σ_{h=−∞}^{∞} |h||γ_{ij}(h)| < ∞,    (C.24)

is given by

E[d_{ci}(ω_k) d_{cj}(ω_ℓ)] =  (1/2) c_{ij}(ω_k) + O(n^{−1}),   k = ℓ,
                              O(n^{−1}),                       k ≠ ℓ,    (C.25)

E[d_{ci}(ω_k) d_{sj}(ω_ℓ)] =  −(1/2) q_{ij}(ω_k) + O(n^{−1}),  k = ℓ,
                              O(n^{−1}),                       k ≠ ℓ,    (C.26)

E[d_{si}(ω_k) d_{cj}(ω_ℓ)] =  (1/2) q_{ij}(ω_k) + O(n^{−1}),   k = ℓ,
                              O(n^{−1}),                       k ≠ ℓ,    (C.27)

E[d_{si}(ω_k) d_{sj}(ω_ℓ)] =  (1/2) c_{ij}(ω_k) + O(n^{−1}),   k = ℓ,
                              O(n^{−1}),                       k ≠ ℓ.    (C.28)

Proof. We define

S_n^{ij}(ω_k, ω_ℓ) = Σ_{s=1}^{n} Σ_{t=1}^{n} γ_{ij}(s − t) e^{−2πiω_k s} e^{2πiω_ℓ t}.    (C.29)

Then, we may verify the theorem with manipulations like

E[d_{ci}(ω_k) d_{sj}(ω_k)]
    = (1/4i) Σ_{s=1}^{n} Σ_{t=1}^{n} γ_{ij}(s − t) ( e^{2πiω_k s} + e^{−2πiω_k s} ) ( e^{2πiω_k t} − e^{−2πiω_k t} )
    = (1/4i) [ S_n^{ij}(−ω_k, ω_k) + S_n^{ij}(ω_k, ω_k) − S_n^{ij}(ω_k, ω_k) − S_n^{ij}(ω_k, −ω_k) ]
    = (1/4i) [ c_{ij}(ω_k) − i q_{ij}(ω_k) − ( c_{ij}(ω_k) + i q_{ij}(ω_k) ) + O(n^{−1}) ]
    = −(1/2) q_{ij}(ω_k) + O(n^{−1}),

where we have used the fact that the properties given in Lemmas C.1–C.3 can be verified for the cross-spectral density functions f_{ij}(ω), i, j = 1, . . . , p. ∎

Now, if the underlying multivariate time series x_t is a normal process, it is clear that the DFTs will be jointly normal and we may define the vector DFT, d(ω_k) = (d_1(ω_k), . . . , d_p(ω_k))′, as

d(ω_k) = n^{−1/2} Σ_{t=1}^{n} x_t e^{−2πiω_k t} = d_c(ω_k) − i d_s(ω_k),    (C.30)

where

d_c(ω_k) = n^{−1/2} Σ_{t=1}^{n} x_t cos(2πω_k t)    (C.31)

and

d_s(ω_k) = n^{−1/2} Σ_{t=1}^{n} x_t sin(2πω_k t)    (C.32)

are the cosine and sine transforms, respectively, of the observed vector series x_t. Then, constructing the vector of real and imaginary parts (d_c(ω_k)′, d_s(ω_k)′)′, we may note it has mean zero and 2p × 2p covariance matrix

Σ(ω_k) = (1/2) [ C(ω_k)  −Q(ω_k)
                 Q(ω_k)   C(ω_k) ]    (C.33)

to order n^{−1} as long as ω_k − ω = O(n^{−1}). We have introduced the p × p matrices C(ω_k) = {c_{ij}(ω_k)} and Q = {q_{ij}(ω_k)}. The complex random variable d(ω_k) has covariance

S(ω_k) = E[ d(ω_k) d*(ω_k) ]
    = E[ ( d_c(ω_k) − i d_s(ω_k) ) ( d_c(ω_k) − i d_s(ω_k) )* ]
    = E[ d_c(ω_k) d_c(ω_k)′ ] + E[ d_s(ω_k) d_s(ω_k)′ ] − i ( E[ d_s(ω_k) d_c(ω_k)′ ] − E[ d_c(ω_k) d_s(ω_k)′ ] )
    = C(ω_k) − i Q(ω_k).    (C.34)

If the process x_t has a multivariate normal distribution, the complex vector d(ω_k) has approximately the complex multivariate normal distribution with mean zero and covariance matrix S(ω_k) = C(ω_k) − iQ(ω_k) if the real and imaginary parts have the covariance structure as specified above. In the next section, we work further with this distribution and show how it adapts to the real case. If we wish to estimate the spectral matrix S(ω), it is natural to take a band of frequencies of the form ω_{k:n} + ℓ/n, for ℓ = −m, . . . , m as before, so that the estimator becomes (4.91) of §4.6. A discussion of further properties of the multivariate complex normal distribution is deferred.
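For example, the mvspec script supplied with the text (see the R Supplement) returns such an estimate of the spectral matrix; a minimal sketch, with our own choice of series and smoothing span and assuming tsa3.rda has been loaded, is

spec = mvspec(cbind(soi, rec), spans = 11)   # smoothed bivariate spectral estimate
dim(spec$fxx)                                # 2 x 2 x (number of frequencies)
spec$fxx[,, 40]                              # estimated 2 x 2 spectral matrix at one frequency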

It is also of interest to develop a large sample theory for cases in which the underlying distribution is not necessarily normal. If x_t is not necessarily a normal process, some additional conditions are needed to get asymptotic normality. In particular, introduce the notion of a generalized linear process

y_t = Σ_{r=−∞}^{∞} A_r w_{t−r},    (C.35)

where w_t is a p × 1 vector white noise process with p × p covariance E[w_t w_t′] = G, and the p × p matrices of filter coefficients A_t satisfy

Σ_{t=−∞}^{∞} tr{A_t A_t′} = Σ_{t=−∞}^{∞} ‖A_t‖^2 < ∞.    (C.36)

In particular, stable vector ARMA processes satisfy these conditions. For generalized linear processes, we state the following general result from Hannan (1970, p. 224).

Theorem C.7 If x_t is generated by a generalized linear process with a continuous spectrum that is not zero at ω, and ω_{k:n} + ℓ/n are a set of frequencies within L/n of ω, the joint density of the cosine and sine transforms (C.31) and (C.32) converges to that of L independent 2p × 1 normal vectors with covariance matrix Σ(ω) with structure given by (C.33). At ω = 0 or ω = 1/2, the distribution is real with covariance matrix 2Σ(ω).

The above result provides the basis for inference involving the Fourier transforms of stationary series because it justifies approximations to the likelihood function based on multivariate normal theory. We make extensive use of this result in Chapter 7, but will still need a simple form to justify the distributional result for the sample coherence given in (4.97). The next section gives an elementary introduction to the complex normal distribution.

C.3 The Complex Multivariate Normal Distribution

The multivariate normal distribution will be the fundamental tool for expressing the likelihood function and determining approximate maximum likelihood estimators and their large sample probability distributions. A detailed treatment of the multivariate normal distribution can be found in standard texts such as Anderson (1984). We will use the multivariate normal distribution of the p × 1 vector x = (x_1, x_2, . . . , x_p)′, as defined by its density function

p(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ)′ Σ^{−1} (x − μ) },    (C.37)

which can be shown to have mean vector E[x] = μ = (μ_1, . . . , μ_p)′ and covariance matrix

Σ = E[(x − μ)(x − μ)′].    (C.38)

We use the notation x ∼ N_p(μ, Σ) for densities of the form (C.37) and note that linearly transformed multivariate normal variables of the form y = Ax, with A a q × p matrix, q ≤ p, will also be multivariate normal with distribution

y ∼ N_q(Aμ, AΣA′).    (C.39)

Often, the partitioned multivariate normal, based on the vector x = (x_1′, x_2′)′, split into two p_1 × 1 and p_2 × 1 components x_1 and x_2, respectively, will be used, where p = p_1 + p_2. If the mean vector μ = (μ_1′, μ_2′)′ and covariance matrix

Σ = [ Σ_11  Σ_12
      Σ_21  Σ_22 ]    (C.40)

are also compatibly partitioned, the marginal distribution of any subset of components is multivariate normal, say,

x_1 ∼ N_{p_1}(μ_1, Σ_11),

and the conditional distribution of x_2 given x_1 is normal with mean

E[x_2 | x_1] = μ_2 + Σ_21 Σ_11^{−1} (x_1 − μ_1)    (C.41)

and conditional covariance

cov[x_2 | x_1] = Σ_22 − Σ_21 Σ_11^{−1} Σ_12.    (C.42)
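As a quick simulation check of (C.41) and (C.42) (ours, not part of the text): the residual of x_2 about its conditional mean should be uncorrelated with x_1 and have covariance Σ_22 − Σ_21 Σ_11^{−1} Σ_12. The covariance matrix below is an arbitrary positive definite choice, and MASS::mvrnorm is used only to generate the normal draws.

library(MASS)
set.seed(314)
Sigma = matrix(c(4, 1, 2,
                 1, 3, 1,
                 2, 1, 5), 3, 3)                           # p1 = 1, p2 = 2
x  = mvrnorm(10000, mu = rep(0, 3), Sigma = Sigma)
x1 = x[, 1, drop = FALSE];  x2 = x[, 2:3]
B  = Sigma[2:3, 1, drop = FALSE] %*% solve(Sigma[1, 1])    # Sigma_21 Sigma_11^{-1}
res = x2 - x1 %*% t(B)                                     # x2 minus its conditional mean (means are 0)
round(cov(cbind(x1, res)), 2)                              # x1 nearly uncorrelated with the residuals
Sigma[2:3, 2:3] - B %*% Sigma[1, 2:3]                      # theoretical conditional covariance (C.42)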

In the previous section, the real and imaginary parts of the DFT had a partitioned covariance matrix as given in (C.33), and we use this result to say the complex p × 1 vector

z = x_1 − i x_2    (C.43)

has a complex multivariate normal distribution, with mean vector μ_z = μ_1 − iμ_2 and p × p covariance matrix

Σ_z = C − iQ    (C.44)

if the real multivariate 2p × 1 normal vector x = (x_1′, x_2′)′ has a real multivariate normal distribution with mean vector μ = (μ_1′, μ_2′)′ and covariance matrix

Σ = (1/2) [ C  −Q
            Q   C ].    (C.45)

The restrictions C′ = C and Q′ = −Q are necessary for the matrix Σ to be a covariance matrix, and these conditions then imply Σ_z = Σ_z* is Hermitian. The probability density function of the complex multivariate normal vector z can be expressed in the concise form

p_z(z) = π^{−p} |Σ_z|^{−1} exp{ −(z − μ_z)* Σ_z^{−1} (z − μ_z) },    (C.46)

and this is the form that we will often use in the likelihood. The result follows from showing that p_x(x_1, x_2) = p_z(z) exactly, using the fact that the quadratic and Hermitian forms in the exponent are equal and that |Σ_x| = |Σ_z|^2. The second assertion follows directly from the fact that the matrix Σ_x has repeated eigenvalues, λ_1, λ_2, . . . , λ_p corresponding to eigenvectors (α_1′, α_2′)′, and the same set, λ_1, λ_2, . . . , λ_p, corresponding to (α_2′, −α_1′)′. Hence

|Σ_x| = ∏_{i=1}^{p} λ_i^2 = |Σ_z|^2.

For further material relating to the complex multivariate normal distribution, see Goodman (1963), Giri (1965), or Khatri (1965).


Example C.2 A Complex Normal Random Variable

To fix ideas, consider a very simple complex random variable

z = ℜ(z) − iℑ(z) = z_1 − i z_2,

where z_1 ∼ N(0, σ^2/2) independent of z_2 ∼ N(0, σ^2/2). Then the joint density of (z_1, z_2) is

f(z_1, z_2) ∝ σ^{−1} exp( −z_1^2/σ^2 ) × σ^{−1} exp( −z_2^2/σ^2 ) = σ^{−2} exp{ −(z_1^2 + z_2^2)/σ^2 }.

More succinctly, we write z ∼ N_c(0, σ^2), and

f(z) ∝ σ^{−2} exp( −z* z / σ^2 ).

In Fourier analysis, z_1 would be the cosine transform of the data at a fundamental frequency (excluding the end points) and z_2 the corresponding sine transform. If the process is Gaussian, z_1 and z_2 are independent normals with zero means and variances that are half of the spectral density at the particular frequency. Consequently, the definition of the complex normal distribution is natural in the context of spectral analysis.
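A tiny simulation (ours) of this last point: for Gaussian white noise with variance σ^2, so that f(ω) = σ^2, the cosine and sine transforms at a fundamental frequency should be nearly uncorrelated, each with variance σ^2/2. The sample size, variance, and frequency below are arbitrary.

set.seed(451)
n = 500; sigma2 = 3; j = 50                     # omega_j = j/n = .1
dd = replicate(4000, {
  w = rnorm(n, 0, sqrt(sigma2)); tt = 1:n
  c(sum(w*cos(2*pi*j/n*tt)), sum(w*sin(2*pi*j/n*tt)))/sqrt(n) })   # (dc, ds)
round(c(var.dc = var(dd[1,]), var.ds = var(dd[2,]), cor = cor(dd[1,], dd[2,])), 2)
# both variances are near sigma2/2 = 1.5 and the correlation is near 0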

Example C.3 A Bivariate Complex Normal Distribution

Consider the joint distribution of the complex random variables u_1 = x_1 − i x_2 and u_2 = y_1 − i y_2, where the partitioned vector (x_1, x_2, y_1, y_2)′ has a real multivariate normal distribution with mean (0, 0, 0, 0)′ and covariance matrix

Σ = (1/2) [  c_xx    0      c_xy   −q_xy
             0       c_xx   q_xy    c_xy
             c_xy    q_xy   c_yy    0
            −q_xy    c_yx   0       c_yy ].    (C.47)

Now, consider the conditional distribution of y = (y_1, y_2)′, given x = (x_1, x_2)′. Using (C.41), we obtain

E(y | x) = [ x_1  −x_2
             x_2   x_1 ] (b_1, b_2)′,    (C.48)

where

(b_1, b_2) = ( c_yx/c_xx, q_yx/c_xx ).    (C.49)

It is natural to identify the cross-spectrum

f_xy = c_xy − i q_xy,    (C.50)

so that the complex variable identified with the pair is just

b = b_1 − i b_2 = (c_yx − i q_yx)/c_xx = f_yx/f_xx,

and we identify it as the complex regression coefficient. The conditional covariance follows from (C.42) and simplifies to

cov(y | x) = (1/2) f_{y·x} I_2,    (C.51)

where I_2 denotes the 2 × 2 identity matrix and

f_{y·x} = c_yy − (c_xy^2 + q_xy^2)/c_xx = f_yy − |f_xy|^2/f_xx.    (C.52)

Example C.3 leads to an approach for justifying the distributional results for the sample coherence given in (4.97). That equation suggests that the result can be derived using the regression results that lead to the F-statistics in §2.2. Suppose that we consider L values of the sine and cosine transforms of the input x_t and output y_t, which we will denote by d_{x,c}(ω_k + ℓ/n), d_{x,s}(ω_k + ℓ/n), d_{y,c}(ω_k + ℓ/n), d_{y,s}(ω_k + ℓ/n), sampled at L = 2m + 1 frequencies, ℓ = −m, . . . , m, in the neighborhood of some target frequency ω. Suppose these cosine and sine transforms are re-indexed and denoted by d_{x,cj}, d_{x,sj}, d_{y,cj}, d_{y,sj}, for j = 1, 2, . . . , L, producing 2L real random variables with a large sample normal distribution that have limiting covariance matrices of the form (C.47) for each j. Then, the conditional normal distribution of the 2 × 1 vector (d_{y,cj}, d_{y,sj})′ given (d_{x,cj}, d_{x,sj})′, given in Example C.3, shows that we may write, approximately, the regression model

( d_{y,cj} )   ( d_{x,cj}  −d_{x,sj} ) ( b_1 )   ( V_{cj} )
( d_{y,sj} ) = ( d_{x,sj}   d_{x,cj} ) ( b_2 ) + ( V_{sj} ),

where V_{cj}, V_{sj} are approximately uncorrelated with approximate variances

E[V_{cj}^2] = E[V_{sj}^2] = (1/2) f_{y·x}.

Now, construct, by stacking, the 2L × 1 vectors y_c = (d_{y,c1}, . . . , d_{y,cL})′, y_s = (d_{y,s1}, . . . , d_{y,sL})′, x_c = (d_{x,c1}, . . . , d_{x,cL})′ and x_s = (d_{x,s1}, . . . , d_{x,sL})′, and rewrite the regression model as

( y_c )   ( x_c  −x_s ) ( b_1 )   ( v_c )
( y_s ) = ( x_s   x_c ) ( b_2 ) + ( v_s ),

where v_c and v_s are the error stacks. Finally, write the overall model as the regression model in Chapter 2, namely,

y = Zb + v,

making the obvious identifications in the previous equation. Conditional on Z, the model becomes exactly the regression model considered in Chapter 2, where there are q = 2 regression coefficients and 2L observations in the observation vector y. To test the hypothesis of no regression for that model, we use an F-statistic that depends on the difference between the residual sum of squares for the full model, say,

SSE = y′y − y′Z(Z′Z)^{−1}Z′y,    (C.53)

and the residual sum of squares for the reduced model, SSE_0 = y′y. Then,

F_{2,2L−2} = (L − 1) (SSE_0 − SSE)/SSE    (C.54)

has the F-distribution with 2 and 2L − 2 degrees of freedom. Also, it follows by substitution for y that

SSE_0 = y′y = y_c′y_c + y_s′y_s = Σ_{j=1}^{L} ( d_{y,cj}^2 + d_{y,sj}^2 ) = L f_y(ω),

which is just the sample spectrum of the output series. Similarly,

Z′Z = [ L f_x   0
        0      L f_x ]

and

Z′y = ( x_c′y_c + x_s′y_s )   ( Σ_{j=1}^{L} (d_{x,cj} d_{y,cj} + d_{x,sj} d_{y,sj}) )   ( L c_yx )
      ( x_c′y_s − x_s′y_c ) = ( Σ_{j=1}^{L} (d_{x,cj} d_{y,sj} − d_{x,sj} d_{y,cj}) ) = ( L q_yx )

together imply that

y′Z(Z′Z)^{−1}Z′y = L |f_xy|^2 / f_x.

Substituting into (C.54) gives

F_{2,2L−2} = (L − 1) ( |f_xy|^2/f_x ) / ( f_y − |f_xy|^2/f_x ),

which converts directly into the F-statistic (4.97), using the sample coherence defined in (4.96).
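In practice the conversion is immediate; the following sketch (our own choice of series and smoothing span, using R's spec.pgram for the smoothed spectral estimates and assuming tsa3.rda has been loaded so the soi and rec series are available) computes the squared coherence and the corresponding F-statistic with L = 2m + 1 = 11.

L = 11
sr = spec.pgram(cbind(soi, rec), spans = L, taper = 0, plot = FALSE)
Fstat = (L - 1)*sr$coh/(1 - sr$coh)             # F = (L-1) coh/(1 - coh), cf. (4.97)
plot(sr$freq, Fstat, type = "l", xlab = "frequency", ylab = "F-statistic")
abline(h = qf(.999, 2, 2*L - 2), lty = 2)       # approximate .001-level critical value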


Appendix R: R Supplement

R.1 First Things First

If you do not already have R, point your browser to the Comprehensive R Archive Network (CRAN), http://cran.r-project.org/ and download and install it. The installation includes help files and some user manuals. You can find helpful tutorials by following CRAN's link to Contributed Documentation. If you are new to R/S-PLUS, then R for Beginners by Emmanuel Paradis is a great introduction. There is also a lot of advice out there in cyberspace, but some of it will be outdated because R goes through many revisions.

Next, point your browser to http://www.stat.pitt.edu/stoffer/tsa3/, the website for the text, or one of its mirrors, download tsa3.rda and put it in a convenient place (e.g., the working directory of R).¹ This file contains the data sets and scripts that are used in the text. Then, start R and issue the command

1 load("tsa3.rda")

Once you have loaded tsa3.rda, all the files will stay in R as long as you save the workspace when you close R (details in §R.2). If you don't save the workspace, you will have to reload it. To see what is included in tsa3.rda

type

2 ls() # to get a listing of your objects, and

3 tsa3.version # to check the version

Periodically check that your version matches the latest version number (year-month-day) on the website. Please note that tsa3.rda is subject to change.

You are free to use the data or to alter the scripts in any way you see fit. We only have two requests: reference the text if you use something from it, and contact us if you find any errors or have suggestions for improvement of the code.

¹ See §R.2, page 567, on how to get the current working directory and how to change it, or page 569 on how to read files from other directories.


R.1.1 Included Data Sets

The data sets included in tsa3.rda, listed by the chapter in which they are first presented, are as follows.

Chapter 1
jj - Johnson & Johnson quarterly earnings per share, 84 quarters (21 years) measured from the first quarter of 1960 to the last quarter of 1980.
EQ5 - Seismic trace of an earthquake [two phases or arrivals along the surface, the primary wave (t = 1, . . . , 1024) and the shear wave (t = 1025, . . . , 2048)] recorded at a seismic station.
EXP6 - Seismic trace of an explosion (similar details as EQ5).
gtemp - Global mean land-ocean temperature deviations (from 1951-1980 average), measured in degrees centigrade, for the years 1880-2009; data taken from http://data.giss.nasa.gov/gistemp/graphs/
fmri1 - A data frame that consists of fmri bold signals at eight locations (in columns 2-9, column 1 is time period), when a stimulus was applied for 32 seconds and then stopped for 32 seconds. The signal period is 64 seconds and the sampling rate was one observation every 2 seconds for 256 seconds (n = 128).
soi - Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987.
rec - Recruitment (number of new fish) associated with SOI.
speech - A small .1 second (1000 points) sample of recorded speech for the phrase aaa···hhh.
nyse - Returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991.
soiltemp - A 64 × 36 matrix of surface soil temperatures.

Chapter 2
oil - Crude oil, WTI spot price FOB (in dollars per barrel), weekly data from 2000 to mid-2010. For definitions and more details, see http://tonto.eia.doe.gov/dnav/pet/pet_pri_spt_s1_w.htm.
gas - New York Harbor conventional regular gasoline weekly spot price FOB (in cents per gallon) over the same time period as oil.
varve - Sedimentary deposits from one location in Massachusetts for 634 years, beginning nearly 12,000 years ago.
cmort - Average weekly cardiovascular mortality in Los Angeles County; 508 six-day smoothed averages obtained by filtering daily values over the 10 year period 1970-1979.
tempr - Temperature series corresponding to cmort.
part - Particulate series corresponding to cmort.
so2 - Sulfur dioxide series corresponding to cmort.

Chapter 3
prodn - Monthly Federal Reserve Board Production Index (1948-1978, n = 372 months).
unemp - Unemployment series corresponding to prodn.
ar1boot - Data used in Example 3.35 on page 137.
gnp - Quarterly U.S. GNP from 1947(1) to 2002(3), n = 223 observations.
birth - Monthly live births (adjusted) in thousands for the United States, 1948-1979.

Chapter 4
sunspotz - Biannual smoothed (12-month moving average) number of sunspots from June 1749 to December 1978; n = 459. The "z" on the end is to distinguish this series from the one included with R (called sunspots).
salt - Salt profiles taken over a spatial grid set out on an agricultural field, 64 rows at 17-ft spacing.
saltemp - Temperature profiles corresponding to salt.

Chapter 5
arf - 1000 simulated observations from an ARFIMA(1, 1, 0) model with φ = .75 and d = .4.
flu - Monthly pneumonia and influenza deaths per 10,000 people in the United States for 11 years, 1968 to 1978.
sales - Sales (with lead, a leading indicator), 150 months; taken from Box & Jenkins (1970).
lead - See sales.
econ5 - Data set containing quarterly U.S. unemployment, GNP, consumption, and government and private investment, from 1948-III to 1988-II.

Chapter 6
ar1miss - Data for Problem 6.14 on page 403.
gtemp2 - Similar to gtemp but the data are based only on surface air temperature data obtained from meteorological stations.
qinfl - Quarterly inflation rate in the Consumer Price Index from 1953-I to 1980-II, n = 110 observations; from Newbold and Bos (1985).
qintr - Quarterly interest rate recorded for Treasury bills over the same period as qinfl.
WBC - Measurements made for 91 days on the three variables, log(white blood count) [WBC], log(platelet) [PLT] and hematocrit [HCT]; taken from Jones (1984).
PLT - See WBC.
HCT - See WBC.

Chapter 7
beamd - Infrasonic signals from a nuclear explosion. This is a data frame consisting of three columns (which are not time series objects) that are data from different channels. The series names are sensor1, sensor2, sensor3. See Example 7.2 on page 421 for more information.
bnrf1ebv - Nucleotide sequence of the BNRF1 gene of the Epstein-Barr virus (EBV): 1=A, 2=C, 3=G, 4=T. The data are used in §7.9.
bnrf1hvs - Nucleotide sequence of the BNRF1 gene of the herpes virus saimiri (HVS); same codes as for EBV.
fmri - Data (as a vector list) from an fMRI experiment in pain, listed by location and stimulus. The specific locations of the brain where the signal was measured were [1] Cortex 1: Primary Somatosensory, Contralateral, [2] Cortex 2: Primary Somatosensory, Ipsilateral, [3] Cortex 3: Secondary Somatosensory, Contralateral, [4] Cortex 4: Secondary Somatosensory, Ipsilateral, [5] Caudate, [6] Thalamus 1: Contralateral, [7] Thalamus 2: Ipsilateral, [8] Cerebellum 1: Contralateral and [9] Cerebellum 2: Ipsilateral. The stimuli (and number of subjects in each condition) are [1] Awake-Brush (5 subjects), [2] Awake-Heat (4 subjects), [3] Awake-Shock (5 subjects), [4] Low-Brush (3 subjects), [5] Low-Heat (5 subjects), and [6] Low-Shock (4 subjects). Issue the command summary(fmri) for further details. As an example, fmri$L1T6 (location 1, treatment 6) will show the data for the four subjects receiving the Low-Shock treatment at the Cortex 1 location; note that fmri[[6]] will display the same data. See Examples 7.7–7.9 for examples.
climhyd - Lake Shasta inflow data; see Example 7.1. This is a data frame with column names: Temp, DewPt, CldCvr, WndSpd, Precip, Inflow.
eqexp - This is a data frame of the earthquake and explosion seismic series used throughout the text. The matrix has 17 columns, the first eight are earthquakes, the second eight are explosions, and the last column is the Novaya Zemlya series. The column names are: EQ1, EQ2,...,EQ8; EX1, EX2,...,EX8; NZ.

R.1.2 Included Scripts

The following scripts are included in tsa3.rda. At the end of the description of each script, a text example that demonstrates the use of the script is given.

lag.plot2(series1, series2, max.lag=0, corr=TRUE, smooth=TRUE)

Produces a grid of scatterplots of one series versus another. If (xt, yt) is a vector

time series, then lag.plot2(x,y,m) will generate a grid of scatterplots of xt−hversus yt for h = 0, 1, ...,m, along with the cross-correlation values (corr=TRUE)

and a lowess fit (smooth=TRUE) assuming xt is in x and yt is in y. Note that the first

series, xt, is the one that gets lagged. If you just want the scatterplots and nothing

else, then use lag.plot2(x,y,m,corr=FALSE,smooth=FALSE). See Example 2.7 on

page 64 for a demonstration.

lag.plot1(series, max.lag=1, corr=TRUE, smooth=TRUE)

Produces a grid of scatterplots of a series versus lagged values of the series. Sim-

ilar to lag.plot2, the call lag.plot1(x,m) will generate a grid of scatterplots of

xt−h versus xt for h = 1, ...,m, along with the autocorrelation values (corr=TRUE)

and a lowess fit (smooth=TRUE). The defaults are the same as lag.plot2; if you

don’t want either the correlation values or the lowess fit, you can either use

lag.plot1(x,m,corr=FALSE,smooth=FALSE) or R’s lag.plot. See Example 2.7

on page 64 for a demonstration.


acf2(series, max.lag=NULL)

Produces a simultaneous plot (and a printout) of the sample ACF and PACF on

the same scale. If x contains n observations, acf2(x) will print and plot the ACF

and PACF of x to the default lag of √n + 10 (unless n is smaller than 50). The

number of lags may be specified, e.g., acf2(x, 33). See Example 3.17 on page 108.

sarima(series, p, d, q, P=0, D=0, Q=0, S=-1, details=TRUE,

tol=sqrt(.Machine$double.eps), no.constant=FALSE)

Fits ARIMA models including diagnostics in a short command. If your time series is in x and you want to fit an ARIMA(p, d, q) model to the data, the basic call is sarima(x,p,d,q). The results are the parameter estimates, standard errors, AIC, AICc, BIC (as defined in Chapter 2) and diagnostics. To fit a seasonal ARIMA model, the basic call is sarima(x,p,d,q,P,D,Q,S). So, for example, sarima(x,2,1,0) will fit an ARIMA(2, 1, 0) model to the series in x, and sarima(x,2,1,0,0,1,1,12) will fit a seasonal ARIMA(2, 1, 0) × (0, 1, 1)_12 model to the series in x. If you want to look at the innovations (i.e., the residuals) from the fit, they're stored in innov.

There are three additional options that can be included in the call.

• details turns on/off the output from the nonlinear optimization routine, which is optim. The default is TRUE; use details=FALSE to turn off the output, e.g., sarima(x,2,1,0,details=FALSE).

• tol controls the relative tolerance (reltol) used to assess convergence in sarima and sarima.for. The default is tol=sqrt(.Machine$double.eps), the R default. For details, see the help file for optim under the control arguments. For example, sarima(rec,2,0,0,tol=.0001) will speed up the convergence. If there are many parameters to estimate (e.g., seasonal models), the analysis may take a long time using the default.

• no.constant can control whether or not sarima includes a constant in the model. In particular, with sarima, if there is no differencing (d = 0 and D = 0) you get the mean estimate. If there's differencing of order one (either d = 1 or D = 1, but not both), a constant term is included in the model; this may be overridden by setting this to TRUE; e.g., sarima(x,1,1,0,no.constant=TRUE). In any other situation, no constant or mean term is included in the model. The idea is that if you difference more than once (d + D > 1), any drift is likely to be removed.

See Examples 3.38, 3.39, 3.40, 3.42, and 3.46 on pages 145–159 for demonstra-

tions.

sarima.for(series, n.ahead, p, d, q, P=0, D=0, Q=0, S=-1,

tol=sqrt(.Machine$double.eps), no.constant=FALSE)

Gives ARIMA forecasts. Similar to sarima, to forecast n.ahead time points from

an ARIMA fit to the data in x, the form is sarima.for(x, n.ahead, p, d, q)

or sarima.for(x, n.ahead, p, d, q, P, D, Q, S) for a seasonal model. For

example, sarima.for(x,5,1,0,1) will forecast five time points ahead for an

ARMA(1,1) fit to x. The output prints the forecasts and the standard errors

of the forecasts, and supplies a graphic of the forecast with ±2 prediction error


bounds. The options tol and no.constant are also available. See Example 3.46

on page 159.

spec.arma(ar=0, ma=0, var.noise=1, n.freq=500, ...)

Gives the ARMA spectrum (on a log scale), tests for causality, invertibility, and

common zeros. The basic call is spec.arma(ar, ma) where ar and ma are vectors

containing the model parameters. Use log="no" if you do not want the plot on

a log scale. If the model is not causal or invertible an error message is given. If

there are common zeros, a spectrum will be displayed and a warning will be given;

e.g., spec.arma(ar= .9, ma= -.9) will yield a warning and the plot will be the

spectrum of white noise. The variance of the noise can be changed with var.noise.

Finally, the frequencies and the spectral density ordinates are returned invisibly,

e.g., spec.arma(ar=.9)$freq and spec.arma(ar=.9)$spec, if you’re interested in

the actual values. See Example 4.6 on page 184.

LagReg(input, output, L=c(3,3), M=20, threshold=0, inverse=FALSE)

Performs lagged regression as discussed in Chapter 4, §4.10. For a bivariate series,

input is the input series and output is the output series. The degree of smoothing

for the spectral estimate is given by L; see spans in the help file for spec.pgram.

The number of terms used in the lagged regression approximation is given by M,

which must be even. The threshold value is the cut-off used to set small (in abso-

lute value) regression coefficients equal to zero (it is easiest to run LagReg twice,

once with the default threshold of zero, and then again after inspecting the result-

ing coefficients and the corresponding values of the CCF). Setting inverse=TRUE

will fit a forward-lagged regression; the default is to run a backward-lagged regres-

sion. The script is based on code that was contributed by Professor Doug Wiens,

Department of Mathematical and Statistical Sciences, University of Alberta. See

Example 4.24 on page 244 for a demonstration.

SigExtract(series, L=c(3,3), M=50, max.freq=.05)

Performs signal extraction and optimal filtering as discussed in Chapter 4, §4.11.

The basic function of the script, and the default setting, is to remove frequencies

above 1/20 (and, in particular, the seasonal frequency of 1 cycle every 12 time

points). The time series to be filtered is series, and its sampling frequency is set

to unity (∆ = 1). The values of L and M are the same as in LagReg and max.freq

denotes the truncation frequency, which must be larger than 1/M. The filtered

series is returned silently; e.g., f.x = SigExtract(x) will store the extracted signal

in f.x. The script is based on code that was contributed by Professor Doug Wiens,

Department of Mathematical and Statistical Sciences, University of Alberta. See

Example 4.25 on page 249 for a demonstration.

Kfilter0(n, y, A, mu0, Sigma0, Phi, cQ, cR)

Returns the filtered values in Property 6.1 on page 326 for the state-space model,

(6.1)–(6.2). In addition, the script returns the evaluation of the likelihood at the

given parameter values and the innovation sequence. The inputs are n: number of


observations; y: data matrix; A: observation matrix (assumed constant); mu0: initial

state mean; Sigma0: initial state variance-covariance matrix; Phi: state transition

matrix; cQ: Cholesky decomposition of Q [cQ=chol(Q)]; cR: Cholesky decompo-

sition of R [cR=chol(R)]. Note: The script requires only that Q or R may be

reconstructed as t(cQ)%*%cQ or t(cR)%*%cR, which offers a little more flexibility

than requiring Q or R to be positive definite. For demonstrations, see Example 6.6

on page 336, Example 6.8 on page 342, and Example 6.10 on page 350.

Ksmooth0(n, y, A, mu0, Sigma0, Phi, cQ, cR)

Returns both the filtered values in Property 6.1 on page 326 and the smoothed

values in Property 6.2 on page 330 for the state-space model, (6.1)–(6.2). The

inputs are the same as Kfilter0. For demonstrations, see Example 6.5 on page 331,

and Example 6.10 on page 350.

EM0(n, y, A, mu0, Sigma0, Phi, cQ, cR, max.iter=50, tol=.01)

Estimation of the parameters in the model (6.1)–(6.2) via the EM algorithm. Most

of the inputs are the same as for Ksmooth0 and the script uses Ksmooth0. To control

the number of iterations, use max.iter (set to 50 by default) and to control the

relative tolerance for determining convergence, use tol (set to .01 by default). For

a demonstration, see Example 6.8 on page 342.

Kfilter1(n, y, A, mu0, Sigma0, Phi, Ups, Gam, cQ, cR, input)

Returns the filtered values in Property 6.1 on page 326 for the state-space model,

(6.3)–(6.4). In addition, the script returns the evaluation of the likelihood at the

given parameter values and the innovation sequence. The inputs are n: number of

observations; y: data matrix; A: observation matrix, an array with dim=c(q,p,n);

mu0: initial state mean; Sigma0: initial state variance-covariance matrix; Phi:

state transition matrix; Ups: state input matrix; Gam: observation input matrix;

cQ: Cholesky decomposition of Q; cR: Cholesky decomposition of R [the note in

Kfilter0 applies here]; input: matrix of inputs having the same row dimension

as y. Set Ups or Gam or input to 0 (zero) if they are not used. For demonstrations,

see Example 6.7 on page 338 and Example 6.9 on page 348.

Ksmooth1(n, y, A, mu0, Sigma0, Phi, Ups, Gam, cQ, cR, input)

Returns both the filtered values in Property 6.1 on page 326 and the smoothed

values in Property 6.2 on page 330 for the state-space model, (6.3)–(6.4). The

inputs are the same as Kfilter1. See Example 6.7 on page 338 and Example 6.9

on page 348.

EM1(n, y, A, mu0, Sigma0, Phi, Ups, Gam, cQ, cR, input, max.iter=50,

tol=.01)

Estimation of the parameters in the model (6.3)–(6.4) via the EM algorithm. Most

of the inputs are the same as for Ksmooth1 and the script uses Ksmooth1. To control

the number of iterations, use max.iter (set to 50 by default) and to control the

relative tolerance for determining convergence, use tol (set to .01 by default). For

a demonstration, see Example 6.12 on page 357.


Kfilter2(n, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ, cR, S, input)

Returns the filtered values in Property 6.5 on page 354 for the state-space model,

(6.97)–(6.99). In addition, the script returns the evaluation of the likelihood at

the given parameter values and the innovation sequence. The inputs are similar

to Kfilter1, except that the noise covariance matrix, S must be included. For

demonstrations, see Example 6.11 on page 356 and Example 6.13 on page 361.

Ksmooth2(n, y, A, mu0, Sigma0, Phi, Ups, Gam, Theta, cQ, cR, S, input)

This is the smoother companion to Kfilter2.

SVfilter(n, y, phi0, phi1, sQ, alpha, sR0, mu1, sR1)

Performs the special case switching filter for the stochastic volatility model,

(6.173), (6.175)–(6.176). The state parameters are phi0, phi1, sQ [φ0, φ1, σw],

and alpha, sR0, mu1, sR1 [α, σ0, µ1, σ1] are observation equation parameters as

presented in Section 6.9. See Example 6.18 page 380 and Example 6.19 page 383.

mvspec(x, spans = NULL, kernel = NULL, taper = 0, pad = 0, fast = TRUE,

demean = TRUE, detrend = FALSE, plot = FALSE,

na.action = na.fail, ...)

This is spec.pgram with a few changes in the defaults and written so you can

extract the estimate of the multivariate spectral matrix as fxx. For example, if x

contains a p-variate time series (i.e., the p columns of x are time series), and you

issue the command spec = mvspec(x, spans=3) say, then spec$fxx is an array

with dimensions dim=c(p,p,nfreq), where nfreq is the number of frequencies

used. If you print spec$fxx, you will see nfreq p × p spectral matrix estimates.

See Example 7.12 on page 461 for a demonstration.

FDR(pvals, qlevel=0.001)

Computes the basic false discovery rate given a vector of p-values; see Example 7.4

on page 427 for a demonstration.

stoch.reg(data, cols.full, cols.red, alpha, L, M, plot.which)

Performs frequency domain stochastic regression discussed in §7.3. Enter the entire

data matrix (data), and then the corresponding columns of input series in the full

model (cols.full) and in the reduced model (cols.red; use NULL if there are

no inputs under the reduced model). The response variable should be the last

column of the data matrix, and this need not be specified among the inputs.

Other arguments are alpha (test size), L (smoothing), M (number of points in the

discretization of the integral) and plot.which = coh or F.stat, to plot either the

squared-coherencies or the F -statistics. The coefficients of the impulse response

function are returned and plotted. The script is based on code that was contributed

by Professor Doug Wiens, Department of Mathematical and Statistical Sciences,

University of Alberta. See Example 7.1 on page 417 for a demonstration.


R.2 Getting Started

If you are experienced with R/S-PLUS you can skip this section, and perhaps the rest of this appendix. Otherwise, it is essential to have R up and running before you start this tutorial. The best way to use the rest of this appendix is to start up R and enter the example code as it is presented. Also, you can use the results and help files to get a better understanding of how R works (or doesn't work). The character # is used for comments.

The convention throughout the text is that R code is in typewriter font

with a small line number in the left margin. Get comfortable, then start her up and try some simple tasks.

1 2+2 # addition

[1] 4

2 5*5 + 2 # multiplication and addition

[1] 27

3 5/5 - 3 # division and subtraction

[1] -2

4 log(exp(pi)) # log, exponential, pi

[1] 3.141593

5 sin(pi/2) # sinusoids

[1] 1

6 exp(1)^(-2) # power

[1] 0.1353353

7 sqrt(8) # square root

[1] 2.828427

8 1:5 # sequences

[1] 1 2 3 4 5

9 seq(1, 10, by=2) # sequences

[1] 1 3 5 7 9

10 rep(2,3) # repeat 2 three times

[1] 2 2 2

Next, we’ll use assignment to make some objects:

1 x <- 1 + 2 # put 1 + 2 in object x

2 x = 1 + 2 # same as above with fewer keystrokes

3 1 + 2 -> x # same

4 x # view object x

[1] 3

5 (y = 9*3) # put 9 times 3 in y and view the result

[1] 27

6 (z = rnorm(5,0,1)) # put 5 standard normals into z and print z

[1] 0.96607946 1.98135811 -0.06064527 0.31028473 0.02046853

To list your objects, remove objects, get help, find out which directory is current (or to change it) or to quit, use the following commands:


1 ls() # list all objects

[1] "dummy" "mydata" "x" "y" "z"

2 ls(pattern = "my") # list every object that contains "my"

[1] "dummy" "mydata"

3 rm(dummy) # remove object "dummy"

4 help.start() # html help and documentation (use it)

5 help(exp) # specific help (?exp is the same)

6 getwd() # get working directory

7 setwd("/TimeSeries/") # change working directory to TimeSeries

8 q() # end the session (keep reading)

When you quit, R will prompt you to save an image of your current workspace. Answering "yes" will save all the work you have done so far, and load it up when you next start R. Our suggestion is to answer "yes" even though you will also be loading irrelevant past analyses every time you start R. Keep in mind that you can remove items via rm(). If you do not save the workspace, you will have to reload tsa3.rda as described in §R.1.

To create your own data set, you can make a data vector as follows:

1 mydata = c(1,2,3,2,1)

Now you have an object called mydata that contains five elements. R calls these objects vectors even though they have no dimensions (no rows, no columns); they do have order and length:

2 mydata # display the data

[1] 1 2 3 2 1

3 mydata[3] # the third element

[1] 3

4 mydata[3:5] # elements three through five

[1] 3 2 1

5 mydata[-(1:2)] # everything except the first two elements

[1] 3 2 1

6 length(mydata) # number of elements

[1] 5

7 dim(mydata) # no dimensions

NULL

8 mydata = as.matrix(mydata) # make it a matrix

9 dim(mydata) # now it has dimensions

[1] 5 1

It is worth pointing out R's recycling rule for doing arithmetic. The rule is extremely helpful for shortening code, but it can also lead to mistakes if you are not careful. Here are some examples.

1 x = c(1, 2, 3, 4); y = c(2, 4, 6, 8); z = c(10, 20)

2 x*y # it’s 1*2, 2*4, 3*6, 4*8

[1] 2 8 18 32

3 x/z # it’s 1/10, 2/20, 3/10, 4/20

[1] 0.1 0.1 0.3 0.2


4 x+z # guess

[1] 11 22 13 24

If you have an external data set, you can use scan or read.table to input the data. For example, suppose you have an ascii (text) data file called dummy.dat in a directory called TimeSeries in your root directory, and the file looks like this:

1 2 3 2 1
9 0 2 1 0

1 dummy = scan("dummy.dat") # if TimeSeries is the working directory

2 (dummy = scan("/TimeSeries/dummy.dat")) # if not, do this

Read 10 items

[1] 1 2 3 2 1 9 0 2 1 0

3 (dummy = read.table("/TimeSeries/dummy.dat"))

V1 V2 V3 V4 V5

1 2 3 2 1

9 0 2 1 0

There is a difference between scan and read.table. The former produced a data vector of 10 items while the latter produced a data frame with names V1 to V5 and two observations per variate. In this case, if you want to list (or use) the second variate, V2, you would use

4 dummy$V2

[1] 2 0

and so on. You might want to look at the help files ?scan and ?read.table

now. Data frames (?data.frame) are "used as the fundamental data structure by most of R's modeling software." Notice that R gave the columns of dummy generic names, V1, ..., V5. You can provide your own names and then use the names to access the data without the use of $ as in line 4 above.

5 colnames(dummy) = c("Dog", "Cat", "Rat", "Pig", "Man")

6 attach(dummy)

7 Cat

[1] 2 0

R is case sensitive, thus cat and Cat are different. Also, cat is a reserved name (?cat) in R, so using "cat" instead of "Cat" may cause problems later. You may also include a header in the data file to avoid using line 5 above; type ?read.table for further information.

It can't hurt to learn a little about programming in R because you will see some of it along the way. Consider a simple program that we will call crazy to produce a graph of a sequence of sample means of increasing sample sizes from a Cauchy distribution with location parameter zero. The code is:

1 crazy <- function(num) {

2 x <- rep(NA, num)

3 for (n in 1:num) x[n] <- mean(rcauchy(n))

4 plot(x, type="l", xlab="sample size", ylab="sample mean")

5 }


[Figure R.1 shows the resulting plot of the running sample means (y-axis: sample mean) against sample size (x-axis).]

Fig. R.1. Crazy example.

The first line creates the function crazy and gives it one argument, num, that is the sample size that will end the sequence. Line 2 makes a vector, x, of num missing values NA, that will be used to store the sample means. Line 3 generates n random Cauchy variates [rcauchy(n)], finds the mean of those values, and puts the result into x[n], the n-th value of x. The process is repeated in a "do loop" num times so that x[1] is the sample mean from a sample of size one, x[2] is the sample mean from a sample of size two, and so on, until finally, x[num] is the sample mean from a sample of size num. After the do loop is complete, the fourth line generates a graphic (see Figure R.1). The fifth line closes the function. To use crazy with a limit sample size of 100, for example, type

6 crazy(100)

and you will get a graphic that looks like Figure R.1.

You may want to use one of the R packages. In this case you have to first

download the package and then install it. For example,

1 install.packages(c("wavethresh", "tseries"))

will download and install the packages wavethresh that we use in Chapter 4 and tseries that we use in Chapter 5; you will be asked to choose the closest mirror to you. To use a package, you have to load it at each start up of R, for example:

2 library(wavethresh) # load the wavethresh package

A good way to get help for a package is to use html help

3 help.start()

and follow the Packages link.

Finally, a word of caution: TRUE and FALSE are reserved words, whereas T and F are initially set to these. Get in the habit of using the words rather than the letters T or F because you may get into trouble if you do something like F = qf(p=.01, df1=3, df2=9), so that F is no longer FALSE, but a quantile of the specified F-distribution.


R.3 Time Series Primer

In this section, we give a brief introduction on using R for time series. We assume that tsa3.rda has been loaded. To create a time series object, use the command ts. Related commands are as.ts to coerce an object to a time series and is.ts to test whether an object is a time series.

First, make a small data set:

1 (mydata = c(1,2,3,2,1)) # make it and view it

[1] 1 2 3 2 1

Now make it a time series:

2 (mydata = as.ts(mydata))

Time Series:

Start = 1

End = 5

Frequency = 1

[1] 1 2 3 2 1

Make it an annual time series that starts in 1950:

3 (mydata = ts(mydata, start=1950))

Time Series:

Start = 1950

End = 1954

Frequency = 1

[1] 1 2 3 2 1

Now make it a quarterly time series that starts in 1950-III:

4 (mydata = ts(mydata, start=c(1950,3), frequency=4))

     Qtr1 Qtr2 Qtr3 Qtr4
1950              1    2
1951    3    2    1

5 time(mydata) # view the sampled times

        Qtr1    Qtr2    Qtr3    Qtr4
1950                 1950.50 1950.75
1951 1951.00 1951.25 1951.50

To use part of a time series object, use window():

6 (x = window(mydata, start=c(1951,1), end=c(1951,3)))

Qtr1 Qtr2 Qtr3

1951 3 2 1

Next, we’ll look at lagging and differencing. First make a simple series, xt:

1 x = ts(1:5)

Now, column bind (cbind) lagged values of xt and you will notice that lag(x) is forward lag, whereas lag(x, -1) is backward lag (we display the time series attributes in a single row of the output to save space).

2 cbind(x, lag(x), lag(x,-1))


Time Series: Start = 0 End = 6 Frequency = 1

x lag(x) lag(x, -1)

0 NA 1 NA

1 1 2 NA

2 2 3 1

3 3 4 2 <- in this row, for example, x is 3,

4 4 5 3 lag(x) is ahead at 4, and

5 5 NA 4 lag(x,-1) is behind at 2

6 NA NA 5

Compare cbind and ts.intersect:

3 ts.intersect(x, lag(x,1), lag(x,-1))

Time Series: Start = 2 End = 4 Frequency = 1

x lag(x, 1) lag(x, -1)

2 2 3 1

3 3 4 2

4 4 5 3

To difference a series, ∇xt = xt − xt−1, use

1 diff(x)

but note that

2 diff(x, 2)

is not second order differencing, it is xt − xt−2. For second order differencing, that is, ∇2xt, do this:

3 diff(diff(x))

and so on for higher order differencing.

For graphing time series, there are a few standard plotting mechanisms that we use repeatedly. If x is a time series, then plot(x) will produce a time plot. If x is not a time series object, then plot.ts(x) will coerce it into a time plot as will ts.plot(x). There are differences, which we explore in the following. It would be a good idea to skim the graphical parameters help file (?par) while you are here.² See Figure R.2 for the resulting graphic.

1 x = -5:5 # x is NOT a time series object

2 y = 5*cos(x) # neither is y

3 op = par(mfrow=c(3,2)) # multifigure setup: 3 rows, 2 cols

4 plot(x, main="plot(x)")

5 plot(x, y, main="plot(x,y)")

6 plot.ts(x, main="plot.ts(x)")

7 plot.ts(x, y, main="plot.ts(x,y)")

8 ts.plot(x, main="ts.plot(x)")

9 ts.plot(ts(x), ts(y), col=1:2, main="ts.plot(x,y)")

10 par(op) # reset the graphics parameters [see footnote]

² In the plot example, the parameter set up uses op = par(...) and ends with par(op); these lines are used to reset the graphic parameters to their previous settings. Please make a note of this because we do not display these commands ad nauseam in the text.


[Figure R.2 shows six panels titled plot(x), plot(x,y), plot.ts(x), plot.ts(x,y), ts.plot(x), and ts.plot(x,y).]

Fig. R.2. Demonstration of different R graphic tools for plotting time series.

We will also make use of regression via lm(). First, suppose we want to fit a simple linear regression, y = α + βx + ε. In R, the formula is written as y~x:

1 set.seed(1999) # so you can reproduce the result

2 x = rnorm(10,0,1)

3 y = x + rnorm(10,0,1)

4 summary(fit <- lm(y~x))

Residuals:

Min 1Q Median 3Q Max

-0.8851 -0.3867 0.1325 0.3896 0.6561

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.2576 0.1892 1.362 0.2104

x 0.4577 0.2016 2.270 0.0529

Residual standard error: 0.58 on 8 degrees of freedom

Multiple R-squared: 0.3918, Adjusted R-squared: 0.3157

F-statistic: 5.153 on 1 and 8 DF, p-value: 0.05289

5 plot(x, y) # draw a scatterplot of the data (not shown)

6 abline(fit) # add the fitted line to the plot (not shown)


All sorts of information can be extracted from the lm object, which we called fit. For example,

7 resid(fit) # will display the residuals (not shown)

8 fitted(fit) # will display the fitted values (not shown)

9 lm(y ~ 0 + x) # will exclude the intercept (not shown)

You have to be careful if you use lm() for lagged values of a time series. If you use lm(), then what you have to do is "tie" the series together using ts.intersect. If you do not tie the series together, they will not be aligned properly. Please read the warning Using time series in the lm() help file [help(lm)]. Here is an example regressing Chapter 2 data, weekly cardiovascular mortality (cmort) on particulate pollution (part) at the present value and lagged four weeks (part4). First, we create a data frame called ded

that consists of the three series:

1 ded = ts.intersect(cmort, part, part4=lag(part,-4), dframe=TRUE)

Now the series are all aligned and the regression will work.

2 fit = lm(cmort~part+part4, data=ded, na.action=NULL)

3 summary(fit)

Call: lm(formula=cmort~part+part4,data=ded,na.action=NULL)

Residuals:

Min 1Q Median 3Q Max

-22.7429 -5.3677 -0.4136 5.2694 37.8539

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 69.01020 1.37498 50.190 < 2e-16

part 0.15140 0.02898 5.225 2.56e-07

part4 0.26297 0.02899 9.071 < 2e-16

Residual standard error: 8.323 on 501 degrees of freedom

Multiple R-Squared: 0.3091, Adjusted R-squared: 0.3063

F-statistic: 112.1 on 2 and 501 DF, p-value: < 2.2e-16

There was no need to rename lag(part,-4) to part4, it's just an example of what you can do. There is a package called dynlm that makes it easy to fit lagged regressions. The basic advantage of dynlm is that it avoids having to make a data frame; that is, line 1 would be avoided.
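For instance, assuming the dynlm package has been installed, the fit above can be reproduced in one step (L(part, 4) is dynlm's notation for part lagged four weeks), and the coefficients should essentially agree with the lm() fit:

library(dynlm)
fit2 = dynlm(cmort ~ part + L(part, 4))   # no data frame needed; the series are aligned internally
summary(fit2)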

In Problem 2.1, you are asked to fit a regression model

xt = βt+ α1Q1(t) + α2Q2(t) + α3Q3(t) + α4Q4(t) + wt

where xt is logged Johnson & Johnson quarterly earnings (n = 84), and Qi(t) is the indicator of quarter i = 1, 2, 3, 4. The indicators can be made using factor.

1 trend = time(jj) - 1970 # helps to ‘center’ time

2 Q = factor(rep(1:4, 21)) # make (Q)uarter factors

3 reg = lm(log(jj)~0 + trend + Q, na.action=NULL) # no intercept

4 model.matrix(reg) # view the model matrix


trend Q1 Q2 Q3 Q4

1 -10.00 1 0 0 0

2 -9.75 0 1 0 0

3 -9.50 0 0 1 0

4 -9.25 0 0 0 1

. . . . . .

. . . . . .

83 10.50 0 0 1 0

84 10.75 0 0 0 1

5 summary(reg) # view the results (not shown)
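Incidentally, because jj is a quarterly ts object, the quarter factor can also be built directly from the series itself; a small sketch using the standard cycle function rather than rep:

Q = factor(cycle(jj))  # cycle(jj) gives the quarter (1-4) of each observation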

The workhorse for ARIMA simulations is arima.sim. Here are some examples; no output is shown here so you're on your own.

1 x = arima.sim(list(order=c(1,0,0),ar=.9),n=100)+50 # AR(1) w/mean 50

2 x = arima.sim(list(order=c(2,0,0),ar=c(1,-.9)),n=100) # AR(2)

3 x = arima.sim(list(order=c(1,1,1),ar=.9,ma=-.5),n=200) # ARIMA(1,1,1)
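Two further variations one might try (a sketch, not from the text; the sd argument is passed through to rnorm, the default innovation generator):

x = arima.sim(list(order=c(0,0,2), ma=c(.5,-.3)), n=100)  # MA(2)
x = arima.sim(list(order=c(1,0,0), ar=.9), n=100, sd=2)   # AR(1) with innovation sd of 2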

Next, we'll discuss ARIMA estimation. This gets a bit tricky because R is not useR friendly when it comes to fitting ARIMA models. Much of the story is spelled out in the "R Issues" page of the website for the text. In Chapter 3, we use the scripts acf2, sarima, and sarima.for that are included with tsa3.Rda. But we will also show you how to use the scripts included with R.

First, we'll fit an ARMA(1,1) model to some simulated data (with diagnostics and forecasting):

1 set.seed(666)

2 x = 50 + arima.sim(list(order=c(1,0,1), ar=.9, ma=-.5), n=200)

3 acf(x); pacf(x) # display sample ACF and PACF ... or ...

4 acf2(x) # use our script (no output shown)

5 (x.fit = arima(x, order = c(1, 0, 1))) # fit the model

Call: arima(x = x, order = c(1, 0, 1))

Coefficients:

ar1 ma1 intercept

0.8340 -0.432 49.8960

s.e. 0.0645 0.111 0.2452

sigma^2 estimated as 1.070: log likelihood = -290.79, aic = 589.58
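Rough 95% intervals for the parameters can be formed as estimate ± 2 standard errors. A small sketch pulling these from the fit (coef and var.coef are components of the object returned by arima):

est = x.fit$coef                  # parameter estimates
se  = sqrt(diag(x.fit$var.coef))  # their standard errors
cbind(lower = est - 2*se, upper = est + 2*se)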

Note that the reported intercept estimate is an estimate of the mean and not the constant; here the implied constant is (1 − .834) × 49.896 ≈ 8.3. That is, the fitted model is

$$x_t - 49.896 = .834\,(x_{t-1} - 49.896) - .432\, w_{t-1} + w_t$$

where $\hat{\sigma}^2_w = 1.070$. Diagnostics can be accomplished as follows,

6 tsdiag(x.fit, gof.lag=20) # ?tsdiag for details (don't use this!!)

but the Ljung-Box-Pierce test is not correct because it does not take into account the fact that the residuals are from a fitted model. If the analysis is repeated using the sarima script, a partial output would look like the following (sarima will also display the correct diagnostics as a graphic; e.g., see Figure 3.17 on page 151):


1 sarima(x, 1, 0, 1)

Coefficients:

ar1 ma1 xmean

0.8340 -0.432 49.8960

s.e. 0.0645 0.111 0.2452

sigma^2 estimated as 1.070: log likelihood = -290.79, aic = 589.58

$AIC
[1] 1.097494
$AICc
[1] 1.108519
$BIC
[1] 0.1469684

Finally, to obtain and plot the forecasts, you can use the following R code:

1 x.fore = predict(x.fit, n.ahead=10)

2 U = x.fore$pred + 2*x.fore$se # x.fore$pred holds predicted values

3 L = x.fore$pred - 2*x.fore$se # x.fore$se holds stnd errors

4 miny = min(x,L); maxy = max(x,U)

5 ts.plot(x, x.fore$pred, col=1:2, ylim=c(miny, maxy))

6 lines(U, col="blue", lty="dashed")

7 lines(L, col="blue", lty="dashed")

Using the script sarima.for, you can accomplish the same task in one line.

1 sarima.for(x, 10, 1, 0, 1)

Example 3.46 on page 159 uses this script.

We close this appendix with a quick spectral analysis. This material is covered in detail in Chapter 4, so we will not discuss this example in much detail here. We will simulate an AR(2) and then estimate the spectrum via nonparametric and parametric methods. No graphics are shown, but we have confidence that you are proficient enough in R to display them yourself.

1 x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8) # some data

2 (u = polyroot(c(1,-1,.9))) # x is AR(2) w/complex roots

[1] 0.5555556+0.8958064i 0.5555556-0.8958064i

3 Arg(u[1])/(2*pi) # dominant frequency around .16

[1] 0.1616497

4 par(mfcol=c(4,1))

5 plot.ts(x)

6 spec.pgram(x,spans=c(3,3),log="no") # nonparametric spectral estimate

7 spec.ar(x, log="no") # parametric spectral estimate

8 spec.arma(ar=c(1,-.9), log="no") # true spectral density

The script spec.arma is included in tsa3.rda. Also, see spectrum as an alternative to spec.pgram. Finally, note that R tapers and logs by default, so if you simply want the periodogram of a series, the command is spec.pgram(x, taper=0, fast=FALSE, detrend=FALSE, log="no"). If you just asked for spec.pgram(x), you would not get the raw periodogram because the data are detrended, possibly padded, and tapered, even though the title of the resulting graphic would say Raw Periodogram. An easier way to get a raw periodogram is:

9 per = abs(fft(x))^2/length(x)
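To look at it, the raw periodogram can be plotted against the Fourier frequencies; a minimal sketch, continuing with the x simulated above (per[1] corresponds to frequency zero, so it is skipped):

n    = length(x)
freq = (1:(n/2))/n   # Fourier frequencies up to the folding frequency 1/2
plot(freq, per[2:(n/2 + 1)], type="h", xlab="frequency", ylab="periodogram")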

This final example points out the importance of knowing the defaults for the R scripts you use.


References

Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Stat.Math., 21, 243-247.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Int. Symp. Inform. Theory, 267-281. B.N. Petrov and F. Csake, eds. Budapest: Akademia Kiado.

Akaike, H. (1974). A new look at statistical model identification. IEEE Trans. Au-tomat. Contr., AC-19, 716-723.

Alagon, J. (1989). Spectral discrimination for two groups of time series. J. TimeSeries Anal., 10, 203-214.

Alspach, D.L. and H.W. Sorensen (1972). Nonlinear Bayesian estimation using Gaus-sian sum approximations. IEEE Trans. Automat. Contr., AC-17, 439-447.

Anderson, B.D.O. and J.B. Moore (1979). Optimal Filtering. Englewood Cliffs, NJ:Prentice-Hall.

Anderson, T.W. (1978). Estimation for autoregressive moving average models in thetime and frequency domain. Ann. Stat., 5, 842-865.

Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis, 2nded. New York: Wiley.

Ansley, C.F. and P. Newbold (1980). Finite sample properties of estimators forautoregressive moving average processes. J. Econ., 13, 159-183.

Ansley, C.F. and R. Kohn (1982). A geometrical derivation of the fixed intervalsmoothing algorithm. Biometrika, 69 , 486-487.

Antognini, J.F., M.H. Buonocore, E.A. Disbrow, and E. Carstens (1997). Isofluraneanesthesia blunts cerebral responses to noxious and innocuous stimuli: a fMRIstudy. Life Sci., 61, PL349-PL354.

Bandettini, A., A. Jesmanowicz, E.C. Wong, and J.S. Hyde (1993). Processingstrategies for time-course data sets in functional MRI of the human brain. Mag-netic Res. Med., 30, 161-173.

Bar-Shalom, Y. (1978). Tracking methods in a multi-target environment. IEEETrans. Automat. Contr., AC-23, 618-626.


Bar-Shalom, Y. and E. Tse (1975). Tracking in a cluttered environment with probabilistic data association. Automatica, 11, 451-460.

Bazza, M., R.H. Shumway, and D.R. Nielsen (1988). Two-dimensional spectral anal-ysis of soil surface temperatures. Hilgardia, 56, 1-28.

Bedrick, E.J. and C.-L. Tsai (1994). Model selection for multivariate regression insmall samples. Biometrics, 50, 226-231.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B, 57, 289-300.

Beran, J. (1994). Statistics for Long Memory Processes. New York: Chapman andHall.

Berk, K.N. (1974). Consistent autoregressive spectral estimates. Ann. Stat., 2, 489-502.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems(with discussion). J. R. Stat. Soc. B, 36, 192-236.

Bhat, R.R. (1985). Modern Probability Theory, 2nd ed. New York: Wiley.

Bhattacharya, A. (1943). On a measure of divergence between two statistical popu-lations. Bull. Calcutta Math. Soc., 35, 99-109.

Billingsley, P. (1999). Convergence of Probability Measures, (2nd edition). New York:Wiley.

Blackman, R.B. and J.W. Tukey (1959). The Measurement of Power Spectra fromthe Point of View of Communications Engineering. New York: Dover.

Blight, B.J.N. (1974). Recursive solutions for the estimation of a stochastic param-eter J. Am. Stat. Assoc., 69, 477-481

Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. New York:Wiley.

Bloomfield, P. (2000). Fourier Analysis of Time Series: An Introduction, 2nd ed.New York: Wiley.

Bloomfield, P. and J.M. Davis (1994). Orthogonal rotation of complex principalcomponents. Int. J. Climatol., 14, 759-775.

Bogert, B.P., M.J.R. Healy, and J.W. Tukey (1962). The Quefrency Analysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking. In Proc. of the Symposium on Time Series Analysis, pp. 209-243, Brown University, Providence, USA.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. J.Econ., 31, 307- 327.

Box, G.E.P. and D.A. Pierce (1970). Distribution of residual autocorrelations in autoregressive integrated moving average time series models. J. Am. Stat. Assoc., 65, 1509-1526.

Box, G.E.P. and G.M. Jenkins (1970). Time Series Analysis, Forecasting, and Con-trol. Oakland, CA: Holden-Day.

Box, G.E.P. and G.C. Tiao (1973). Bayesian Inference in Statistical Analysis. NewYork: Wiley.


Box, G.E.P., G.M. Jenkins and G.C. Reinsel (1994). Time Series Analysis, Fore-casting, and Control, 3rd ed. Englewood Cliffs, NJ: Prentice Hall.

Breiman, L. and J. Friedman (1985). Estimating optimal transformations for multi-ple regression and correlation (with discussion). J. Am. Stat. Assoc., 80, 580-619.

Brillinger, D.R. (1973). The analysis of time series collected in an experimentaldesign. In Multivariate Analysis-III., pp. 241-256. P.R. Krishnaiah ed. New York:Academic Press.

Brillinger, D.R. (1975). Time Series: Data Analysis and Theory. New York: Holt,Rinehart & Winston Inc.

Brillinger, D.R. (1980). Analysis of variance and problems under time series models.In Handbook of Statistics, Vol I, pp. 237-278. P.R. Krishnaiah and D.R. Brillinger,eds. Amsterdam: North Holland.

Brillinger, D.R. (1981, 2001). Time Series: Data Analysis and Theory, 2nd ed. SanFrancisco: Holden-Day. Republished in 2001 by the Society for Industrial andApplied Mathematics, Philadelphia.

Brockwell, P.J. and R.A. Davis (1991). Time Series: Theory and Methods, 2nd ed.New York: Springer-Verlag.

Bruce, A. and H-Y. Gao (1996). Applied Wavelet Analysis with S-PLUS. New York:Springer-Verlag.

Caines, P.E. (1988). Linear Stochastic Systems. New York: Wiley.

Carlin, B.P., N.G. Polson, and D.S. Stoffer (1992). A Monte Carlo approach tononnormal and nonlinear state-space modeling. J. Am. Stat. Assoc., 87, 493-500.

Carter, C. K. and R. Kohn (1994). On Gibbs sampling for state space models.Biometrika, 81, 541-553.

Chan, N.H. (2002). Time Series: Applications to Finance. New York: Wiley.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesisbased on the sum of the observations. Ann. Math. Stat., 25, 573-578.

Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatter-plots. J. Am. Stat. Assoc., 74, 829-836.

Cochrane, D. and G.H. Orcutt (1949). Applications of least squares regression torelationships containing autocorrelated errors. J. Am. Stat. Assoc., 44, 32-61.

Cooley, J.W. and J.W. Tukey (1965). An algorithm for the machine computation ofcomplex Fourier series. Math. Comput., 19, 297-301.

Cressie, N.A.C. (1993). Statistics for Spatial Data. New York: Wiley.

Dahlhaus, R. (1989). Efficient parameter estimation for self-similar processes. Ann.Stat., 17, 1749-1766.

Dargahi-Noubary, G.R. and P.J. Laycock (1981). Spectral ratio discriminants andinformation theory. J. Time Series Anal., 16, 201-219.

Danielson, J. (1994). Stochastic volatility in asset prices: Estimation with simulatedmaximum likelihood. J. Econometrics, 61, 375-400.

Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia: CBMS-NSF RegionalConference Series in Applied Mathematics.


Davies, N., C.M. Triggs, and P. Newbold (1977). Significance levels of the Box-Pierceportmanteau statistic in finite samples. Biometrika, 64, 517-522.

Dent, W. and A.-S. Min. (1978). A Monte Carlo study of autoregressive-integrated-moving average processes. J. Econ., 7, 23-55.

Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from in-complete data via the EM algorithm. J. R. Stat. Soc. B, 39, 1-38.

Ding, Z., C.W.J. Granger, and R.F. Engle (1993). A long memory property of stockmarket returns and a new model. J. Empirical Finance, 1, 83-106.

Donoho, D.L. and I.M. Johnstone (1994). Ideal spatial adaptation by wavelet shrink-age. Biometrika, 81, 425-455.

Donoho, D.L. and I.M. Johnstone (1995). Adapting to unknown smoothness viawavelet shrinkage. J. of Am. Stat. Assoc., 90, 1200-1224.

Durbin, J. (1960). Estimation of parameters in time series regression models. J. R.Stat. Soc. B, 22, 139-153.

Durbin, J. and S.J. Koopman (2001). Time Series Analysis by State Space MethodsOxford: Oxford University Press.

Efron, B. and R. Tibshirani (1994). An Introduction to the Bootstrap. New York:Chapman and Hall.

Engle, R.F. (1982). Autoregressive conditional heteroscedasticity with estimates ofthe variance of United Kingdom inflation. Econometrica, 50, 987-1007.

Engle, R.F., D. Nelson, and T. Bollerslev (1994). ARCH Models. In Handbook ofEconometrics, Vol IV, pp. 2959-3038. R. Engle and D. McFadden, eds. Amster-dam: North Holland.

Fox, R. and M.S. Taqqu (1986). Large sample properties of parameter estimates forstrongly dependent stationary Gaussian time series. Ann. Stat., 14, 517-532.

Friedman, J.H. (1984). A Variable Span Smoother. Tech. Rep. No. 5, Lab. for Com-putational Statistics, Dept. Statistics, Stanford Univ., California.

Friedman, J.H. and W. Stuetzle. (1981). Projection pursuit regression. J. Am. Stat.Assoc., 76, 817-823.

Fruhwirth-Schnatter, S. (1994). Data Augmentation and Dynamic Linear Models.J. Time Series Anal., 15, 183–202.

Fuller, W.A. (1976). Introduction to Statistical Time Series. New York: Wiley.

Fuller, W.A. (1996). Introduction to Statistical Time Series, 2nd ed. New York:Wiley.

Gelfand, A.E. and A.F.M. Smith (1990). Sampling-based approaches to calculatingmarginal densities. J. Am. Stat. Assoc., 85, 398-409.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (1995). Bayesian Data Analysis.London: Chapman and Hall.

Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, andthe Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell.,6, 721-741.

Geweke, J.F. (1977). The dynamic factor analysis of economic time series models.In Latent Variables in Socio-Economic Models, pp 365-383. D. Aigner and A.Goldberger, eds. Amsterdam: North Holland.


Geweke, J.F. and K.J. Singleton (1981). Latent variable models for time series: Afrequency domain approach with an application to the Permanent Income Hy-pothesis. J. Econ., 17, 287-304.

Geweke, J.F. and S. Porter-Hudak (1983). The estimation and application of long-memory time series models. J. Time Series Anal., 4, 221-238.

Gilks, W.R., S. Richardson, and D.J. Spiegelhalter (eds.) (1996). Markov ChainMonte Carlo in Practice. London: Chapman and Hall.

Giri, N. (1965). On complex analogues of T 2 and R2 tests. Ann. Math. Stat., 36,664-670.

Goldfeld, S.M. and R.E. Quandt (1973). A Markov model for switching regressions.J. Econ., 1, 3-16.

Goodman, N.R. (1963). Statistical analysis based on a certain multivariate complexGaussian distribution. Ann. Math. Stat., 34, 152-177.

Gordon, K. and A.F.M. Smith (1988). Modeling and monitoring discontinuouschanges in time series. In Bayesian Analysis of Time Series and Dynamic Models,359-392.

Gordon, K. and A.F.M. Smith (1990). Modeling and monitoring biomedical timeseries. J. Am. Stat. Assoc., 85, 328-337.

Gourieroux, C. (1997). ARCH Models and Financial Applications. New York:Springer-Verlag.

Granger, C.W. and R. Joyeux (1980). An introduction to long-memory time seriesmodels and fractional differencing. J. Time Series Anal., 1, 15-29.

Grenander, U. (1951). On empirical spectral analysis of stochastic processes. Arkivfor Mathematik, 1, 503-531.

Grenander, U. and M. Rosenblatt (1957). Statistical Analysis of Stationary TimeSeries. New York: Wiley.

Grether, D.M. and M. Nerlove (1970). Some properties of optimal seasonal adjust-ment. Econometrica, 38, 682-703.

Gupta, N.K. and R.K. Mehra (1974). Computational aspects of maximum likeli-hood estimation and reduction in sensitivity function calculations. IEEE Trans.Automat. Contr., AC-19, 774-783.

Hamilton, J.D. (1989). A new approach to the economic analysis of nonstationarytime series and the business cycle. Econometrica, 57, 357-384.

Hannan, E.J. (1970). Multiple Time Series. New York: Wiley.

Hannan, E. J. and B. G. Quinn (1979). The determination of the order of an au-toregression. J. Royal Statistical Society, B, 41, 190-195.

Hannan, E.J. and M. Deistler (1988). The Statistical Theory of Linear Systems. NewYork: Wiley.

Hansen, J., M. Sato, R. Ruedy, K. Lo, D.W. Lea, and M. Medina-Elizade (2006).Global temperature change. Proc. Natl. Acad. Sci., 103, 14288-14293.

Harrison, P.J. and C.F. Stevens (1976). Bayesian forecasting (with discussion). J.R. Stat. Soc. B, 38, 205-247.

Harvey, A.C. and P.H.J. Todd (1983). Forecasting economic time series with struc-tural and Box-Jenkins models: A case study. J. Bus. Econ. Stat., 1, 299-307.


Harvey, A.C. and R.G. Pierse (1984). Estimating missing observations in economictime series. J. Am. Stat. Assoc., 79, 125-131.

Harvey, A.C. (1991). Forecasting, Structural Time Series Models and the KalmanFilter. Cambridge: Cambridge University Press.

Harvey, A.C. (1993). Time Series Models. Cambridge, MA: MIT Press.

Harvey A.C., E. Ruiz and N. Shephard (1994). Multivariate stochastic volatilitymodels. Rev. Economic Studies, 61, 247-264.

Haslett, J. and A.E. Raftery (1989). Space-time modelling with long-memory dependence: Assessing Ireland's wind power resource (C/R: 89V38 p21-50). Applied Statistics, 38, 1-21.

Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains andtheir applications. Biometrika, 57, 97-109.

Hosking, J.R.M. (1981). Fractional differencing. Biometrika, 68, 165-176.

Hurst, H. (1951). Long term storage capacity of reservoirs. Trans. Am. Soc. CivilEng., 116, 778-808.

Hurvich, C.M. and S. Zeger (1987). Frequency domain bootstrap methods for timeseries. Tech. Report 87-115, Department of Statistics and Operations Research,Stern School of Business, New York University.

Hurvich, C.M and C.-L. Tsai (1989). Regression and time series model selection insmall samples. Biometrika, 76, 297-307.

Hurvich, C.M. and K.I. Beltrao (1993). Asymptotics for the low-frequency ordinates of the periodogram for a long-memory time series. J. Time Series Anal., 14, 455-472.

Hurvich, C.M., R.S. Deo and J. Brodsky (1998). The mean squared error of Gewekeand Porter-Hudak’s estimator of the memory parameter of a long-memory timeseries. J. Time Series Anal., 19, 19-46.

Hurvich, C.M. and R.S. Deo (1999). Plug-in selection of the number of frequenciesin regression estimates of the memory parameter of a long-memory time series.J.Time Series Anal., 20 , 331-341.

Jacquier, E., N.G. Polson, and P.E. Rossi (1994). Bayesian analysis of stochasticvolatility models. J. Bus. Econ. Stat., 12, 371-417.

Jazwinski, A.H. (1970). Stochastic Processes and Filtering Theory. New York: Aca-demic Press.

Jenkins, G.M. and D.G. Watts. (1968). Spectral Analysis and Its Applications. SanFrancisco: Holden-Day.

Johnson, R.A. and D.W. Wichern (1992). Applied Multivariate Statistical Analysis,3rd ed.. Englewood Cliffs, NJ: Prentice-Hall.

Jones, P.D. (1994). Hemispheric surface air temperature variations: A reanalysis andan update to 1993. J. Clim., 7, 1794-1802.

Jones, R.H. (1980). Maximum likelihood fitting of ARMA models to time series withmissing observations. Technometrics, 22, 389-395.

Jones, R.H. (1984). Fitting multivariate models to unequally spaced data. In TimeSeries Analysis of Irregularly Observed Data, pp. 158-188. E. Parzen, ed. LectureNotes in Statistics, 25, New York: Springer-Verlag.


Jones, R.H. (1993). Longitudinal Data With Serial Correlation : A State-Space Ap-proach. London: Chapman and Hall.

Journel, A.G. and C.H. Huijbregts (1978). Mining Geostatistics. New York: Aca-demic Press.

Juang, B.H. and L.R. Rabiner (1985). Mixture autoregressive hidden Markov modelsfor speech signals, IEEE Trans. Acoust., Speech, Signal Process., ASSP-33, 1404-1413.

Kakizawa, Y., R. H. Shumway, and M. Taniguchi (1998). Discrimination and clus-tering for multivariate time series. J. Am. Stat. Assoc., 93, 328-340.

Kalman, R.E. (1960). A new approach to linear filtering and prediction problems.Trans ASME J. Basic Eng., 82, 35-45.

Kalman, R.E. and R.S. Bucy (1961). New results in filtering and prediction theory.Trans. ASME J. Basic Eng., 83, 95-108.

Kaufman, L. and P.J. Rousseeuw (1990). Finding Groups in Data: An Introductionto Cluster Analysis. New York: Wiley.

Kay, S.M. (1988). Modern Spectral Analysis: Theory and Applications. EnglewoodCliffs, NJ: Prentice-Hall.

Kazakos, D. and P. Papantoni-Kazakos (1980). Spectral distance measuring betweenGaussian processes. IEEE Trans. Automat. Contr., AC-25, 950-959.

Khatri, C.G. (1965). Classical statistical analysis based on a certain multivariatecomplex Gaussian distribution. Ann. Math. Stat., 36, 115-119.

Kim S., N. Shephard and S. Chib (1998). Stochastic volatility: likelihood inferenceand comparison with ARCH models. Rev. Economic Studies, 65, p.361-393.

Kitagawa, G. and W. Gersch (1984). A smoothness priors modeling of time serieswith trend and seasonality. J. Am. Stat. Assoc., 79, 378-389.

Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstationary time se-ries (with discussion). J. Am. Stat. Assoc., 82, 1032-1041, (C/R: p1041-1063; C/R:V83 p1231).

Kitagawa, G. and W. Gersch (1996). Smoothness Priors Analysis of Time Series.New York: Springer-Verlag.

Kolmogorov, A.N. (1941). Interpolation und extrapolation von stationaren zufalligenFolgen. Bull. Acad. Sci. U.R.S.S., 5, 3-14.

Krishnaiah, P.R., J.C. Lee, and T.C. Chang (1976). The distribution of likelihoodratio statistics for tests of certain covariance structures of complex multivariatenormal populations. Biometrika, 63, 543-549.

Kullback, S. and R.A. Leibler (1951). On information and sufficiency. Ann. Math.Stat., 22, 79-86.

Kullback, S. (1958). Information Theory and Statistics. Gloucester, MA: PeterSmith.

Lachenbruch, P.A. and M.R. Mickey (1968). Estimation of error rates in discriminant analysis. Technometrics, 10, 1-11.

Lam, P.S. (1990). The Hamilton model with a general autoregressive component: Es-timation and comparison with other models of economic time series. J. MonetaryEcon., 26, 409-432.


Lay, T. (1997). Research required to support comprehensive nuclear test ban treatymonitoring. National Research Council Report, National Academy Press, 2101Constitution Ave., Washington, DC 20055.

Levinson, N. (1947). The Wiener (root mean square) error criterion in filter designand prediction. J. Math. Phys., 25, 262-278.

Lindgren, G. (1978). Markov regime models for mixed distributions and switchingregressions. Scand. J. Stat., 5, 81-91.

Ljung, G.M. and G.E.P. Box (1978). On a measure of lack of fit in time series models.Biometrika, 65, 297-303.

Lutkepohl, H. (1985). Comparison of criteria for estimating the order of a vectorautoregressive process. J. Time Series Anal., 6, 35-52.

Lutkepohl, H. (1993). Introduction to Multiple Time Series Analysis, 2nd ed. Berlin:Springer-Verlag.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariateobservations. Proceedings of 5-th Berkeley Symposium on Mathematical Statisticsand Probability. Berkeley: University of California Press, 1:281-297

Mallows, C.L. (1973). Some comments on Cp. Technometrics, 15, 661-675.

McBratney, A.B. and R. Webster (1981). Detection of ridge and furrow pattern byspectral analysis of crop yield. Int. Stat. Rev., 49, 45-52.

McCulloch, R.E. and R.S. Tsay (1993). Bayesian inference and prediction for meanand variance shifts in autoregressive time series. J. Am. Stat. Assoc., 88, 968-978.

McDougall, A. J., D.S. Stoffer and D.E. Tyler (1997). Optimal transformations andthe spectral envelope for real-valued time series. J. Stat. Plan. Infer., 57, 195-214.

McLeod A.I. (1978). On the distribution of residual autocorrelations in Box-Jenkinsmodels. J. R. Stat. Soc. B, 40, 296-302.

McLeod, A.I. and K.W. Hipel (1978). Preservation of the rescaled adjusted range, I. A reassessment of the Hurst phenomenon. Water Resour. Res., 14, 491-508.

McQuarrie, A.D.R. and C-L. Tsai (1998). Regression and Time Series Model Selec-tion, Singapore: World Scientific.

Meinhold, R.J. and N.D. Singpurwalla (1983). Understanding the Kalman filter.Am. Stat., 37, 123-127.

Meinhold, R.J. and N.D. Singpurwalla (1989). Robustification of Kalman filter mod-els. J. Am. Stat. Assoc., 84, 479-486.

Meng X.L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance–covariance matrices: The SEM algorithm. J. Am. Stat. Assoc., 86, 899-909.

Metropolis N., A.W. Rosenbluth, M.N. Rosenbluth, A. H. Teller, and E. Teller(1953). Equations of state calculations by fast computing machines. J. Chem.Phys., 21, 1087-1091.

Mickens, R.E. (1990). Difference Equations: Theory and Applications (2nd ed). New York: Springer.

Nason, G.P. (2008). Wavelet Methods in Statistics with R. New York: Springer.

Newbold, P. and T. Bos (1985). Stochastic Parameter Regression Models. BeverlyHills: Sage.


Ogawa, S., T.M. Lee, A. Nayak and P. Glynn (1990). Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magn. Reson. Med., 14, 68-78.

Palma, W. (2007). Long-Memory Time Series: Theory and Methods. New York:Wiley.

Palma, W. and N.H. Chan (1997). Estimation and forecasting of long-memory timeseries with missing values. J. Forecast., 16, 395-410.

Paparoditis, E. and Politis, D.N. (1999). The local bootstrap for periodogram statis-tics. J. Time Series Anal., 20, 193-222.

Parzen, E. (1962). On estimation of a probability density and mode. Ann. Math.Stat., 35, 1065-1076.

Parzen, E. (1983). Autoregressive spectral estimation. In Time Series in the Fre-quency Domain, Handbook of Statistics, Vol. 3, pp. 211-243. D.R. Brillinger andP.R. Krishnaiah eds. Amsterdam: North Holland.

Pawitan, Y. and R.H. Shumway (1989). Spectral estimation and deconvolution fora linear time series model. J. Time Series Anal., 10, 115-129.

Pena, D. and I. Guttman (1988). A Bayesian approach to robustifying the Kalmanfilter. In Bayesian Analysis of Time Series and Dynamic Linear Models, pp. 227-254. J.C. Spall, ed. New York: Marcel Dekker.

Percival, D.B. and A.T. Walden (1993). Spectral Analysis for Physical Applications:Multitaper and Conventional Univariate Techniques Cambridge: Cambridge Uni-versity Press.

Percival, D.B. and A.T. Walden (2000). Wavelet Methods for Time Series Analysis.Cambridge: Cambridge University Press.

Phillips, P.C.B. (1987). Time series regression with a unit root. Econometrica, 55,227-301.

Phillips, P.C.B. and P. Perron (1988). Testing for unit roots in time series regression.Biometrika, 75, 335-346.

Pinsker, M.S. (1964). Information and Information Stability of Random Variablesand Processes, San Francisco: Holden Day.

Pole, P.J. and M. West (1988). Nonnormal and nonlinear dynamic Bayesian model-ing. In Bayesian Analysis of Time Series and Dynamic Linear Models, pp. 167-198. J.C. Spall, ed. New York: Marcel Dekker.

Press, W.H., S.A. Teukolsky, W. T. Vetterling, and B.P. Flannery (1993). NumericalRecipes in C: The Art of Scientific Computing, 2nd ed. Cambridge: CambridgeUniversity Press.

Priestley, M.B., T. Subba-Rao and H. Tong (1974). Applications of principal compo-nents analysis and factor analysis in the identification of multi-variable systems.IEEE Trans. Automat. Contr., AC-19, 730-734.

Priestley, M.B. and T. Subba-Rao (1975). The estimation of factor scores andKalman filtering for discrete parameter stationary processes. Int. J. Contr., 21,971-975.

Priestley, M.B. (1981). Spectral Analysis and Time Series. Vol. 1: Univariate Series;Vol 2: Multivariate Series, Prediction and Control. New York: Academic Press.


Priestley, M.B. (1988). Nonlinear and Nonstationary Time Series Analysis. London:Academic Press.

Quandt, R.E. (1972). A new approach to estimating switching regressions. J. Am.Stat. Assoc., 67, 306-310.

Rabiner, L.R. and B.H. Juang (1986). An introduction to hidden Markov models,IEEE Acoust., Speech, Signal Process., ASSP-34, 4-16.

Rao, C.R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley.

Rauch, H.E., F. Tung, and C.T. Striebel (1965). Maximum likelihood estimation oflinear dynamic systems. J. AIAA, 3, 1445-1450.

Reinsel, G.C. (1997). Elements of Multivariate Time Series Analysis, 2nd ed. NewYork: Springer-Verlag.

Renyi, A. (1961). On measures of entropy and information. In Proceedings of 4thBerkeley Symp. Math. Stat. and Probability, pp. 547-561, Berkeley: Univ. of Cal-ifornia Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465-471.

Robinson, P.M. (1995). Gaussian semiparametric estimation of long range depen-dence. Ann. Stat., 23, 1630-1661.

Robinson, P.M. (2003). Time Series With Long Memory. Oxford: Oxford UniversityPress.

Rosenblatt, M. (1956a). A central limit theorem and a strong mixing condition.Proc. Nat. Acad. Sci., 42, 43-47.

Rosenblatt, M. (1956b). Remarks on some nonparametric estimates of a density function. Ann. Math. Stat., 27, 642-669.

Royston, P. (1982). An extension of Shapiro and Wilk’s W test for normality tolarge samples. Applied Statistics, 31, 115-124.

Said, E. and D.A. Dickey (1984). Testing for unit roots in autoregressive moving average models of unknown order. Biometrika, 71, 599-607.

Sandmann, G. and S.J. Koopman (1998). Estimation of stochastic volatility modelsvia Monte Carlo maximum likelihood. J. Econometrics, 87 , 271-301.

Sargan, J.D. (1964). Wages and prices in the United Kingdom: A study in econo-metric methodology. In Econometric Analysis for National Economic Planning,eds. P. E. Hart, G. Mills and J. K. Whitaker. London: Butterworths. reprinted inQuantitative Economics and Econometric Analysis, pp. 275-314, eds. K. F. Wallisand D. F. Hendry (1984). Oxford: Basil Blackwell.

Scheffe, H. (1959). The Analysis of Variance. New York: Wiley.

Schuster, A. (1898). On the investigation of hidden periodicities with application toa supposed 26 day period of meteorological phenomena. Terrestrial Magnetism,III, 11-41.

Schuster, A. (1906). On the periodicities of sunspots. Phil. Trans. R. Soc., Ser. A,206, 69-100.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat., 6, 461-464.

Schweppe, F.C. (1965). Evaluation of likelihood functions for Gaussian signals. IEEETrans. Inform. Theory, IT-4, 294-305.


Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility. In TimeSeries Models in Econometrics, Finance and Other Fields , pp 1-100. D.R. Cox,D.V. Hinkley, and O.E. Barndorff-Nielson eds. London: Chapman and Hall.

Shumway, R.H. and W.C. Dean (1968). Best linear unbiased estimation for multi-variate stationary processes. Technometrics, 10, 523-534.

Shumway, R.H. (1970). Applied regression and analysis of variance for stationarytime series. J. Am. Stat. Assoc., 65, 1527-1546.

Shumway, R.H. (1971). On detecting a signal in N stationarily correlated noiseseries. Technometrics, 10, 523-534.

Shumway, R.H. and A.N. Unger (1974). Linear discriminant functions for stationarytime series. J. Am. Stat. Assoc., 69, 948-956.

Shumway, R.H. (1982). Discriminant analysis for time series. In Classification, Pat-tern Recognition and Reduction of Dimensionality, Handbook of Statistics Vol. 2,pp. 1-46. P.R. Krishnaiah and L.N. Kanal, eds. Amsterdam: North Holland.

Shumway, R.H. and D.S. Stoffer (1982). An approach to time series smoothing andforecasting using the EM algorithm. J. Time Series Anal., 3, 253-264.

Shumway, R.H. (1983). Replicated time series regression: An approach to signal es-timation and detection. In Time Series in the Frequency Domain, Handbook ofStatistics Vol. 3, pp. 383-408. D.R. Brillinger and P.R. Krishnaiah, eds. Amster-dam: North Holland.

Shumway, R.H. (1988). Applied Statistical Time Series Analysis. Englewood Cliffs,NJ: Prentice-Hall.

Shumway, R.H., R.S. Azari, and Y. Pawitan (1988). Modeling mortality fluctuationsin Los Angeles as functions of pollution and weather effects. Environ. Res., 45,224-241.

Shumway, R.H. and D.S. Stoffer (1991). Dynamic linear models with switching. J.Am. Stat. Assoc., 86, 763-769, (Correction: V87 p. 913).

Shumway, R.H. and K.L. Verosub (1992). State space modeling of paleoclimatic timeseries. In Pro. 5th Int. Meeting Stat. Climatol.. Toronto, pp. 22-26, June, 1992.

Shumway, R.H., S.E. Kim and R.R. Blandford (1999). Nonlinear estimation for timeseries observed on arrays. Chapter 7, S. Ghosh, ed. Asymptotics, Nonparametricsand Time Series, pp. 227-258. New York: Marcel Dekker.

Small, C.G. and D.L. McLeish (1994). Hilbert Space Methods in Probability andStatistical Inference. New York: Wiley.

Smith, A.F.M. and M. West (1983). Monitoring renal transplants: An applicationof the multiprocess Kalman filter. Biometrics, 39, 867-878.

Spliid, H. (1983). A fast estimation method for the vector autoregressive movingaverage model with exogenous variables. J. Am. Stat. Assoc., 78, 843-849.

Stoffer, D.S. (1982). Estimation of Parameters in a Linear Dynamic System withMissing Observations. Ph.D. Dissertation. Univ. California, Davis.

Stoffer, D.S., M. Scher, G. Richardson, N. Day, and P. Coble (1988). A Walsh-Fourieranalysis of the effects of moderate maternal alcohol consumption on neonatalsleep-state cycling. J. Am. Stat. Assoc., 83, 954-963.


Stoffer, D.S. and K.D. Wall (1991). Bootstrapping state space models: Gaussianmaximum likelihood estimation and the Kalman filter. J. Am. Stat. Assoc., 86,1024-1033.

Stoffer, D.S., D.E. Tyler, and A.J. McDougall (1993). Spectral analysis for categor-ical time series: Scaling and the spectral envelope. Biometrika, 80, 611-622.

Stoffer, D.S. (1999). Detecting common signals in multiple time series using thespectral envelope. J. Am. Stat. Assoc., 94, 1341-1356.

Stoffer, D.S. and K.D. Wall (2004). Resampling in State Space Models. In State Spaceand Unobserved Component Models Theory and Applications, Chapter 9, pp. 227-258. Andrew Harvey, Siem Jan Koopman, and Neil Shephard, eds. Cambridge:Cambridge University Press.

Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterionand the finite corrections, Commun. Statist, A, Theory Methods, 7, 13-26.

Taniguchi, M., M.L. Puri, and M. Kondo (1994). Nonparametric approach for non-Gaussian vector stationary processes. J. Mult. Anal., 56, 259-283.

Tanner, M. and W.H. Wong (1987). The calculation of posterior distributions bydata augmentation (with discussion). J. Am. Stat. Assoc., 82, 528-554.

Taylor, S. J. (1982). Financial returns modelled by the product of two stochasticprocesses – A study of daily sugar prices, 1961-79. In Anderson, O. D., editor,Time Series Analysis: Theory and Practice, Volume 1, pages 203–226. New York:Elsevier/North-Holland.

Tiao, G.C. and R.S. Tsay (1989). Model specification in multivariate time series(with discussion). J. Roy. Statist. Soc. B, 51, 157-213.

Tiao, G. C. and R.S. Tsay (1994). Some advances in nonlinear and adaptive modelingin time series analysis. J. Forecast., 13, 109-131.

Tiao, G.C., R.S. Tsay and T .Wang (1993). Usefulness of linear transformations inmultivariate time series analysis. Empir. Econ., 18, 567-593.

Tierney, L. (1994). Markov chains for exploring posterior distributions (with discus-sion). Ann. Stat., 22, 1701-1728.

Tong, H. (1983). Threshold Models in Nonlinear Time Series Analysis. SpringerLecture Notes in Statistics, 21. New York: Springer-Verlag.

Tong, H. (1990). Nonlinear Time Series: A Dynamical System Approach. Oxford:Oxford Univ. Press.

Tsay, Ruey S. (2002). Analysis of Financial Time Series. New York: Wiley.

Venables, W.N. and B.D. Ripley (1994). Modern Applied Statistics with S-Plus. NewYork: Springer-Verlag.

Watson, G.S. (1966). Smooth regression analysis. Sankhya, 26, 359-378.

Weiss, A.A. (1984). ARMA models with ARCH errors. J. Time Series Anal., 5,129-143.

West, M. and J. Harrison (1997). Bayesian Forecasting and Dynamic Models 2nded. New York: Springer-Verlag.

Whittle, P. (1961). Gaussian estimation in stationary time series. Bull. Int. Stat.Inst., 33, 1-26.


Wiener, N. (1949). The Extrapolation, Interpolation and Smoothing of StationaryTime Series with Engineering Applications. New York: Wiley.

Wu, C.F. (1983). On the convergence properties of the EM algorithm. Ann. Stat.,11, 95-103.

Young, P.C. and D.J. Pedregal (1998). Macro-economic relativity: Governmentspending, private investment and unemployment in the USA. Centre for Researchon Environmental Systems and Statistics, Lancaster University, U.K.

Yule, G.U. (1927). On a method of investigating periodicities in disturbed serieswith special reference to Wolfer’s Sunspot Numbers. Phil. Trans. R. Soc. Lond.,A226, 267-298.


Index

ACF, 21, 24

large sample distribution, 29, 524

multidimensional, 36

of an AR(p), 103

of an AR(1), 86

of an AR(2), 99

of an ARMA(1,1), 104

of an MA(q), 102

sample, 29

AIC, 52, 153, 213

multivariate case, 303

AICc, 53, 153

multivariate case, 303

Aliasing, 11, 176

Amplitude, 175

of a filter, 226

Analysis of Power, see ANOPOW

ANOPOW, 417, 426, 427

designed experiments, 434

AR model, 13, 84

conditional sum of squares, 126

bootstrap, 137

conditional likelihood, 125

estimation

large sample distribution, 122, 534

likelihood, 125

maximum likelihood estimation, 124

missing data, 403

operator, 85

polynomial, 94

spectral density, 184

threshold, 290

unconditional sum of squares, 125

vector, see VAR

with observational noise, 323

ARCH model

ARCH(m), 285

ARCH(1), 281

estimation, 283

GARCH, 285, 378

ARFIMA model, 268, 272

ARIMA model, 141

fractionally integrated, 272

multiplicative seasonal models, 157

multivariate, 301

ARMA model, 92

ψ-weights, 100

conditional least squares, 127

pure seasonal models

behavior of ACF and PACF, 156

unconditional least squares, 127

backcasts, 120

behavior of ACF and PACF, 108

bootstrap, 360

causality of, 94

conditional least squares, 129

forecasts, 115

mean square prediction error, 116

based on infinite past, 115

prediction intervals, 118

truncated prediction, 117

Gauss–Newton, 130

in state-space form, 356

invertibilty of, 95

large sample distribution ofestimators, 133


likelihood, 126MLE, 127multiplicative seasonal model, 155pure seasonal model, 154unconditional least squares, 129vector, see VARMA model

ARMAX model, 311, 355bootstrap, 360in state-space form, 355

ARX model, 303Autocorrelation function, see ACFAutocovariance function, 19, 24, 86

multidimensional, 36random sum of sines and cosines, 176sample, 29

Autocovariance matrix, 33sample, 34

Autoregressive Integrated MovingAverage Model, see ARIMAmodel

Autoregressive models, see AR modelAutoregressive Moving Average Models,

see ARMA model

Backcasting, 119Backshift operator, 61Bandwidth, 197

equivalent, 212Bartlett kernel, 207Beam, 423Best linear predictor, see BLPBIC, 53, 153, 213

multivariate case, 303, 306BLP, 109m-step-ahead prediction, 113

mean square prediction error, 114one-step-ahead prediction, 110definition, 109one-step-ahead prediction

mean square prediction error, 111stationary processes, 109

Bone marrow transplant series, 320, 348Bonferroni inequality, 203Bootstrap, 136, 198, 212, 360

stochastic volatility, 382Bounded in probability Op, 510Brownian motion, 278

Cauchy sequence, 527

Cauchy–Schwarz inequality, 507, 527Causal, 88, 94, 531

conditions for an AR(2), 96vector model, 312

CCF, 21, 25large sample distribution, 30sample, 30

Central Limit Theorem, 515M-dependent, 516

Cepstral analysis, 259Characteristic function, 512Chernoff information, 460Cluster analysis, 465Coherence, 217

estimation, 219hypothesis test, 219, 558multiple, 414

Completeness of L2, 508Complex normal distribution, 554Complex roots, 100Conditional least squares, 127Convergence in distribution, 512

Basic Approximation Theorem, 513Convergence in probability, 510Convolution, 221Cosine transform

large sample distribution, 543of a vector process, 409properties, 191

Cospectrum, 216of a vector process, 410

Cramer–Wold device, 513Cross-correlation function, see CCFCross-covariance function, 21

sample, 30Cross-spectrum, 216Cycle, 175

Daniell kernel, 204, 205modified, 205

Deconvolution, 430Density function, 18Designed experiments, see ANOPOWDeterministic process, 537Detrending, 48DFT, 70

inverse, 187large sample distribution, 543multidimensional, 253


of a vector process, 409likelihood, 410

Differencing, 60, 61Discrete wavelet transform, see DWTDiscriminant analysis, 451DLM, 319, 354

Bayesian approach, 387bootstap, 360innovations form, 359maximum likelihood estimation

large sample distribution, 344via EM algorithm, 341, 347via Newton-Raphson, 336

MCMC methods, 390observation equation, 319state equation, 319steady-state, 344with switching, 365

EM algorithm, 372maximum likelihood estimation, 371

DNA series, 487, 492Durbin–Levinson algorithm, 112DWT, 235Dynamic Fourier analysis, 228

Earthquake series, 9, 229, 236, 242, 407,447, 454, 461, 466

EEG sleep data, 485EM algorithm, 340

complete data likelihood, 340DLM with missing observations, 347expectation step, 340maximization step, 341

Explosion series, 9, 229, 236, 242, 407,447, 454, 461, 466

Exponentially Weighted MovingAverages, 143

Factor analysis, 475EM algorithm, 476

Federal Reserve Board Indexproduction, 159unemployment, 159

Fejer kernel, 207, 211FFT, 70Filter, 62

amplitude, 226, 227band-pass, 251design, 250

high-pass, 224, 250linear, 221low-pass, 224, 250matrix, 228optimum, 248phase, 226, 227recursive, 251seasonal adjustment, 251spatial, 253time-invariant, 508

fMRI, see Functional magneticresonance imaging series

Folding frequency, 176, 178Fourier frequency, 70, 187Fractional difference, 62, 268

fractional noise, 268Frequency bands, 182, 196Frequency response function, 221

of a first difference filter, 223of a moving average filter, 223

Functional magnetic resonance imagingseries, 8, 406, 436, 439, 443, 472,478

Fundamental frequency, 70, 178, 187

Gibbs sampler, see MCMCGlacial varve series, 63, 131, 151, 270,

280Global temperature series, 4, 58, 62, 322Gradient vector, 337, 402Growth rate, 145, 280

Harmonics, 201Hessian matrix, 337, 402Hidden Markov model, 366, 369

estimation, 371Hilbert space, 527

closed span, 528conditional expectation, 530projection mapping, 528regression, 529

Homogeneous difference equationfirst order, 97general solution, 100second order, 97

solution, 98

Impulse response function, 221Influenza series, 290, 373


Infrasound series, 421, 424, 427, 431,432

Inner product space, 527Innovations, 149, 335

standardized, 149steady-state, 343

Innovations algorithm, 114Integrated models, 141, 143, 157

forecasting, 142Interest rate and inflation rate series,

361Invertible, 92

vector model, 312

J-divergence measure, 465Johnson & Johnson quarterly earnings

series, 4, 350Joint distribution function, 17

Kalman filter, 326correlated noise, 354innovations form, 359Riccati equation, 343stability, 343with missing observations, 346with switching, 368with time-varying parameters, 327

Kalman smoother, 330, 399as a smoothing spline, 400for the lag-one covariance, 334with missing observations, 346

Kullback-Leibler information, 79, 458

LA Pollution – Mortality Study, 53, 74,76, 294, 304, 306, 357

Lag, 20, 26Lagged regression model, 296Lake Shasta series, 405, 411, 417Lead, 26Leakage, 207

sidelobe, 207Least squares estimation, see LSELikelihood

AR(1) model, 125conditional, 125innovations form, 126, 335

Linear filter, see FilterLinear process, 27, 94Ljung–Box–Pierce statistic, 150

multivariate, 309Local level model, 328, 331Long memory, 62, 268

estimation, 269estimation of d, 274spectral density, 273

LSE, 49conditional sum of squares, 126Gauss–Newton, 129unconditional, 125

MA model, 12, 90autocovariance function, 20, 102Gauss–Newton, 130mean function, 18operator, 90polynomial, 94spectral density, 183

Markov chain Monte Carlo, see MCMCMaximum likelihood estimation, see

MLEMCMC, 388

nonlinear and non-Gaussianstate-space models, 392, 395

rejection sampling, 389Mean function, 18Mean square convergence, 507Method of moments estimators, see

Yule–WalkerMinimum mean square error predictor,

108Missing data, 347MLE

ARMA model, 127conditional likelihood, 125DLM, 336state-space model, 336via EM algorithm, 340via Newton–Raphson, 127, 336via scoring, 127

Moving average model, see MA model

New York Stock Exchange, 6, 380Newton–Raphson, 127Normal distribution

marginal density, 18multivariate, 28, 554

NYSE, see New York Stock Exchange

Order in probability op, 510


Orthogonality property, 528

PACF, 106of an MA(1), 107iterative solution, 113large sample results, 122of an AR(p), 106of an AR(1), 106of an MA(q), 107

Parameter redundancy, 93Partial autocorrelation function, see

PACFParzen window, 210Period, 175Periodogram, 70, 187

disribution, 192matrix, 458scaled, 69

Phase, 175of a filter, 226

Pitch period, 6Prediction equations, 110Prewhiten, 297Principal components, 468Projection Theorem, 528

Q-test, see Ljung–Box–Pierce statisticQuadspectrum, 216

of a vector process, 410

Random sum of sines and cosines, 176,539, 541

Random walk, 14, 18, 22, 142autocovariance function, 20

Recruitment series, 7, 32, 64, 108, 118,194, 198, 205, 220, 244, 297

RegressionANOVA table, 51autocorrelated errors, 293, 356

Cochrane-Orcutt procedure, 294for jointly stationary series, 410

ANOPOW table, 417Hilbert space, 529lagged, 242model, 48multiple correlation, 52multivariate, 301, 356normal equations, 49periodic, 72

polynomial, 72random coefficients, 429spectral domain, 410stochastic, 361, 429

ridge correction, 431with deterministic inputs, 420

Return, 6, 145, 280Riesz–Fisher Theorem, 508

Scaling, 485Scatterplot matrix, 56, 64Scatterplot smoothers, 72

kernel, 73lowess, 74, 76nearest neighbors, 74splines, 75, 76

Score vector, 337SIC, 53Signal plus noise, 15, 17, 247, 422

mean function, 19Signal-to-noise ratio, 16, 248Sinc kernel, 211Sine transform

large sample distribution, 543of a vector process, 409properties, 191

Smoothing splines, 75, 400Soil surface temperature series, 35, 37,

253Southern Oscillation Index, 7, 32, 64,

194, 198, 205, 208, 213, 220, 223,244, 249, 297

Spectral density, 181autoregression, 213estimation, 196

adjusted degrees of freedom, 197bandwidth stability, 203confidence interval, 197degrees of freedom, 197large sample distribution, 196nonparametric, 212parametric, 212resolution, 203

matrix, 218linear filter, 228

of a filtered series, 222of a moving average, 183of an AR(2), 184of white noise, 183


wavenumber, 252Spectral distribution function, 181Spectral envelope, 485

categorical time series, 489real-valued time series, 495

Spectral Representation Theorem, 180,181, 539, 541, 542

vector process, 218, 542Speech series, 5, 32State-space model

Bayesian approach, 328, 387general, 388linear, see DLMnon-Gaussian, 388, 395nonlinear, 388, 395

MCMC methods, 392Stationary

Gaussian series, 27jointly, 25, 26strictly, 22weakly, 23

Stochastic process, 11realization, 11

Stochastic regression, 361Stochastic trend, 141Stochastic volatility model, 378

bootstrap, 382estimation, 380

Structural component model, 350, 373

Taper, 206, 208cosine bell, 207

Taylor series expansion in probability,511

Tchebycheff inequality, 507Threshold autoregressive model, 290Time series, 11

categorical, 489complex-valued, 468multidimensional, 35, 252multivariate, 21, 33two-dimensional, 252

Transfer function model, 296Transformation

Box-Cox, 62via spectral envelope, 498

Triangle inequality, 527Tukey-Hanning window, 211

U.S. GNP series, 145, 150, 153, 283, 496U.S. macroeconomic series, 481U.S. population series, 152Unconditional least squares, 127Unit root tests, 277

Augmented Dickey-Fuller test, 279Dickey-Fuller test, 279Phillips-Perron test, 279

VAR model, 303, 305estimation

large sample distribution, 311operator, 311

Variogram, 38, 45VARMA model, 311

autocovariance function, 312estimation

Spliid algorithm, 314identifiability of, 314

Varve series, 274VMA model, 311

operator, 311Volatility, 6, 280

Wavelet analysis, 231waveshrink, 241

Wavenumber spectrum, 252estimation, 253

Weak law of large numbers, 510White noise, 12

autocovariance function, 19Gaussian, 12vector, 303

Whittle likelihood, 214, 458Wold Decomposition, 537

Yule–Walkerequations, 121

vector model, 309estimators, 121

AR(2), 122MA(1), 123

large sample results, 122