

Generalized Dynamic Semiparametric Factor Models for High Dimensional Nonstationary Time Series∗

Song Song†, Wolfgang K. Härdle‡, Ya'acov Ritov§

March 18, 2013

Abstract

High dimensional nonstationary time series, which reveal both complex trends and stochastic behavior, occur in many scientific fields, e.g. macroeconomics, finance and neuro-economics. To address them, we propose a generalized dynamic semiparametric factor model with a two-step estimation procedure. After choosing smoothed functional principal components as space functions (factor loadings), we extract various temporal trends by employing variable selection techniques for the time basis (common factors), and establish non-asymptotic statistical properties under the dependent scenario (β-mixing and m-dependent) with weakly cross-correlated error terms; these results are not built upon any specific forms of the time and space basis. At the second step, we obtain a detrended low dimensional stochastic process that exhibits the dynamics of the original high dimensional (stochastic) objects, and we further justify statistical inference based on it. An analysis of temperature dynamics in China, crucially required for pricing weather derivatives, is presented to illustrate the performance of the method, together with a simulation study designed to mimic it.

Keywords: semiparametric model, factor model, group Lasso, seasonality, spectral analysis, periodic, asymptotic inference, weather

AMS 2000 subject classification: 62G08, 62G20, 62M10

JEL classification: C13, C14, C32, D87, G10

∗Supported by Deutsche Forschungsgemeinschaft via SFB 649 "Ökonomisches Risiko", Humboldt-Universität zu Berlin. Ya'acov Ritov's research is supported by an ISF grant and a Humboldt Award.
†Humboldt-Universität zu Berlin and University of Texas, Austin. Email: [email protected]
‡Humboldt-Universität zu Berlin
§The Hebrew University of Jerusalem

1 Introduction

Over the past few decades, high dimensional data analysis has attracted increasing attention in various fields. On many occasions, we have a high dimensional vector of observations evolving in time (a very large interrelated time process), which is also possibly controlled by an exogenous covariate. For example, in macroeconomic forecasting, people use very large dimensional economic and financial time series, Stock and Watson (2005b); in meteorology and agricultural economics, one of the primary interests is to study the fluctuations of temperatures at different nearby locations, for a recent summary, see Gleick et al. (2010). Such an analysis is also essential for pricing weather derivatives and hedging weather risks in finance, Odening et al. (2008). In neuro-economics, one uses high dimensional functional magnetic resonance imaging (fMRI) data to analyze the brain's response to certain risk related stimuli as well as to identify its activation area, Worsley et al. (2002). In financial engineering, one studies the dynamics of the implied volatility surface for risk management, calibration and pricing purposes, Fengler et al. (2007). Other examples include mortality analysis, Lee and Carter (1992); bond portfolio risk management or derivative pricing, Nelson and Siegel (1987) and Diebold and Li (2006); limit order book dynamics, Hall and Hautsch (2006); yield curves, Bowsher and Meeks (2006); and so on.

Empirical studies in economics and finance often involve non-stationary quantities, e.g. the real consumer price index, individual consumption, exchange rates, real GDP, etc. For example, the large panel macroeconomic data set provided by Stock and Watson (2005a) contains some complex nonstationary behaviors, e.g. the normal seasonality, a large economic cycle and an upward trend representing economic growth. But some studies have produced counterintuitive and contradictory results; see Campbell and Yogo (2006), Cai et al. (2009), Xiao (2009), Wang and Phillips (2009a) and Wang and Phillips (2009b). This may partly be attributed to methods that do not capture nonstationarity or nonlinear structural relations. In fact, in the econometrics literature, the study of such non-stationary time series is dominated by linear, at most parametric, models, restricting non-stationarity to unit root or long-memory ARFIMA types of nonstationarity, and structural relations to linear or parametric types of cointegration models. General processes can be characterized by certain recurrence properties. These processes contain stationary, long-memory and unit-root type or nearly integrated processes as sub-classes, and are more general than the class of locally stationary processes. As pointed out in the recent econometric literature, when some covariates are non-stationary, conventional statistical

tests are invalid, although the predictive power in a nonparametric regression model can be improved if some covariates are non-stationary. While some asymptotic results for general nonparametric estimation methods for low dimensional non-stationary time series have been obtained, semiparametric modeling ideas have hardly been investigated so far, particularly for high dimensional nonstationary time series. For the iid case, there have been many works in the literature, including, but not limited to, Horowitz and Lee (2005), Horowitz et al. (2006) and Horowitz (2006) for the moderate dimension case, and Horowitz and Huang (2012) and Huang et al. (2010) for the high dimension case.

In such situations, dimension reduced temporal and spatial (over the space of variables) structures come into play. If people still use either high dimensional "static" methods, which are initially designed for independent data, or low dimensional multivariate time series techniques (on a few concentrated series), they might disregard potentially relevant information, e.g. losing the time dynamics or the space dependence structure, or being "forced" to perform "naive" aggregation, which may produce suboptimal forecasts and would be extremely inefficient. For example, in macroeconomics, this potentially creates an omitted variable bias with adverse consequences both for structural analysis and forecasting. Christiano et al. (1999) point out that the positive reaction of prices in response to a monetary tightening, the so-called price puzzle, is an artefact resulting from the omission of forward-looking variables, such as the commodity price index. The more scattered and dynamic the information is, the more severe this loss becomes. To this end, an integrated solution addressing both issues is appealing. One needs to analyze time and space dynamics jointly by simultaneously fitting a time series evolution and fine tuning the factors involved. The solution we are seeking tries to understand the spatial pattern, gain strength from the different time points, and, at the same time, analyze the nonstationary temporal behavior of the value at each spatial point. In this article, we present and investigate the so-called generalized dynamic semiparametric factor model (together with its corresponding panel version) to address this problem.

To address these challenges, the recent literature has proposed ways to impose restrictions on the covariance structure to limit the number of parameters to be estimated. Dynamic factor models, introduced by Forni et al. (2000), Stock and Watson (2002a) and Stock and Watson (2002b), and also discussed in Forni et al. (2005) and Giannone et al. (2005), draw upon the idea that the intertemporal dynamics can be explained and represented by a few common factors (low dimensional time series). Another approach in this field is Park et al. (2009), where a latent L-dimensional process Z_1, . . . , Z_T is introduced, and the J-dimensional random process Y_t = (Y_{t,1}, . . . , Y_{t,J})^⊤, t = 1, . . . , T, is represented

as

Y_{t,j} = Z_{t,1} m_{1,j} + · · · + Z_{t,L} m_{L,j} + ε_{t,j}, j = 1, . . . , J, t = 1, . . . , T, (1)

where the Z_{t,l} are the common factors depending on time, the ε_{t,j} are errors or specific factors, and the coefficients m_{l,j} are factor loadings. The index t = 1, . . . , T reflects the time evolution; {Z_t}_{t=1}^T (with Z_t = (Z_{t,1}, . . . , Z_{t,L})^⊤) is assumed to be a stationary random process; and m_l = (m_{l,1}, . . . , m_{l,J})^⊤ captures the spatial dependency structure. The study of the time behavior of the high-dimensional Y_t is then simplified to the modeling of Z_t, which is more feasible when L ≪ J. The model (1) reduces to a special case of the generalized dynamic factor model (approximate factor model) considered by Forni et al. (2000), Forni et al. (2005) and Hallin and Liska (2007), when Z_{t,l} = a_{l,1}(B) U_{t,1} + · · · + a_{l,q}(B) U_{t,q}, where the q-dimensional vector process U_t = (U_{t,1}, . . . , U_{t,q})^⊤ is an orthonormal white noise and B stands for the lag operator. In this case, the model (1) is expressed as Y_{t,j} = m_{0,j} + Σ_{k=1}^q b_{k,j}(B) U_{t,k} + ε_{t,j}, where b_{k,j}(B) = Σ_{l=1}^L a_{l,k}(B) m_{l,j}. Less general models in the literature include the static factor models proposed by Stock and Watson (2002a) and Stock and Watson (2002b), and the exact factor models suggested by Sargent and Sims (1977) and Geweke (1977).

Our goals of modeling high dimensional nonstationary time series are achieved by using a sparse representation approach to regression. In fact, we combine spatio-temporal modeling with the group Lasso, Yuan and Lin (2006). We approximate the temporal common factors and the spatial factor loadings both by linear combinations of series terms. Since the temporal nonstationarity might result from different sources, the choice of basis functions is important. We start by introducing an over-parameterized model, which may capture (almost) any type of temporal behavior, e.g. cyclic behavior plus linear or quadratic trends, by utilizing series bases such as powers, trigonometric functions, local polynomials, periodic functions or B-splines, and then select a sparse sub-model using penalizing Lasso and group Lasso techniques.

In practice, there might be multiple subjects, each of which itself corresponds to a set of high dimensional time series. For example, in international economics, industrial organization, or finance studies, there are data for many countries, firms or assets, each of which is high dimensional too. Thus we also need to provide a panel version of the high dimensional time series model to address this issue.

Compared with the work in the literature, the novelty of this article lies in the following aspects.

• First, when the time process is not stationary, as is common in economics, i.e. the process has a non-linear, non-parametric temporal structure in time, a skillful selection of the time basis makes the model a useful tool for economic studies. To achieve a successful selection, the key assumption

is that the initially proposed time basis functions should not be too dependent, although their number can be large; i.e. we should include as many orthogonal time basis functions as possible for the automatic selection. From the large panel time series modeling point of view, we incorporate nonstationarity and nonlinearity (complex trends) into the time dynamics and deviate from most of the current literature, which still requires Z_t to be stationary and a large number of observations (relative to the dimensionality) for establishing asymptotic properties. Through a wide choice of linear and nonlinear time basis functions, we are able to handle very complex time series.

• Second, the contribution of the paper lies in the way the time dynamics is introduced for the variable selection and regularization methods. The non-asymptotic theoretical properties of existing methods are established under an independence scenario. We extend them to a dependent scenario (β-mixing and m-dependent processes) with weakly cross-correlated error terms (details specified later in Assumption 3.2), and derive oracle sparsity inequalities (non-asymptotic risk bounds). The key assumption there is that the temporal dependence of the error term is controlled within some level. This result is not built upon any specific forms of the time and space basis.

• Third, when the "space" structure of the m_l's is complex, too low dimensional parameterizations do not capture it properly. We employ a data driven method, introduced by Hall et al. (2006), to capture the spatial dependence structure.

• Fourth, for the case that there might be multiple subjects, each of which corresponds to a set of high dimensional time series, a panel version of the model with a corresponding estimation method is provided.

In a variety of applications, one has explanatory variables X_{t,j} ∈ R^d at hand, e.g. the geographical coordinates of weather stations, the voxels (volume elements, representing values on regular grids) of functional magnetic resonance imaging, or the variables moneyness and time-to-maturity for implied volatility modeling, which may influence the factor loadings m_l. An important refinement of the model (1) is to incorporate the existence of observable covariates X_{t,j}, from Park et al. (2009). The factor loadings are then generalized to functions of X_{t,j}. In the following, we write X_t = (X_{t,1}, . . . , X_{t,J})^⊤ and consider the generalization of (1):

Y_{t,j} = Z_t^⊤ m(X_{t,j}) + ε_{t,j}, t = 1, . . . , T, (2)

Figure 1: Networks of China's weather stations (left) and the moving average (of 730 nearby days) view of temperatures of China from Jan 1st, 1957 to Dec 31st, 2009 (right).

where Y_{t,j}, ε_{t,j} ∈ R, X_{t,j} ∈ R^d, m : R^d → R^L, Z_t ∈ R^{1×L}.

Our motivating example comes from temperature analysis for pricing weather derivatives. The data set is taken from the Climatic Data Center of the China Meteorological Administration. It contains daily observations from 159 weather stations across China from Jan 1st, 1957 to Dec 31st, 2009. We would like to address the question whether there is a change in time, but also to permit a different trend in time in the different climate types, as shown in Figure 1 (left). Besides the well known seasonality effect, we may expect a climate change related trend. If we take the moving average of 730 nearby days, that is, (159 · 730)^{−1} Σ_{s=−364}^{+365} Σ_{j=1}^{159} Y_{t+s,j} with Y_{t,j} being the temperature of the j-th weather station at time t, Figure 1 (right) shows a "large period" (around 10 years between peaks) and an upward trend of China's temperatures. Beyond these trends, there is also stochasticity inherent in the remaining time dynamics, which is essential for pricing weather derivatives and hedging weather risks. Studying the dynamics of temperatures in various places w.r.t. X_{t,j} = X_j (the three-dimensional geographical information of the j-th weather station) simultaneously will enable us to estimate, forecast and price temperatures in time and space.
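The 730-day moving average described above is easy to sketch numerically. The snippet below is a minimal illustration, not the actual station records: the toy data (3 hypothetical stations, a yearly sine cycle, a small linear warming trend) are made up for demonstration.

```python
import numpy as np

def moving_average_730(Y):
    """730-day moving average of the cross-station mean temperature.

    Y is a (T, J) array: T days by J stations.  Returns a series of
    length T - 729, mimicking (J * 730)^{-1} sum_s sum_j Y_{t+s,j}.
    """
    daily_mean = Y.mean(axis=1)            # average over the J stations
    window = np.full(730, 1.0 / 730.0)     # uniform two-year window
    return np.convolve(daily_mean, window, mode="valid")

# toy data: 5 years, 3 hypothetical stations, yearly cycle + slight warming
rng = np.random.default_rng(0)
T, J = 5 * 365, 3
t = np.arange(T)
Y = (15.0 + 10.0 * np.sin(2 * np.pi * t / 365)[:, None]
     + 0.002 * t[:, None] + rng.normal(0.0, 2.0, size=(T, J)))
smooth = moving_average_730(Y)
print(smooth.shape)  # (1096,)
```

Because the window spans exactly two yearly cycles, the seasonal component averages out and only the slow trend (plus attenuated noise) survives, which is what makes the "large period" and the upward drift visible in Figure 1 (right).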

The rest of the article is organized as follows. In the next section, we present more details of the generalized dynamic semiparametric factor model (GDSFM), together with the corresponding basis selection, estimation procedure and panel model. The estimate's properties under various scenarios are presented in Section 3. In Section 4, the method is applied to the motivating problem: the dynamic behavior of temperatures. In Section 5, we present the results of simulation studies that mimic the previous empirical example. Section 6 contains concluding remarks. All technical proofs are sketched in the appendix.

2 Generalized Dynamic Semiparametric Factor Models

We observe (X_{t,j}, Y_{t,j}) for j = 1, . . . , J and t = 1, . . . , T, with Y_{t,j} ∈ R, X_{t,j} ∈ R^d and ε'_{t,j} ∈ R, such that

Y_{t,j} = Z_t^⊤ A* Ψ(X_{t,j}) + ε'_{t,j} = (U_t^⊤ Γ* + Z_{0,t}^⊤) A* Ψ(X_{t,j}) + ε'_{t,j},

where A* and Γ* are the L × K and R × L (unknown) underlying coefficient matrices and Z_t has the two components Γ*^⊤ U_t + Z_{0,t}. Writing Y_t = (Y_{t,1}, . . . , Y_{t,J})^⊤, X_t = (X_{t,1}, . . . , X_{t,J})^⊤, ε'_t = (ε'_{t,1}, . . . , ε'_{t,J})^⊤ and Ψ(X_t) = (Ψ(X_{t,1}), . . . , Ψ(X_{t,J})) (abbreviated as Ψ_t), we rewrite this in compact form as

Y_t^⊤ = (U_t^⊤ Γ* + Z_{0,t}^⊤) A* Ψ(X_t) + ε'_t^⊤ = U_t^⊤ Γ* A* Ψ(X_t) + Z_{0,t}^⊤ A* Ψ_t + ε'_t^⊤. (3)

Introducing β*^⊤ = Γ* A* (the R × K unknown underlying coefficient matrix with entries β_{rk}) and ε_t^⊤ = Z_{0,t}^⊤ A* Ψ_t + ε'_t^⊤, we can rewrite this further as

Y_t^⊤ def= U_t^⊤ β*^⊤ Ψ_t + ε_t^⊤. (4)

Notice that:

• Time evolution / common factors: Z_t = (Z_{t,1}, . . . , Z_{t,L})^⊤ is an unobservable L-dimensional process consisting of both a deterministic portion, Γ*^⊤ U_t, and a stochastic one, Z_{0,t}. Here {Z_{0,t}}_{t=1}^T is a stationary process to be detailed later. A key difference to Park et al. (2009) is this additional "nonstationary" component Γ*^⊤ U_t.

• Factor loading functions and error terms: m(X_{t,j}) = A* Ψ(X_{t,j}) is an L-tuple (m_1, . . . , m_L) of unknown real-valued functions m_l defined on a subset of R^d, and ε'_t = (ε'_{t,1}, . . . , ε'_{t,J})^⊤ are the errors. Throughout the paper, we assume that the covariates X_{t,j} have support [0, 1]^d. The error terms ε_t and ε'_t only need to satisfy some mild conditions (details specified later in Assumption 3.2 and Assumption 3.3.3), which allow them to be weakly dependent (over time) and cross-correlated (over space).

• Time and space basis: use a series expansion to capture the time trend and the space dependence structure. Select U_t^⊤ = (u_1(t), . . . , u_R(t)) as the 1 × R vector of time basis functions (polynomial and harmonic functions etc.), which are selected and weighted by the matrix Γ*. For the space basis, take Ψ_t = (ψ_1(X_t), . . . , ψ_K(X_t))^⊤ (a K × J matrix). For every matrix β, we introduce β_r = (β_{kr}, 1 ≤ k ≤ K), that is, the column vector formed by the coefficients corresponding to the r-th time basis function. Additionally, we define the mixed (2, 1) norm ‖β‖_{2,1} = Σ_{r=1}^R √(Σ_{k=1}^K β_{rk}²). Finally, we set R(β) = {r : β_r ≠ 0} and M(β) = |R(β)|, where |R(β)| denotes the cardinality of the set R(β). For the sake of simplicity and convenience, we use | · | to denote the L1 norm for vectors and ‖ · ‖ to denote the L2 norm for vectors or the mixed (2, 1) norm for matrices.
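As a quick numerical illustration of these definitions (the matrix below is made up): the mixed (2, 1) norm sums the Euclidean norms of the columns β_r, and M(β) counts the nonzero columns.

```python
import numpy as np

# beta has K = 2 rows and R = 3 columns; beta_r is the r-th column
beta = np.array([[3.0, 0.0, 1.0],
                 [4.0, 0.0, 0.0]])
col_norms = np.linalg.norm(beta, axis=0)   # ||beta_r|| for r = 1, 2, 3
mixed_norm = col_norms.sum()               # ||beta||_{2,1} = 5 + 0 + 1
active = np.flatnonzero(col_norms > 0)     # R(beta); M(beta) = its size
print(mixed_norm, len(active))             # 6.0 2
```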

Since the nonstationary behavior might be very complex, to ensure that all the trends leading the time series to be nonstationary are considered, the dimension R of the initially included time basis might be large. For example, in the temperature analysis, since we never know the exact frequency (or frequencies) of the period(s), at the beginning we include all the basis functions which we think might be useful for capturing the nonstationary behavior, e.g. 16 Fourier basis functions w.r.t. different frequencies and 53 × 3 (year by year) cubic polynomial basis functions. Consequently we end up with R = 175. On the other hand, to avoid overfitting, variable selection with regularization techniques is necessary. A popular variable selection method is the Lasso, Tibshirani (1996). An extension for factor structured models is the group Lasso, Yuan and Lin (2006), in which the penalty term is a mixed (2, 1)-norm of the coefficient matrix. Here, we assume that the vectors β_r are not only sparse, but also have the same sparsity pattern across different factors. We study the estimate's theoretical sparsity properties relating to the time basis selection, and take (3) to be the "true" model. Since the group Lasso permits over-parametrization, this is a mild assumption. We would also like to emphasize that our nonasymptotic sparse oracle inequality results are independent of the specifications of the time and space basis. They apply equally to local polynomials, periodic functions such as sin and cos, or B-splines etc., while we just assume that there is no additional approximation error for obtaining the space basis at this nonasymptotic analysis step.
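To make the structure of (3) and (4) concrete, here is a minimal sketch that generates data from the model. Everything here is an illustrative assumption: the dimensions, the AR(1) dynamics for Z_{0,t}, the sparsity pattern of β*, and the use of random orthonormal columns standing in for actual polynomial/Fourier time bases and smoothed space bases.

```python
import numpy as np

rng = np.random.default_rng(1)
T, J, K, R, L = 500, 50, 6, 8, 2     # illustrative sizes, L << J

# time basis U (T x R): random orthogonal columns scaled so that
# T^{-1} sum_t u_r(t)^2 = 1, standing in for polynomial/Fourier terms
U = np.linalg.qr(rng.normal(size=(T, R)))[0] * np.sqrt(T)

# space basis Psi (K x J), orthonormalized rows: Psi @ Psi.T = I_K
Psi = np.linalg.qr(rng.normal(size=(J, K)))[0].T

# sparse beta*^T = Gamma* A* (R x K): only the first 3 time bases load
beta_T = np.zeros((R, K))
beta_T[:3, :] = rng.normal(size=(3, K))

# stationary stochastic factors Z_{0,t} (AR(1)) and loadings A* (L x K)
A = rng.normal(size=(L, K))
Z0 = np.zeros((T, L))
for s in range(1, T):
    Z0[s] = 0.5 * Z0[s - 1] + rng.normal(size=L)

eps_prime = 0.1 * rng.normal(size=(T, J))
Y = U @ beta_T @ Psi + Z0 @ A @ Psi + eps_prime   # rows are Y_t^T, model (3)
print(Y.shape)  # (500, 50)
```

The two-step procedure below aims to recover the sparse rows of `beta_T` (which time bases matter) and then the detrended stochastic part driven by `Z0`.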

2.1 A Panel Version with Multiple Individuals

Here we present a panel version of (3) based on assumptions closely related to the fMRI neuro-economics study of Mohr et al. (2010). It is reasonable to assume that different subjects have different patterns of brain activation (to the external stimuli), represented by the time series Z_t, but that they (and all human beings) share essentially the same spatial structure of the brain, represented by the space function A*Ψ_t. With a panel of I subjects, we formulate the following generalization of (3) and (4):

Y^i_{t,j} = Σ_{l=1}^L (Z^i_{0,t,l} + U_t^⊤ Γ^i_l) m_l(X_{t,j}) + ε^i_{t,j}, 1 ≤ j ≤ J_t, 1 ≤ t ≤ T, 1 ≤ i ≤ I, (5)

where the fixed effect Z^i_{0,t,l} and Γ^i_l are the individual effects on the functions m_l for subject i at time point t. For identification purposes, assume E(Σ_{i=1}^I Σ_{l=1}^L Z^i_{0,t,l} m_l(X_{t,j}) | X_{t,j}) = 0. For this data structure, using Ȳ_{t,j} to denote the average of Y^i_{t,j} across the different subjects i, we have from (5):

Ȳ_{t,j} = Σ_{l=1}^L (U_t^⊤ Γ̄_l) m_l(X_{t,j}) + ε̄_{t,j}, 1 ≤ j ≤ J,

and the two-step estimation procedure for the panel version of the model is as follows:

I Take the average of Y^i_{t,j} across the different subjects i, and estimate the common basis functions in space m̂_l as in the original approach; see Subsection 2.4 for more details.

II Given the common m̂_l's, estimate the subject-specific time factors Z^i_{t,l} from

Y^i_{t,j} = Σ_{l=1}^L (Z^i_{0,t,l} + U_t^⊤ Γ^i_l) m̂_l(X_{t,j}) + ε^i_{t,j}.

Next we discuss the choice of the time basis U_t, the space basis Ψ_t, and the estimation procedure for (4).

2.2 Choice of Time Basis

To capture the global trend in time, one may use any orthogonal polynomial basis, e.g. u_1(t) = 1/C_1, u_2(t) = t/C_2, u_3(t) = (3t² − 1)/C_3, . . . (the C_i are generic normalizing constants with T^{−1} Σ_{t=1}^T u_r²(t)/C_r² = 1). One may also use the fact that there are natural frequencies in the data and start with a few harmonic functions. In the temperature example, the yearly cycle and a "large" period are two clear phenomena. To capture these periodic variations, one may use Fourier series, u_4(t) = sin(2πt/p)/C_4, u_5(t) = cos(2πt/p)/C_5, u_6(t) = sin{2πt/(p/2)}/C_6, u_7(t) = cos{2πt/(p/2)}/C_7, . . ., with the given periods p: 365 and 10 for the yearly cycle and the "large" period respectively. In the fMRI application of Myšičková et al. (2013), in which the basic experiment is repeated every 29.5 seconds, we have the period p = 11.8 (there is an fMRI scan every 2.5s). In general, to accommodate various kinds of nonlinearity and nonstationarity, various basis functions can be employed, such as powers, trigonometric functions, local polynomials, periodic functions or B-splines. The theory presented later for selecting the significant time basis functions is actually independent of their specific forms, and is thus very useful in practice.
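A minimal construction of such a candidate time basis matrix might look as follows; the particular periods, the polynomial degree, and the normalization convention are illustrative choices, not the paper's exact design for the temperature data.

```python
import numpy as np

def time_basis(T, periods=(365.0, 3650.0), poly_degree=2):
    """Build a normalized candidate time basis matrix U of shape (T, R).

    Columns: a constant, polynomial trend terms up to `poly_degree`, and
    one sin/cos pair per candidate period (here a yearly cycle and a
    roughly 10-year "large" period, both illustrative).  Each column is
    scaled so that T^{-1} sum_t u_r(t)^2 = 1.
    """
    t = np.arange(1, T + 1, dtype=float)
    cols = [np.ones(T)]                          # constant term
    for d in range(1, poly_degree + 1):
        cols.append(t ** d)                      # global trend terms
    for p in periods:
        cols.append(np.sin(2 * np.pi * t / p))   # harmonic pair per period
        cols.append(np.cos(2 * np.pi * t / p))
    U = np.column_stack(cols)
    U /= np.sqrt((U ** 2).mean(axis=0))          # normalize each column
    return U

U = time_basis(T=1000)
print(U.shape)  # (1000, 7): 1 constant + 2 polynomial + 2 * 2 harmonic columns
```

In practice one would over-populate this matrix (many frequencies, year-by-year polynomial pieces, R = 175 in the temperature study) and let the group Lasso of Subsection 2.4 prune it.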


2.3 Choice of Space Basis

There are various choices for a space basis. In Park et al. (2009), a multidimensional B-spline basis was proposed. Alternatively, functional PCA, Hall et al. (2006), may be employed, which combines smoothing techniques with ideas related to functional principal component analysis. The basic steps are as follows:

I Calculate the covariance operator (in a functional sense). Denote X_{t,j} = (X¹_{t,j}, . . . , X^d_{t,j}), u = (u_1, . . . , u_d) and v = (v_1, . . . , v_d) (similarly for b, b̂, b_1, b̂_1, b_2 and b̂_2). Given u ∈ [0, 1]^d and bandwidths h_µ and h_φ, define (â, b̂) to minimize

Σ_{t=1}^T Σ_{j=1}^{J_t} {Y_{t,j} − a − b^⊤(u − X_{t,j})}² K{(X_{t,j} − u)/h_µ},

and take µ̂(u) = â. Then, given u, v ∈ [0, 1]^d, choose (â_0, b̂_1, b̂_2) to minimize

Σ_{t=1}^T Σ_{1≤j≠k≤J_t} {Y_{t,j} Y_{t,k} − a_0 − b_1^⊤(u − X_{t,j}) − b_2^⊤(v − X_{t,k})}² K{(X_{t,j} − u)/h_φ} K{(X_{t,k} − v)/h_φ}.

Denote â_0 by φ̂(u, v) and construct µ̂(v) similarly to µ̂(u). The estimate of the covariance operator is then

ψ̂(u, v) = φ̂(u, v) − µ̂(u) µ̂(v). (6)

II Compute the principal space basis. Compute from (6) the largest K eigenvalues and the corresponding orthonormal eigenfunctions as the basis ψ̂_1(x), . . . , ψ̂_K(x). For computational methods and practical considerations we refer to Section 8.4 of Ramsay and Silverman (2005).
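The two steps above simplify considerably when every X_{t,j} = X_j lies on a fixed common grid, as for weather stations observed daily. The sketch below makes that simplifying assumption and therefore skips the local linear kernel smoothing of Hall et al. (2006), estimating the covariance operator by the empirical covariance matrix and the space basis by its leading eigenvectors; the toy factor structure is invented for illustration.

```python
import numpy as np

def fpca_space_basis(Y, K):
    """Simplified space-basis extraction on a fixed common grid.

    Y : (T, J) observations, one row per time point, one column per
        grid location.  Returns a (K, J) matrix whose rows are the
        orthonormal eigenvectors of the empirical covariance, i.e. a
        grid version of psi_hat_1, ..., psi_hat_K from step II.
    """
    centered = Y - Y.mean(axis=0)
    psi_hat = centered.T @ centered / Y.shape[0]   # estimate of psi(u, v)
    eigvals, eigvecs = np.linalg.eigh(psi_hat)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :K].T               # top-K, as rows

# toy two-factor data on a 30-point spatial grid
rng = np.random.default_rng(2)
T, J = 400, 30
grid = np.linspace(0.0, 1.0, J)
loadings = np.vstack([np.sin(np.pi * grid), np.cos(np.pi * grid)])
Y = rng.normal(size=(T, 2)) @ loadings + 0.05 * rng.normal(size=(T, J))
Psi_hat = fpca_space_basis(Y, K=2)
print(Psi_hat.shape)  # (2, 30)
```

With scattered covariates the empirical covariance is not available and the kernel-smoothed estimates µ̂ and φ̂ of step I take its place; the eigendecomposition step is unchanged.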

As remarked by Hall et al. (2006), the operator defined by (6) is not necessarily positive semidefinite, but it is assured to have real eigenvalues. Theorem 1 of Hall et al. (2006) provides the theoretical foundation that the bandwidths h_µ and h_φ should be chosen as O(T^{−1/5}) to minimize the distance between the estimates ψ̂ and the corresponding true functions. In Section 4 (details presented later), we find that the performance of β̂ is very robust to the choice of the smoothing parameter here.

We would like to emphasize that the space basis function Ψ̂_t is only an estimate of the true (unobservable) Ψ_t. In proving the time basis selection's properties, as in Theorem 3.2 and Corollary 3.1, we assume that this space basis estimation does not affect the study of selecting the temporal basis, since otherwise the non-asymptotic theoretical derivation would be too complex. If we stick to the B-spline basis as in Park et al. (2009), none of the subsequent proofs needs to be modified. For simplicity of notation, we continue to use Ψ_t to denote this estimate of the space basis in what follows.

We apply this method to the implied volatility modeling problem, as already discussed in detail in Park et al. (2009). Figure 2 displays the space basis modeled using the FPCA approach, which captures the special "smiling" effect well, something spline bases cannot do well.

Figure 2: Space basis using the FPCA approach for IVS modeling: the first and second eigenfunctions (interpolated), plotted against moneyness and time to maturity.

2.4 Estimation Procedure

We have now collected sufficient tools to present the estimation method:

I Given the pre-specified time and space bases, find the significantly loaded time basis functions (i.e. the coefficients β) utilizing the group Lasso technique by minimizing

min_β T^{−1} Σ_{t=1}^T (Y_t^⊤ − U_t^⊤ β^⊤ Ψ_t)(Y_t^⊤ − U_t^⊤ β^⊤ Ψ_t)^⊤ + 2λ‖β‖_{2,1}. (7)

Here we use T^{−1} instead of (JT)^{−1} because the space basis has been orthonormalized (Ψ̂_t Ψ̂_t^⊤ = I_K).

II Split the joint matrix β̂ into two separate coefficient matrices Γ̂ and Â by taking Γ̂ as the L eigenvectors of β̂ β̂^⊤ (w.r.t. the L largest eigenvalues) and Â = Γ̂^⊤ β̂. Given Y_t^⊤ − U_t^⊤ β̂^⊤ Ψ_t and Â, Ψ_t, estimate Z_{0,t} by the OLS method.

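Step I can be sketched with a proximal-gradient (blockwise soft-thresholding) solver. This is a simplified illustration, not the authors' implementation: it assumes a time-invariant, already orthonormalized space basis Ψ (Ψ Ψ^⊤ = I_K), and the data sizes, the sparsity pattern and λ below are hypothetical.

```python
import numpy as np

def group_lasso_step(Y, U, Psi, lam, n_iter=200):
    """Proximal-gradient solver for the group-Lasso criterion (7).

    Y: (T, J) observations, U: (T, R) time basis, Psi: (K, J) space
    basis with Psi @ Psi.T = I_K.  Returns B = beta^T of shape (R, K);
    each row of B is one group (one time basis function across the K
    space basis functions).  Objective:
        T^{-1} ||Y - U B Psi||_F^2 + 2 * lam * sum_r ||B_r||.
    """
    T, R = U.shape
    B = np.zeros((R, Psi.shape[0]))
    step = T / (2.0 * np.linalg.eigvalsh(U.T @ U).max())  # 1 / Lipschitz
    for _ in range(n_iter):
        grad = -(2.0 / T) * U.T @ (Y - U @ B @ Psi) @ Psi.T
        V = B - step * grad
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        # row-wise soft threshold: prox of the 2*lam*||.||_{2,1} penalty
        B = np.maximum(0.0, 1.0 - 2.0 * lam * step / np.maximum(norms, 1e-12)) * V
    return B

# toy data: only the first 2 of R = 10 time basis functions truly load
rng = np.random.default_rng(3)
T, J, K, R = 300, 40, 5, 10
U = np.linalg.qr(rng.normal(size=(T, R)))[0] * np.sqrt(T)
Psi = np.linalg.qr(rng.normal(size=(J, K)))[0].T
B_true = np.zeros((R, K))
B_true[:2, :] = 2.0 * rng.normal(size=(2, K))
Y = U @ B_true @ Psi + 0.1 * rng.normal(size=(T, J))

B_hat = group_lasso_step(Y, U, Psi, lam=0.05)
print(np.linalg.norm(B_hat, axis=1).round(2))  # only rows 0 and 1 are nonzero
```

Because the penalty is on whole rows of B, entire time basis functions are switched off at once, which is exactly the group sparsity assumed for the β_r.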

It is worthwhile to note that both Γ (respectively Z_{0,t}) and A are not identifiable in the model (3), since trivially Γ* A* = (Γ* B)(B^{−1} A*). However, if we concentrate on prediction, identification of β (as the product of Γ and A, as in (7)) is enough. Additionally, we show that for any version of {Z_{0,t}}, there exists a version of {Ẑ_{0,t}} whose lagged covariances are asymptotically the same as those of {Z_{0,t}}.

The group Lasso estimates depend on the tuning parameter λ. We implement an easily computable BIC-type criterion. The solution path is computed by evaluating the criterion on equally spaced λ's between 0 and λ_max = max_r ‖Σ_t Ψ_t Y_t U_{t,r}‖. We select the λ that minimizes

BIC(λ) = log(Σ_t ‖Y_t^⊤ − U_t^⊤ β̂^⊤ Ψ_t‖² / T) + log T · df / T, (8)

df = Σ_r 1{‖β̂_r‖ > 0} + Σ_r (‖β̂_r‖ / ‖β̂_r^{OLS}‖)(K − 1).

For reference purposes, we also list the formulae of the Cp, GCV and AIC criteria:

Cp(λ) = Σ_t ‖Y_t^⊤ − U_t^⊤ β̂^⊤ Ψ_t‖² / σ̃² − T + 2 df,

σ̃² = Σ_t ‖Y_t^⊤ − U_t^⊤ β̂_{OLS}^⊤ Ψ_t‖² / (T − df),

GCV(λ) = Σ_t ‖Y_t^⊤ − U_t^⊤ β̂^⊤ Ψ_t‖² / (1 − df/T)²,

AIC(λ) = log(Σ_t ‖Y_t^⊤ − U_t^⊤ β̂^⊤ Ψ_t‖² / T) + 2 df / T.
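The BIC-type criterion (8) with its approximate degrees-of-freedom term can be computed as below. The residual sums of squares on the toy λ path are made-up numbers chosen only to illustrate the trade-off between fit and model size.

```python
import numpy as np

def bic_lambda(rss, B_hat, B_ols, T, K):
    """BIC-type criterion (8) with approximate degrees of freedom.

    rss   : sum_t || Y_t^T - U_t^T beta_hat^T Psi_t ||^2 at this lambda
    B_hat : (R, K) group-Lasso estimate beta_hat^T
    B_ols : (R, K) unpenalized OLS estimate
    """
    n_hat = np.linalg.norm(B_hat, axis=1)
    n_ols = np.linalg.norm(B_ols, axis=1)
    ratio = np.divide(n_hat, n_ols, out=np.zeros_like(n_hat), where=n_ols > 0)
    df = np.sum(n_hat > 0) + ratio.sum() * (K - 1)
    return np.log(rss / T) + np.log(T) * df / T

# toy selection over a small lambda path (rss values are illustrative)
T, K = 500, 6
B_ols = np.ones((4, K))
candidates = {0.0: (np.ones((4, K)), 20.0),        # no shrinkage, overfits
              0.5: (0.5 * np.ones((4, K)), 22.0),  # moderate shrinkage
              5.0: (np.zeros((4, K)), 80.0)}       # everything killed
scores = {lam: bic_lambda(rss, B, B_ols, T, K)
          for lam, (B, rss) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # 0.5
```

The log T · df / T penalty is what lets BIC reject both the unshrunken fit (large df) and the over-shrunken one (large residual), mirroring the discussion of BIC versus the AIC-type criteria below.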

As pointed out by Yuan and Lin (2006) (for i.i.d. data), the performance of this approximate information criterion is generally comparable with that of the computationally much more expensive (especially for massive data) fivefold cross-validation. More importantly, since the data here are observed over time, the order of the observations matters, and hence a simple cross-validation procedure is not appropriate in a time series context; this does not affect the information criterion. Besides BIC, there are other types of parameter selection criteria, such as Cp, GCV and AIC. In terms of variable selection, Wang and Leng (2008) find that BIC is superior to Cp. The reason is that when there exists a true model, AIC-type criteria (including GCV and Cp) tend to overestimate the model size, Leng et al. (2004), Wang et al. (2007a) and Wang et al. (2007b). Subsequently, estimation accuracy using Cp may suffer. A theoretical justification that GCV overfits for the smoothly clipped absolute deviation method (Fan and Li (2001), SCAD) is given by Wang et al. (2007b). Analogous arguments apply to the Cp method too.

    3 Estimates’ Properties

    In this section, we study sparse oracle inequalities for the estimate β̂ defined in (7) assuming the errors

    εt are dependent (β-mixing in Theorem 3.2 and m-dependent in Corollary 3.1). This work extends

    those of Lounici et al. (2009), Bickel et al. (2009) and Lounici (2008) concerning upper bounds on the

    prediction error and the distance between the estimator and the true matrix β∗.

    For the second step of the estimation procedure, an important question arises: is it justified, from

    an inferential point of view, to base further statistical inference on the detrended stochastic time

    series? Theorem 3.4 shows that the difference between the inference based on the estimated time

    series and “true” unobserved time series is asymptotically negligible.

Before stating the first theorem, we make the following assumption.

ASSUMPTION 3.1 There exists a positive number $\kappa = \kappa(s)$ such that
$$\min\Bigg\{\frac{\sqrt{\sum_t \|\Psi_t^\top \Delta U_t\|^2}}{\sqrt{T}\,\|\Delta_R\|} : |R| \le s,\ \Delta \in \mathbb{R}^{K\times R}\setminus\{0\},\ \|\Delta_{R^c}\|_{2,1} \le 3\|\Delta_R\|_{2,1}\Bigg\} \ge \kappa,$$

    where Rc denotes the complement of the set of indices R, ∆R denotes the matrix formed by stacking

    the rows of matrix ∆ w.r.t. row index set R.

Assumption 3.1 is essentially a restriction on the eigenvalues of $\sum_{t=1}^T U_t U_t^\top$ as a function of the sparsity $s$. It in fact requires the initially involved time basis not to be too dependent, which is naturally satisfied by orthogonal polynomials and Fourier series. Low sparsity means that $s$ is big and therefore $\kappa$ is small; $\kappa(s)$ is thus a decreasing function of $s$. See also Lemma 4.1 of Bickel et al. (2009) for more details and related discussions.

THEOREM 3.1 (Deterministic Part) Consider the model (4). Assume that $\Psi_t\Psi_t^\top = I_K$ (orthonormalized space basis), $T^{-1}\sum_{t=1}^T U_t^\top U_t/R = 1$, and the number of true nonzero time basis functions $M(\beta^*) \le s$. If the random event
$$\mathcal{A} = \Big\{2T^{-1}\max_{1\le r\le R}\sum_{t=1}^T\sum_{k=1}^K\sum_{j=1}^J \Psi_{tkj}\varepsilon_{tj}U_{tr} \le \lambda\Big\} \qquad (9)$$
holds for some $\lambda > 0$, and Assumption 3.1 is satisfied, then we have, for any solution $\hat\beta$ of (7):
$$T^{-1}\sum_{t=1}^T \|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le 16\, s\lambda^2\kappa^{-2}, \qquad (10)$$
$$K^{-1/2}\|\hat\beta - \beta^*\|_{2,1} \le 16\, s\lambda K^{-1/2}\kappa^{-2}, \qquad (11)$$
$$M(\hat\beta) \le 64\,\phi_{\max}^2\, s\,\kappa^{-2}. \qquad (12)$$

Notice that Theorem 3.1 is valid for any $J, R, T$ and any type of distribution of $\varepsilon_t$, and yields non-asymptotic bounds.

Since the standard assumption that the $\varepsilon_t$ are independent is often unsatisfied in practice, it is important to understand how the estimator behaves in the more general situation, i.e. with dependent error terms. As far as we know, our result is one of the first attempts to deal with dependent error terms for (group) Lasso variable selection techniques. Before moving on, we first recall the definition of $\beta$-mixing, which is an important measure of dependence between $\sigma$-fields (for time series). More precisely, following Doukhan (1994), let $(\Omega, \mathcal{F}, \mathsf{P})$ be a probability space and let $\mathcal{A}$, $\mathcal{B}$ be two sub-$\sigma$-algebras of $\mathcal{F}$; various measures of dependence between $\mathcal{A}$ and $\mathcal{B}$ have been defined as:

$$\beta(\mathcal{A},\mathcal{B}) = \frac{1}{2}\sup \sum_{i=1}^I\sum_{j=1}^J \big|\mathsf{P}(A_i \cap B_j) - \mathsf{P}(A_i)\,\mathsf{P}(B_j)\big|, \qquad (13)$$
$$\alpha(\mathcal{A},\mathcal{B}) = \sup \big|\mathsf{P}(A \cap B) - \mathsf{P}(A)\,\mathsf{P}(B)\big|, \quad A \in \mathcal{A},\ B \in \mathcal{B}, \qquad (14)$$
where the supremum in (13) is taken over all pairs of (finite) partitions $\{A_1, \ldots, A_I\}$ and $\{B_1, \ldots, B_J\}$ of $\Omega$ such that $A_i \in \mathcal{A}$ for each $i$ and $B_j \in \mathcal{B}$ for each $j$. Now suppose $\{V_t\}_{t\in T}$ is a (not necessarily stationary) sequence of random variables. For $-\infty \le i \le j \le \infty$, define the $\sigma$-field $\sigma_i^j = \sigma(V_t,\ i \le t \le j,\ t \in T)$. For each $a \ge 1$, define the following dependence coefficients:
$$\beta(a) = \sup_{t\in T}\beta(\sigma_{-\infty}^t, \sigma_{t+a}^\infty), \qquad \alpha(a) = \sup_{t\in T}\alpha(\sigma_{-\infty}^t, \sigma_{t+a}^\infty).$$
In the special case where the sequence $\{V_t\}_{t\in T}$ is strictly stationary, one simply has
$$\beta(a) = \beta(\sigma_{-\infty}^t, \sigma_{t+a}^\infty), \qquad \alpha(a) = \alpha(\sigma_{-\infty}^t, \sigma_{t+a}^\infty).$$

A stochastic process is said to be $\beta$-mixing (or $\alpha$-mixing) if $\beta(a) \to 0$ (or $\alpha(a) \to 0$) as $a \to \infty$. By definition, when $\sigma_{-\infty}^t$ and $\sigma_{t+a}^\infty$ are independent of each other, $\beta(a) = 0$; the closer $\beta(a)$ is to $0$, the more independent the time series is. A very natural question to ask is: to what extent is dependence (in terms of the $\beta$-mixing coefficients) allowed while certain sparse oracle inequalities can still be obtained, i.e. what is the relationship among the high dimensionality $R$, the moderate sample size $T$ and the $\beta$-mixing coefficients $\beta$?

We use the following mild technical assumption, similar to the typical bounded second moment requirement for i.i.d. data.

ASSUMPTION 3.2 The matrices $\Psi_t$ and $U_t$ and the random variables $\varepsilon_t$ are such that, for $V_t \overset{\mathrm{def}}{=} K^{-1/2}\sum_{k=1}^K\sum_{j=1}^J \Psi_{tkj}\varepsilon_{tj}U_{tr}$, there exists $\sigma^2$ such that $\forall n, m$: $m^{-1}\,\mathsf{E}(V_n + \ldots + V_{n+m})^2 \le \sigma^2$, and $|V_t| \le C''$ for all $t = 1, \ldots, T$ and all $r$, for some constants $\sigma^2, C'' > 0$.

    Note that since Vt (as a function of εtj) is defined as a sum over j, it also indicates that the error

    term εt could be weakly cross-correlated. We can now state our main result.

THEOREM 3.2 ($\beta$-mixing) Consider the model (4). Assume the sequence $\{V_t\}_{t=1}^T$ satisfies Assumption 3.2 and the $\beta$-mixing condition with $\beta$-mixing coefficients
$$\beta\Big(\Big[\tfrac{3}{8}\sigma\varepsilon^2 T^{1/2}(1-\varepsilon)^{1/2}C''^{-1}\big(\log R^{1+\delta'}\big)^{-1/2}\Big] - 1\Big) \le 24\sigma(1-\varepsilon)^{1/2}\Big(R^{1+\delta'}\sqrt{T\log R^{1+\delta'}}\,C''\Big)^{-1}$$
for any $\varepsilon > 0$, some $\delta' > 0$ and $\lambda$ defined below. Assume also $\Psi_t\Psi_t^\top = I_K$, $T^{-1}\sum_{t=1}^T U_t^\top U_t/R = 1$, and $M(\beta^*) \le s$. Furthermore let $\kappa$ be defined as in Assumption 3.1 and let $\phi_{\max}$ be the maximum eigenvalue of the matrix $\sum_{t=1}^T U_t U_t^\top/T$. Let
$$\lambda = \sqrt{\frac{16\log R^{1+\delta'}\,K\sigma^2}{T(1-\varepsilon)}}.$$
Then with probability at least $1 - 3R^{-\delta'}$, for any solution $\hat\beta$ of (7), we have:
$$T^{-1}\sum_{t=1}^T \|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le 256\,s\,\frac{\log R^{1+\delta'}\,K\sigma^2}{T(1-\varepsilon)}\,\kappa^{-2}, \qquad (15)$$
$$K^{-1/2}\|\hat\beta - \beta^*\|_{2,1} \le 96\,s\,\sqrt{\frac{\log R^{1+\delta'}\,\sigma^2}{T(1-\varepsilon)}}\,\kappa^{-2}, \qquad (16)$$
$$M(\hat\beta) \le 64\,\phi_{\max}^2\,s\,\kappa^{-2}. \qquad (17)$$

Remark I Before explaining the results, as also mentioned in Song and Bickel (2011), we would like to discuss some related results first. For technical simplicity, we consider the simplest linear regression model with $R \to \infty$:
$$e_t = x_{t1}\theta_1 + \ldots + x_{tR}\theta_R + \epsilon_t = x_t^\top\theta + \epsilon_t, \qquad (18)$$
with the regressors $(x_{t1}, \ldots, x_{tR}) = x_t^\top$, the coefficients $(\theta_1, \ldots, \theta_R) = \theta^\top$ and the error term $\epsilon_t$. Suppose $x$ in (18) has full rank $R$ and $\epsilon_t$ is $N(0, \sigma^2)$. Consider the least squares estimate ($R \le T$) $\hat\theta_{OLS} = (xx^\top)^{-1}xe$. Then from standard least squares theory, we know that the prediction error $\|x^\top(\hat\theta_{OLS} - \theta^*)\|_2^2/\sigma^2$ is $\chi_R^2$-distributed, i.e.
$$\frac{\mathsf{E}\|x^\top(\hat\theta_{OLS} - \theta^*)\|_2^2}{T} = \frac{\sigma^2}{T}\,R. \qquad (19)$$

In the sparse situation, if $\epsilon_t$ is $N(0, \sigma^2)$ (different from our case), Corollary 6.2 of Bühlmann and van de Geer (2011) shows that the Lasso estimate obeys the following oracle inequality:
$$\frac{\|x^\top(\hat\theta_{Lasso} - \theta^*)\|_2^2}{T} \le C_0\,\sigma^2\,\frac{\log R}{T}\,M(\theta^*) \qquad (20)$$
with large probability and some constant $C_0$. The additional $\log R$ factor here can be seen as the price to pay for not knowing the set $\{p : \theta_p^* \neq 0\}$, Donoho and Johnstone (1994). Similar to the i.i.d. Gaussian situation discussed above, the term $(\log R)^{1+\delta'}$ in (15) can be interpreted as the price to pay for not knowing the set $\{r : \beta_r^* \neq 0\}$. Here we have $(\log R)^{1+\delta'}$ instead of $\log R$ because we deviate from the typical i.i.d. Gaussian situation and establish the result under the more general Assumption 3.2, which can be thought of as a "finite second moment" condition; the $\delta'$ term is the price to pay for this deviation.

Remark II Since
$$\beta\Big(\Big[\tfrac{3}{8}\sigma\varepsilon^2 T^{1/2}(1-\varepsilon)^{1/2}C''^{-1}\big(\log R^{1+\delta'}\big)^{-1/2}\Big] - 1\Big) \le 24\sigma(1-\varepsilon)^{1/2}\Big(R^{1+\delta'}\sqrt{T\log R^{1+\delta'}}\,C''\Big)^{-1}$$
is required, as the dimensionality $R$ increases, the "allowed" dependence level reflected by the $\beta$-mixing coefficients must decrease fast enough, so that one still achieves risk bounds similar to those in the independent case. Intuitively this makes sense: if the dependence level inherent in $Z_{0,t}$ (or equivalently $\varepsilon_t$) is too strong, i.e. $\beta$ exceeds some level, the amount of information provided by these observations is smaller, and therefore the estimate does not perform well. On the other hand, strong dependence in $Z_{0,t}$ might be caused by some trend, which should be, but is not, included in $U_t^\top\Gamma$, and this results in the increased dependence. This tells us that at the start we should include a large enough number $R$ of prespecified time basis functions, so that they capture most of the deterministic (though possibly segment-by-segment) time evolution and the remaining dependence level in $Z_{0,t}$ is controlled.

COROLLARY 3.1 ($m$-dependent) Consider the model (4). Assume the sequence $\{V_t\}_{t=1}^T$ is an $m$-dependent process of order $k$ ($k \ge 1$) and satisfies the following conditions for some constants $\sigma_0^2, C'' > 0$ and $t = 1, \ldots, T$:

• $\forall t$, $\mathsf{E} V_t^2 \le \sigma_0^2$;

• $\big[\tfrac{3}{8}\sigma\varepsilon^2 T^{1/2}(1-\varepsilon)^{1/2}C''^{-1}(\log R^{1+\delta'})^{-1/2}\big] - 1 \ge k + 1$ for any $\varepsilon > 0$, some $\delta' > 0$ and $\lambda$ defined below;

• $\forall t$, $|V_t| \le C''$.

Assume also $\Psi_t\Psi_t^\top = I_K$, $T^{-1}\sum_{t=1}^T U_t^\top U_t/R = 1$, and $M(\beta^*) \le s$. Furthermore let $\kappa$ be defined as in Assumption 3.1, $\phi_{\max}$ be the maximum eigenvalue of the matrix $\sum_{t=1}^T U_t U_t^\top/T$ and $\lambda$ be defined as in Theorem 3.2. Then with probability at least $1 - 3R^{-\delta'}$, for any solution $\hat\beta$ of (7), we have:
$$T^{-1}\sum_{t=1}^T \|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le 512\,s\,\frac{\log R^{1+\delta'}\,K k\sigma_0^2}{T(1-\varepsilon)}\,\kappa^{-2}, \qquad (21)$$
$$K^{-1/2}\|\hat\beta - \beta^*\|_{2,1} \le 96\sqrt{2}\,s\,\sqrt{\frac{\log R^{1+\delta'}\,k\sigma_0^2}{T(1-\varepsilon)}}\,\kappa^{-2}, \qquad (22)$$
$$M(\hat\beta) \le 64\,\phi_{\max}^2\,s\,\kappa^{-2}. \qquad (23)$$

Remark III We can see that as $k$ increases, i.e. as the dependence in $\{V_t\}_{t=1}^T$ gets stronger, the risk bounds get larger. To ensure $\big[\tfrac{3}{8}\sigma\varepsilon^2 T^{1/2}(1-\varepsilon)^{1/2}C''^{-1}(\log R^{1+\delta'})^{-1/2}\big] - 1 \ge k + 1$, approximately we need $T^{1/2}(\log R^{1+\delta'})^{-1/2} \ge \big(\tfrac{3}{4}\sigma_0\varepsilon^2\sqrt{1-\varepsilon}\big)^{-1}C''\sqrt{k}$, which gives the requirement on the sample size $T$ (relative to the high dimensionality) and "the amount of information" from the data. Similar results could also be obtained separately for the generalized $m$-dependent process based on fractional cover theory and the (extended) McDiarmid inequality; see Theorem 2.1 of Janson (2004).

At the second step, $Z_{0,t}$ is estimated based on $\hat\beta$ instead of $\beta^*$, so we need to show that the influence of this plug-in estimate is negligible. Our result relies on the following assumptions, which are similar to Assumptions (A1)-(A8) in Park et al. (2009).

ASSUMPTION 3.3

3.3.1 The sets of variables $(X_{1,1}, \ldots, X_{T,J})$, $(\varepsilon'_{1,1}, \ldots, \varepsilon'_{T,J})$ and $(Z_{0,1}, \ldots, Z_{0,T})$ are independent of each other.

3.3.2 For $t = 1, \ldots, T$, the variables $X_{t,1}, \ldots, X_{t,J}$ are identically distributed, have support $[0,1]^d$ and a density $f_t$ that is bounded from below and above on $[0,1]^d$, uniformly over $t = 1, \ldots, T$.

3.3.3 We assume that $\mathsf{E}\,\varepsilon'_{t,j} = 0$ for $1 \le t \le T$, $1 \le j \le J$, and, for $c > 0$ small enough, $\sup_{1\le t\le T,\,1\le j\le J} \mathsf{E}\exp\{c(\varepsilon'_{t,j})^2\} < \infty$.

3.3.4 The vector of functions $m = (m_1, \ldots, m_L)^\top$ can be approximated by $\Psi_k$, i.e.
$$\delta_K \overset{\mathrm{def}}{=} \sup_{x\in[0,1]^d}\inf_{A\in\mathbb{R}^{L\times K}}\|m(x) - A\Psi(x)\| \to 0$$
as $K \to \infty$. We denote an $A$ that fulfills $\sup_{x\in[0,1]^d}\|m(x) - A\Psi(x)\| \le 2\delta_K$ by $A^*$.

3.3.5 There exist constants $0 < C_L < C_U < \infty$ such that the eigenvalues of $T^{-1}\sum_{t=1}^T Z_{0,t}Z_{0,t}^\top$ lie in the interval $[C_L, C_U]$ with probability tending to one.

3.3.6 For all $\beta$ and $A$ ($\beta^\top = \Gamma A$) in (7), with probability tending to one, we have
$$\sup_{x\in[0,1]^d}\max_{1\le t\le T}\|Z_{0,t}^\top A\Psi(x)\| \le M_T,$$
where the constant $M_T$ satisfies $\max_{1\le t\le T}\|Z_{0,t}\| \le M_T/C_m$ for a constant $C_m$ such that $\sup_{x\in[0,1]^d}\|m(x)\| < C_m$.

3.3.7 It holds that $\rho^2 = (K + T)M_T^2\log(JTM_T)/(JT) \to 0$. The dimension $L$ is fixed.

Assumption (3.3.6) and the additional bound $M_T$ in the minimization are introduced purely for technical reasons. They are similar to the requirement in Assumption 3.2 that $V_t$ is bounded, noticing that $V_t = K^{-1/2}\sum_{k=1}^K\sum_{j=1}^J \Psi_{tkj}\varepsilon_{tj}U_{tr}$ and $\varepsilon_t = Z_{0,t}^\top A^*\Psi_t + \varepsilon'_t$. Recall that, given $\beta$, the number of parameters still to be estimated equals $KT$ (for $\{Z_{0,t}\}_{t=1}^T$) plus $KL$ (for $A$; given $\beta$, if $A$ is fixed, $\Gamma$ is also fixed). Since $L$ is fixed, Assumption (3.3.7) basically requires that, neglecting the factor $M_T^2\log(JTM_T)$, the number of parameters grows more slowly than the number of observations $JT$.

THEOREM 3.3 Suppose that model (3), all assumptions in Theorem 3.2 and Assumption 3.3 hold. Then we have
$$\frac{1}{T}\sum_{1\le t\le T}\Big\|\hat Z_{0,t}^\top\hat A - Z_{0,t}^\top A^*\Big\|^2 = O_P(\rho^2 + \delta_K^2). \qquad (24)$$
In the following, we discuss how the statistical analysis differs if the inference on the stochasticity of $Z_{0,t}$ is based on $\hat Z_{0,t}$ instead of the (unobserved) process $Z_{0,t}$. We are going to establish theoretical properties under a strong mixing condition, which is more general than the $\beta$-mixing considered in Theorem 3.2. For the statement of the theorem, we need the following assumptions, which are similar to Assumptions (A9)-(A11) in Park et al. (2009):

ASSUMPTION 3.4

3.4.1 $Z_{0,t}$ is a strictly stationary sequence with $\mathsf{E}(Z_{0,t}) = 0$ and $\mathsf{E}(\|Z_{0,t}\|^\gamma) < \infty$ for some $\gamma > 2$. It is $\alpha$-mixing with $\sum_{i=1}^\infty \alpha(i)^{(\gamma-2)/\gamma} < \infty$. The matrix $\mathsf{E}\, Z_{0,t}Z_{0,t}^\top$ has full rank. The process $Z_{0,t}$ is independent of $X_{11}, \ldots, X_{TJ}, \varepsilon'_{11}, \ldots, \varepsilon'_{TJ}$.

3.4.2 It holds that $\big[\log(KT)^2\big\{(KM_T/J)^{1/2} + T^{1/2}M_T^4 J^{-2} + K^{3/2}J^{-1} + K^{4/3}J^{-2/3}T^{-1/6}\big\} + 1\big]\,T^{1/2}(\rho^2 + \delta_K^2) = O(1)$.

Assumption (3.4.2) imposes very weak conditions on the growth of $J$, $K$ and $T$. Suppose, for example, that $M_T$ is of logarithmic order and that $K$ is of order $(JT)^{1/5}$; then the condition requires that $T/J^2$ times a logarithmic factor converges to zero. As remarked by Doukhan (1994), if a stochastic process is $\beta$-mixing, then it is also $\alpha$-mixing with $2\alpha(\mathcal{A},\mathcal{B}) \le \beta(\mathcal{A},\mathcal{B})$. If the requirement on the $\beta$-mixing coefficients in Theorem 3.2 is satisfied, the requirement on the $\alpha$-mixing coefficients in Assumption (3.4.1) is usually satisfied.

Furthermore, note that the minimization problem (7) has a unique solution in $\beta$, but not in $\Gamma$ and $A$. If $(\hat Z_{0,t}, \hat A)$ is a minimizer, then so is $(B^\top\hat Z_{0,t}, B^{-1}\hat A)$, where $B$ is an arbitrary invertible matrix. With the choice $B = \big(\sum_{t=1}^T Z_{0,t}\hat Z_{0,t}^\top\big)^{-1}\sum_{t=1}^T Z_{0,t}Z_{0,t}^\top$, we get $\sum_{t=1}^T Z_{0,t}(\tilde Z_{0,t} - Z_{0,t})^\top = 0$, where $\tilde Z_{0,t} \overset{\mathrm{def}}{=} B^\top\hat Z_{0,t}$ and $\tilde A \overset{\mathrm{def}}{=} B^{-1}\hat A$. Without loss of generality, we may assume $T^{-1}\sum_{s=1}^T \hat Z_{0,s} = T^{-1}\sum_{s=1}^T Z_{0,s} = 0$. Additionally, define
$$\tilde Z_{n,t} = \Big(T^{-1}\sum_{s=1}^T \tilde Z_{0,s}\tilde Z_{0,s}^\top\Big)^{-1/2}\tilde Z_{0,t}, \qquad Z_{n,t} = \Big(T^{-1}\sum_{s=1}^T Z_{0,s}Z_{0,s}^\top\Big)^{-1/2}Z_{0,t}.$$
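The normalizations defining $\tilde Z_{n,t}$ and $Z_{n,t}$ are multiplications by an inverse symmetric square root of the empirical second-moment matrix; a minimal sketch on simulated data (the mixing matrix is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated centered factor process Z_t with correlated coordinates.
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.3]])
Z = rng.standard_normal((500, 3)) @ mix.T
Z -= Z.mean(axis=0)                          # center, as assumed in the text

C = Z.T @ Z / len(Z)                         # T^{-1} sum_t Z_t Z_t^T
w, V = np.linalg.eigh(C)
C_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T    # symmetric inverse square root
Zn = Z @ C_inv_sqrt                          # rows are the normalized Z_{n,t}
```

By construction, the normalized series has empirical second-moment matrix exactly equal to the identity.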

THEOREM 3.4 Suppose that model (3) holds. Besides all assumptions in Theorem 3.2, also let Assumptions 3.3 and 3.4 be satisfied. Then there exists a random matrix $B$ specified above such that for $h \ge 0$,
$$T^{-1}\sum_{t=1}^{T-h}\tilde Z_{0,t}\big(\tilde Z_{0,t+h} - \tilde Z_{0,t}\big)^\top - Z_{0,t}\big(Z_{0,t+h} - Z_{0,t}\big)^\top = O_P(T^{-1/2})$$
and
$$T^{-1}\sum_{t=1}^{T-h}\tilde Z_{n,t}\tilde Z_{n,t+h}^\top - Z_{n,t}Z_{n,t+h}^\top = O_P(T^{-1/2}).$$
In Theorem 3.4, we consider autocovariances of the estimated stochastic process $\hat Z_{0,t}$ and of the (unobserved) process $Z_{0,t}$, and show that these estimators differ only by second order terms. Thus the statistical analysis based on $\hat Z_{0,t}$ is asymptotically equivalent to that based on the (unobserved) process $Z_{0,t}$.


4 Dynamics of Temperature Analysis

This section presents the application to the analysis of temperature dynamics by fitting the daily temperature observations provided by the Climatic Data Center (CDC) of the China Meteorological Administration (CMA), Figure 1. To capture the upward trend, seasonal and "large period" effects, similarly to Racsko et al. (1991), Parton and Logan (1981) and Hedin (1991), we propose the initial choice of time basis (rescaling factors omitted) in Table 1.

Factors                 Basis functions
Trend (Year by Year)    1, t, 3t² − 1
Seasonal Effect         sin 2πt/365, cos 2πt/365, . . . , cos 10πt/365
Large Period            sin 2πt/(365 · 15), cos 2πt/(365 · 15), sin 2πt/(365 · 10),
                        cos 2πt/(365 · 10), sin 2πt/(365 · 5), cos 2πt/(365 · 5)

Table 1: Initial choice of 53 · 3 + 16 = 175 time basis functions.
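Such a time basis is easily assembled into a design matrix. A minimal sketch; for illustration it uses global rather than year-by-year polynomials, a 3-year span and no rescaling, so the column count differs from the 175 basis functions above:

```python
import numpy as np

t = np.arange(1, 365 * 3 + 1).astype(float)   # daily index over 3 years

cols = [np.ones_like(t), t, 3 * t**2 - 1]     # trend part
for mharm in range(1, 6):                     # seasonal harmonics up to cos(10*pi*t/365)
    cols += [np.sin(2 * np.pi * mharm * t / 365),
             np.cos(2 * np.pi * mharm * t / 365)]
for period in (15, 10, 5):                    # "large period" terms, in years
    cols += [np.sin(2 * np.pi * t / (365 * period)),
             np.cos(2 * np.pi * t / (365 * period))]

U = np.column_stack(cols)                     # row U_t holds the basis evaluated at day t
```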


    Figure 3: Relative proportion of variance explained by the first K basis and China’s climate types.

For the space basis, considering the relative proportion of variance explained by the first K basis functions (eigenvalues of the smoothed covariance operator) and the 5 climate types of China, as in Figure 3, the choice K = 5 is appealing. As discussed in Subsection 2.4, the choice of the tuning parameter λ is crucial here. Figure 4 presents the solution paths of 4 different selection criteria, Cp, GCV, AIC and BIC, evaluated on 500 equally spaced λ's, where the minimizer is marked as the red dot. As we can see, the minimizers of Cp, GCV and AIC are significantly smaller than BIC's, which confirms the previous discussion in the literature that AIC-type criteria (including GCV and Cp) tend to overestimate the model size and thus overfit. Our estimate also involves the smoothing bandwidth in the smoothed FPCA step, which, by Theorem 1 of Hall et al. (2006), should be chosen as O(T^{-1/5}) to minimize the distance between the estimated eigenfunctions ψ̂ and the corresponding true ones. Figure 5 presents the BIC solution path w.r.t. 4 different values (differing by a constant factor) of the smoothing parameter for the same 500 λ's as above. As we can see, the solution path is very stable w.r.t. the choice of the smoothing parameter.


Figure 4: Comparison of 4 different selection criteria, Cp, GCV, AIC and BIC, evaluated on 500 equally spaced λ's.

Figure 6 displays the estimated coefficients of the first factor with respect to the 54 · 3 yearly polynomial time basis under the optimal choice of λ selected by the BIC criterion. The coefficients of the constant, linear and quadratic terms are displayed as solid, dashed and dotted lines respectively, coupled with the corresponding 90% confidence intervals (based on year by year OLS estimates) represented by the thin lines (with the same color and style). The fact that all these coefficients are nonnegative indicates that over the past 50 years there might have been a warming effect across China.



    Figure 5: BIC solution path w.r.t. 4 different values of the smoothing parameter.

The confidence intervals are computed using ordinary least squares polynomial fitting to the year by year time series after removing the normal seasonality and "large period" effects. We observe unusually large positive (for the linear term) and negative (for the quadratic term) variation of the OLS estimates at the end of the 1960s, caused by extreme temperatures in China at that time. By employing shrinkage techniques, we get rid of this disadvantage and produce more stable estimates. The coefficient estimates of the 16 Fourier series time basis functions corresponding to the optimal λ are displayed in Table 2. They clearly indicate the 15-year period effect, which, as some meteorologists have claimed, is related to solar activity.


    Figure 6: Estimated coefficients of the 54 · 3 yearly polynomial time basis w.r.t. k = 1.


Basis                    Estimates
sin 2πt/365             −25.4922   1.1059   2.4129  −2.6985   1.2320
cos 2πt/365             −87.3303   1.8228   5.3358  −5.0823   1.6284
sin 4πt/365               0.0000   0.0000   0.0000   0.0000   0.0000
cos 4πt/365              −4.5532   0.8761   0.6752  −0.6709   0.9163
. . .                     0.0000   . . .
cos 10πt/365              0.0000   . . .
sin 2πt/(365 · 15)       11.7818  −0.0053  −1.4026   0.4743  −0.0214
cos 2πt/(365 · 15)        0.0000   . . .
. . .                     0.0000   . . .
cos 2πt/(365 · 5)         0.0000   . . .

Table 2: Estimated coefficients of the 5 factors w.r.t. the 20 Fourier series time basis.

Since the eigenvalues of $\hat\beta\hat\beta^\top$ are (10140, 208, 118, 44, 14, 0, 0, . . .), with the first five nonzero and the rest zero, we choose L = 5 and obtain the remaining 5-dimensional random process $\hat Z_{0,t}$, which can be further modeled by multivariate time series techniques. For example, if we use a VAR(1) process, $\hat Z_{0,t} = S\hat Z_{0,t-1} + \varepsilon_{0,t}$, where $\varepsilon_{0,t}$ is a random vector, then the estimated coefficient matrix is:

     0.7703   0.0103   0.0007   0.0015   0.0005
    −0.0552   0.1449  −0.1841  −0.0285   0.0003
    −0.3047  −0.3419   0.3877  −0.0436  −0.0020
     0.2078  −0.1717  −0.1337   0.8431   0.0071
     0.6345  −0.0484  −0.0447   0.0184   0.8338
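A VAR(1) coefficient matrix of this kind is obtained by multivariate least squares, regressing $\hat Z_{0,t}$ on $\hat Z_{0,t-1}$; a minimal sketch on simulated data (the true coefficient matrix below is an assumption for illustration, not the temperature factors):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a 5-dimensional stationary VAR(1): Z_t = S Z_{t-1} + e_t.
L, T = 5, 2000
S_true = 0.5 * np.eye(L)
Z = np.zeros((T, L))
for t in range(1, T):
    Z[t] = Z[t - 1] @ S_true.T + 0.1 * rng.standard_normal(L)

# Multivariate least squares: stack the equations Z_t ~ Z_{t-1}.
X, Y = Z[:-1], Z[1:]
S_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T   # row i: coefficients of equation i
```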

Compared with existing temperature modeling (weather derivatives pricing) techniques, e.g. Benth and Benth (2005), our approach has the following advantages. First, based on the high dimensional time series data, it offers an integrated analysis considering the space (high dimensionality) and time (dynamics) parts simultaneously, while forecasting at places other than the existing weather stations is also possible, since the space basis functions are functions of the geographical location. Second, it extracts the trend more clearly. Third, it provides theoretical justification for further inferential analysis based on $\hat Z_{0,t}$ instead of $Z_{0,t}$.

5 Simulation Study

Since simulation results on the performance of the group Lasso estimate itself are well documented in the literature, to evaluate the overall fitting performance of the GDSFM we conduct a Monte Carlo experiment designed to mimic the previous empirical example.

We generate random variables $\beta_1, \ldots, \beta_{175} \in \mathbb{R}^4$ such that all coordinates are i.i.d. standard normal r.v.'s. We randomly pick 80% of the coefficients $\beta_r$ from $\beta_1, \ldots, \beta_{175}$ and set them to $0 \in \mathbb{R}^4$. We choose the same time basis as in Table 1 with $p = 365$ and $T = 19345$. For the space part, inspired by Park et al. (2009), we consider $d = 2$ and the following tuple of 2-dimensional functions:
$$m_1(x_1, x_2) = 1, \qquad m_2(x_1, x_2) = 3.46(x_1 - 0.5),$$
$$m_3(x_1, x_2) = 9.45\{(x_1 - 0.5)^2 + (x_2 - 0.5)^2\} - 1.6, \qquad m_4(x_1, x_2) = 1.41\sin(2\pi x_2).$$

The functions here are chosen to be close to orthogonal. The design points $X_{t,j}$ are independently generated from a uniform distribution on the unit square. We generate $Y_t^\top = U_t^\top\beta^\top\Psi_t + \varepsilon_t$, $t = 1, \ldots, T$, with the following three types of error distributions:

I all coordinates of $\varepsilon_1, \ldots, \varepsilon_T$ are i.i.d. $N(0, 0.05)$ r.v.'s;

II the $\varepsilon_t$'s are generated from a centered VAR(1) process $\varepsilon_t = S\varepsilon_{t-1} + \eta_t$, where $S$ is a diagonal matrix with all diagonal entries equal to 0.4 and all entries of $\eta_t$ are $N(0, 0.84 \cdot 0.05)$ r.v.'s (so that $\mathrm{Var}(\varepsilon_t)$ is the same as in the independent case);

III same as above except that all diagonal entries of $S$ equal 0.8, i.e. a stronger dependence level, and the entries of $\eta_t$ are $N(0, 0.36 \cdot 0.05)$ r.v.'s.
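The three error designs differ only in the VAR(1) coefficient $\rho$; the innovation variances $0.84 \cdot 0.05$ and $0.36 \cdot 0.05$ are exactly $(1-\rho^2)\cdot 0.05$, which keeps the stationary variance at $0.05$. A minimal sketch (dimensions reduced for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_errors(T, J, rho, var=0.05):
    """Centered VAR(1) errors eps_t = rho * eps_{t-1} + eta_t whose
    stationary variance matches the independent case: Var = var."""
    eta_sd = np.sqrt((1 - rho**2) * var)   # rho=0.4 -> 0.84*var, rho=0.8 -> 0.36*var
    eps = np.zeros((T, J))
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + eta_sd * rng.standard_normal(J)
    return eps

eps_I   = make_errors(2000, 10, 0.0)   # type I: independent
eps_II  = make_errors(2000, 10, 0.4)   # type II: weak dependence
eps_III = make_errors(2000, 10, 0.8)   # type III: strong dependence
```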

The algorithm presented in (28) converges fast (tolerance: $10^{-3}$). The $\beta$'s are estimated by the group Lasso technique as in (7), with the tuning parameter $\lambda$ selected by the BIC-type criterion as in (8). After obtaining $\hat\beta$, we further model the stochastic process $Z_{0,t}$ by a VAR(1) model. We take the remaining variation $1 - R^2$ as a measure of the fitting performance, where
$$1 - R^2 = \frac{\sum_{t=1}^T\big\|Y_t^\top - (U_t^\top\hat\Gamma + \hat Z_{0,t}^\top)\hat A\hat\Psi_t\big\|_2^2}{\sum_{t=1}^T\big\|Y_t^\top - \sum_{t=1}^T\sum_{j=1}^J Y_{t,j}/(JT)\big\|_2^2} \qquad (25)$$
is the proportion of the variation not explained by the model among the total variation.
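The measure in (25) is a plain ratio of residual variation to total variation around the grand mean of all observations; a small helper, offered as an illustration rather than the exact implementation used in the study:

```python
import numpy as np

def remaining_variation(Y, Y_fit):
    """1 - R^2 as in (25): squared residual norm over the total
    variation around the grand mean of all observations Y_{t,j}."""
    resid = np.sum((Y - Y_fit) ** 2)
    total = np.sum((Y - Y.mean()) ** 2)
    return resid / total
```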

We repeat this experiment 100 times and present the averaged $1 - R^2$'s in Table 3. As we can see, as the dependence level (in $\varepsilon_t$) increases, the remaining variation slightly increases, resulting from worse estimates of $\beta$, but overall the fit is still relatively good.

           indep    weakly dep    strongly dep
1 − R²     5.30%       5.32%          5.40%

Table 3: Remaining variation of the GDSFM for the independent, weakly dependent and strongly dependent cases.

    6 Concluding Remarks

This article provides an integrated and yet flexible analysis for high dimensional nonstationary time series which reveal both complex trends and stochastic components. In applying a generalized dynamic semiparametric factor model, we employ a nonparametric series expansion for both the temporal and spatial parts. After choosing smoothed (nonparametric) functional principal components as the space basis and extracting temporal trends utilizing time basis function selection techniques, the estimate's properties are investigated under the dependent scenario together with a weakly cross-correlated error term. This is not built upon any specific forms of the time and space basis, which enables us to explore the interplay among the degree of time dependence, the high dimensionality and the moderate sample size (relative to dimensionality). The presented theory is an extension of current regularization techniques. We further justify statistical inference, e.g. estimation and classification, based on the detrended low dimensional stochastic process. The application to the analysis of the dynamic behavior of temperatures confirms its power.

    7 Appendix

In order to study the statistical properties of this estimator, it is useful to derive some optimality conditions for a solution of (7). Our implementation of the group Lasso-type estimator comes from Yuan and Lin (2006), which is an extension of the shooting algorithm of Fu (1998). As a direct consequence of the Karush-Kuhn-Tucker conditions, we have a necessary and sufficient condition for $\hat\beta$ to be a solution of (7):
$$T^{-1}\sum_{t=1}^T\{\Psi_t(Y_t - \Psi_t^\top\hat\beta U_t)U_t^\top\}_r = \lambda\frac{\hat\beta_r}{\|\hat\beta_r\|}, \quad \text{if } \hat\beta_r \neq 0, \qquad (26)$$
$$T^{-1}\Big\|\sum_{t=1}^T\{\Psi_t(Y_t - \Psi_t^\top\hat\beta U_t)U_t^\top\}_r\Big\| \le \lambda, \quad \text{if } \hat\beta_r = 0. \qquad (27)$$
Recall that $\Psi_t\Psi_t^\top = I_K$. It can be easily verified that the solution to (26) and (27) is
$$\hat\beta_r = \big(1 - \lambda/\|S_r\|\big)_+ S_r, \qquad (28)$$
where $S_r = \sum_{t=1}^T\{\Psi_t(Y_t - \Psi_t^\top\hat\beta_{-r}U_t)U_t^\top\}_r$ with $\hat\beta_{-r} = (\hat\beta_1, \ldots, \hat\beta_{r-1}, 0, \hat\beta_{r+1}, \ldots, \hat\beta_R)$. The solution to (7) can therefore be obtained by applying equation (28) to $r = 1, \ldots, R$ iteratively.
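The iterative application of (28) is a blockwise coordinate descent ("shooting"). A simplified analogue for a plain linear model with group-wise orthonormal regressors; all dimensions, the data and the sweep loop are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# y = sum_r X_r b_r + noise, with X_r^T X_r = I_K for each group r.
T, R, K, lam = 300, 8, 4, 0.5
Xs = [np.linalg.qr(rng.standard_normal((T, K)))[0] for _ in range(R)]
b_true = [rng.standard_normal(K) if r < 2 else np.zeros(K) for r in range(R)]
y = sum(X @ b for X, b in zip(Xs, b_true)) + 0.05 * rng.standard_normal(T)

b = [np.zeros(K) for _ in range(R)]
for _ in range(100):                        # shooting sweeps over r = 1, ..., R
    delta = 0.0
    for r in range(R):
        partial = y - sum(Xs[s] @ b[s] for s in range(R) if s != r)
        S_r = Xs[r].T @ partial             # analogue of S_r in (28)
        new = max(0.0, 1.0 - lam / np.linalg.norm(S_r)) * S_r
        delta = max(delta, np.abs(new - b[r]).max())
        b[r] = new
    if delta < 1e-3:                        # tolerance as used in Section 5
        break
```

Each group update is exactly the soft-thresholding step (28) applied to the partial residual, so groups with a weak partial-residual correlation are set to zero as a block.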

LEMMA 7.1 Consider the model (4). Assume that $\Psi_t\Psi_t^\top = I_K$, $T^{-1}\sum_{t=1}^T U_t^\top U_t/R = 1$, and $M(\beta^*) \le s$. If the random event
$$\mathcal{A} = \Big\{2T^{-1}\max_{1\le r\le R}\sum_{t=1}^T\sum_{k=1}^K\sum_{j=1}^J \Psi_{tkj}\varepsilon_{tj}U_{tr} \le \lambda\Big\} \qquad (29)$$
holds with high probability for some $\lambda > 0$, then for any solution $\hat\beta$ of problem (7) and $\forall\beta$, we have:
$$T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 + \lambda\|\hat\beta - \beta\|_{2,1} \le T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\beta - \beta^*)U_t\|^2 + 4\lambda\sum_{r\in R(\beta)}\|\hat\beta_r - \beta_r\|, \qquad (30)$$
$$T^{-1}\max_{1\le r\le R}\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}_r\Big\| \le 3\lambda/2, \qquad (31)$$
$$M(\hat\beta) \le \frac{4\phi_{\max}^2}{\lambda^2 T}\sum_{t=1}^T\|(\hat\beta - \beta^*)U_t\|_2^2, \qquad (32)$$
where $\phi_{\max}$ is the maximum eigenvalue of the matrix $\sum_{t=1}^T U_tU_t^\top/T$.

Proof of Lemma 7.1 The proof follows ideas similar to those of Lemma 3.1 in Lounici et al. (2009). By the definition of $\hat\beta$ as a minimizer of (7), $\forall\beta$ we have
$$T^{-1}\sum_{t=1}^T\|\Psi_t^\top\hat\beta U_t - Y_t\|^2 + 2\lambda\sum_{r=1}^R\|\hat\beta_r\| \le T^{-1}\sum_{t=1}^T\|\Psi_t^\top\beta U_t - Y_t\|^2 + 2\lambda\sum_{r=1}^R\|\beta_r\|, \qquad (33)$$

which, using $Y_t = \Psi_t^\top\beta^* U_t + \varepsilon_t$, is equivalent to
$$T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\beta - \beta^*)U_t\|^2 + 2T^{-1}\sum_{t=1}^T\varepsilon_t^\top\Psi_t^\top(\hat\beta - \beta)U_t + 2\lambda\sum_{r=1}^R(\|\beta_r\| - \|\hat\beta_r\|). \qquad (34)$$
By Hölder's inequality, we have
$$2T^{-1}\sum_{t=1}^T\varepsilon_t^\top\Psi_t^\top(\hat\beta - \beta)U_t \le 2T^{-1}\Big\|\sum_{t=1}^T\Psi_t\varepsilon_tU_t^\top\Big\|_{2,\infty}\|\hat\beta - \beta\|_{2,1}, \qquad (35)$$
where $\|\sum_{t=1}^T\Psi_t\varepsilon_tU_t^\top\|_{2,\infty} \le \max_{1\le r\le R}\sum_{t=1}^T\sum_{k=1}^K\sum_{j=1}^J\Psi_{tkj}\varepsilon_{tj}U_{tr}$. If the random event
$$\mathcal{A} = \Big\{2T^{-1}\max_{1\le r\le R}\sum_{t=1}^T\sum_{k=1}^K\sum_{j=1}^J\Psi_{tkj}\varepsilon_{tj}U_{tr} \le \lambda\Big\} \qquad (36)$$
holds with high probability for some $\lambda > 0$, which we will specify afterwards, then it follows from (34) and (35) that, on the event $\mathcal{A}$:

$$\begin{aligned}
T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 + \lambda\sum_{r=1}^R\|\hat\beta_r - \beta_r\|
&\le T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\beta - \beta^*)U_t\|^2 + 2\lambda\sum_{r=1}^R\big(\|\hat\beta_r - \beta_r\| + \|\beta_r\| - \|\hat\beta_r\|\big)\\
&\le T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\beta - \beta^*)U_t\|^2 + 2\lambda\sum_{r\in R(\beta)}\big(\|\hat\beta_r - \beta_r\| + \|\beta_r\| - \|\hat\beta_r\|\big)\\
&\quad + 2\lambda\sum_{r\in R^c(\beta)}\big(\|\hat\beta_r - \beta_r\| + \|\beta_r\| - \|\hat\beta_r\|\big)\\
&\le T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\beta - \beta^*)U_t\|^2 + 4\lambda\sum_{r\in R(\beta)}\|\hat\beta_r - \beta_r\|,
\end{aligned} \qquad (37)$$
which proves (30).

To prove (31), we use (26) and (27), which yield the inequality
$$T^{-1}\max_{1\le r\le R}\Big\|\sum_{t=1}^T\{\Psi_t(Y_t - \Psi_t^\top\hat\beta U_t)U_t^\top\}_r\Big\| \le \lambda. \qquad (38)$$
Then
$$T^{-1}\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}_r\Big\| \le T^{-1}\Big\|\sum_{t=1}^T\{\Psi_t(\Psi_t^\top\hat\beta U_t - Y_t)U_t^\top\}_r\Big\| + T^{-1}\Big\|\sum_{t=1}^T(\Psi_t\varepsilon_tU_t^\top)_r\Big\|, \qquad (39)$$
where we have used $Y_t = \Psi_t^\top\beta^* U_t + \varepsilon_t$ and the triangle inequality. The bound (31) then follows by combining (39) with (38) and using the definition of the event $\mathcal{A}$.

Finally, we show (32). First, observe that
$$\sum_{t=1}^T\Psi_t(Y_t - \Psi_t^\top\hat\beta U_t)U_t^\top = \sum_{t=1}^T\Psi_t\Psi_t^\top(\beta^* - \hat\beta)U_tU_t^\top + \sum_{t=1}^T\Psi_t\varepsilon_tU_t^\top.$$
On the event $\mathcal{A}$, utilizing (26) and the triangle inequality, we have:
$$T^{-1}\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}_r\Big\| \ge \lambda/2, \quad \text{if } \hat\beta_r \neq 0.$$
The following arguments yield the bound (32) on the number of nonzero rows of $\hat\beta$:
$$\begin{aligned}
M(\hat\beta) &\le \frac{4}{\lambda^2T^2}\sum_{r\in R(\hat\beta)}\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}_r\Big\|^2 \le \frac{4}{\lambda^2T^2}\sum_{r=1}^R\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}_r\Big\|^2\\
&= \frac{4}{\lambda^2T^2}\Big\|\sum_{t=1}^T\{\Psi_t\Psi_t^\top(\hat\beta - \beta^*)U_tU_t^\top\}\Big\|_2^2 \le \frac{4\phi_{\max}^2}{\lambda^2T}\sum_{t=1}^T\|(\hat\beta - \beta^*)U_t\|_2^2,
\end{aligned}$$
which uses the fact that $\Psi_t\Psi_t^\top = I_K$ and that $\phi_{\max}$ is the maximum eigenvalue of the matrix $\sum_{t=1}^T U_tU_t^\top/T$.

Proof of Theorem 3.1 We proceed along the lines of Theorem 6.2 in Bickel et al. (2009) and Theorem 3.1 in Lounici et al. (2009). Let $R = R(\beta^*) = \{r : \beta_r^* \neq 0\}$. By inequality (30) in Lemma 7.1 with $\beta = \beta^*$, we have, on the event $\mathcal{A}$ defined in (29):
$$T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le 4\lambda\sum_{r\in R}\|\hat\beta_r - \beta_r^*\| \le 4\lambda\sqrt{s}\,\|(\hat\beta - \beta^*)_R\|. \qquad (40)$$
Moreover, by the same inequality, on the event $\mathcal{A}$ we have $\sum_{r=1}^R\|\hat\beta_r - \beta_r^*\| \le 4\sum_{r\in R}\|\hat\beta_r - \beta_r^*\|$, which implies that $\sum_{r\in R^c}\|\hat\beta_r - \beta_r^*\| \le 3\sum_{r\in R}\|\hat\beta_r - \beta_r^*\|$. Thus, by Assumption 3.1 with $\Delta = \hat\beta - \beta^*$:
$$\|(\hat\beta - \beta^*)_R\| \le \sqrt{\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2}\,\Big/\big(\kappa\sqrt{T}\big). \qquad (41)$$
Now (10), i.e. $T^{-1}\sum_{t=1}^T\|\Psi_t^\top(\hat\beta - \beta^*)U_t\|^2 \le 16s\lambda^2\kappa^{-2}$, follows from (40) and (41). Inequality (11) follows by noting that
$$K^{-1/2}\sum_{r=1}^R\|\hat\beta_r - \beta_r^*\| \le 4K^{-1/2}\sum_{r\in R}\|\hat\beta_r - \beta_r^*\| \le 4K^{-1/2}\sqrt{s}\,\|(\hat\beta - \beta^*)_R\| \le 16s\lambda\kappa^{-2}K^{-1/2}, \qquad (42)$$
and then using (10). Inequality (12) follows from (32) and (10). $\square$

    Proof of Theorem 3.2 The proof of this theorem is similar to that of Theorem 3.1 up to the specification of the bound on P(A^c) in Lemma 7.1. We now consider the event

    A = { 2T^{-1} max_{1≤r≤R} Σ_{t=1}^T Σ_{k=1}^K Σ_{j=1}^J Ψ_{tkj} ε_{tj} U_{tr} ≤ λ }.

    Observe that:

    P(A^c) ≤ R P( Σ_{t=1}^T Σ_{k=1}^K Σ_{j=1}^J Ψ_{tkj} ε_{tj} U_{tr} K^{-1/2} > 2^{-1}λTK^{-1/2} ) =: R P( Σ_{t=1}^T V_t > 2^{-1}λTK^{-1/2} ).

    Since Assumption 3.2 holds, applying the Bernstein-type inequality for β-mixing random variables {V_t}_{t=1}^T (Theorem 4 of Doukhan (1994), p. 36) yields that for all ε > 0 and all 0 < q ≤ 1,

    P( Σ_{t=1}^T V_t > 2^{-1}λTK^{-1/2} ) ≤ 2 exp[ −3(1 − ε)(1 + ε²/4)λ²TK^{-1} / 4{6(1 + ε²/4)σ² + qC''λTK^{-1/2}} ] (=: T₁) + (1 + ε²/4) β([qTε²/(4 + ε²)] − 1) / q (=: T₂).

    To make T₁ ≤ R^{-(1+δ')}, δ' > 0, and T₂ ≤ R^{-(1+δ')}, we choose

    λ = { 16 log(R^{1+δ'}) K σ² / (T(1 − ε)) }^{1/2}, qC''λTK^{-1/2} = 6(1 + ε²/4)σ²,

    and β([qTε²/(4 + ε²)] − 1) ≤ qR^{-(1+δ')}/(1 + ε²/4) = (3/2)σ(1 − ε)^{1/2} {C''R^{1+δ'}(T log R^{1+δ'})^{1/2}}^{-1}, with qTε²/(4 + ε²) = (3/8)σε²T^{1/2}(1 − ε)^{1/2} C''^{-1}(log R^{1+δ'})^{-1/2}. Then we have

    P(A^c) ≤ R P( Σ_{t=1}^T V_t > 2^{-1}λTK^{-1/2} ) ≤ 3R^{-δ'}. □
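    For concreteness, the tuning can be evaluated numerically. The sketch below plugs hypothetical values of T, R, K, σ² and the constants δ', ε into the choice of λ above and computes the resulting tail bound 3R^{-δ'}:

```python
import math

# Hypothetical values for the sketch
T, R, K = 1000, 200, 10
sigma2, delta_p, eps = 1.0, 0.5, 0.1

# lambda = sqrt(16 log(R^{1+delta'}) K sigma^2 / (T(1 - eps)))
lam = math.sqrt(16 * (1 + delta_p) * math.log(R) * K * sigma2
                / (T * (1 - eps)))

# Resulting bound on P(A^c)
p_bound = 3 * R ** (-delta_p)
```

As expected for a group-Lasso tuning, λ grows only logarithmically in the number of candidate groups R.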

    Proof of Corollary 3.1 To prove this corollary, we need to show that Assumption 3.2 is satisfied, i.e. for an m-dependent process of order k, σ² in Assumption 3.2 is equal to 2kσ₀². For simplicity, we assume n = 1 and that m is divisible by 2k. Then

    E( Σ_{i=1}^m V_i )² = E( Σ_{i=1}^k V_i + Σ_{i=k+1}^{2k} V_i + … + Σ_{i=m−k+1}^m V_i )²

    = E( Σ_{j=0}^{m/2k−1} Σ_{i=2jk+1}^{2jk+k} V_i (=: C) + Σ_{j=0}^{m/2k−1} Σ_{i=2jk+k+1}^{2(j+1)k} V_i (=: D) )² ≤ 2EC² + 2ED².

    Since for j = 0, …, m/2k − 1 the block sums Σ_{i=2jk+1}^{2jk+k} V_i are independent of each other by the definition of V_t, and the same argument holds for Σ_{i=2jk+k+1}^{2(j+1)k} V_i, we have

    2EC² + 2ED² = 2 Σ_{j=0}^{m/2k−1} E( Σ_{i=2jk+1}^{2jk+k} V_i )² + 2 Σ_{j=0}^{m/2k−1} E( Σ_{i=2jk+k+1}^{2(j+1)k} V_i )² ≤ (m/k)k²σ₀² + (m/k)k²σ₀² = 2mkσ₀². □
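    A quick Monte Carlo illustration of this blocking bound (all sizes below are hypothetical): take V_i as a moving average of k consecutive iid innovations, so that V_i and V_j are independent whenever |i − j| ≥ k, and compare the empirical variance of Σ_{i=1}^m V_i with 2mkσ₀²:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n_rep = 4, 200, 5000               # hypothetical sizes

# k-dependent process: V_i averages k consecutive iid innovations,
# so V_i, V_j are independent whenever |i - j| >= k
e = rng.standard_normal((n_rep, m + k - 1))
V = np.lib.stride_tricks.sliding_window_view(e, k, axis=1).mean(axis=-1)

sigma0_sq = V.var()                      # estimate of sup_i E V_i^2 (= 1/k here)
emp_var = V.sum(axis=1).var()            # empirical Var(sum_{i=1}^m V_i)

# The blocking argument bounds Var(sum_i V_i) by 2 m k sigma_0^2
assert emp_var <= 2 * m * k * sigma0_sq
```

For this moving-average example the true variance of the sum is close to m, comfortably inside the bound 2mkσ₀² = 2m.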

    Proof of Theorem 3.3 Similarly to Ỹ̂_t^⊤ := Y_t^⊤ − U_t^⊤β̂Ψ_t, define Ỹ_t^⊤ := Y_t^⊤ − U_t^⊤β*Ψ_t with the corresponding estimate Z̃_{0,t}. Thus

    (1/T) Σ_{1≤t≤T} ‖Ẑ_{0,t}^⊤Â − Z_{0,t}^⊤A*‖² ≤ (1/T) Σ_{1≤t≤T} ‖Ẑ_{0,t}^⊤Â − Z̃_{0,t}^⊤Â‖² + (1/T) Σ_{1≤t≤T} ‖Z̃_{0,t}^⊤Â − Z_{0,t}^⊤A*‖²,

    where the second term is bounded by O_P(ρ² + δ²_K) by Theorem 2 of Park et al. (2009). For the first term, since

    Ẑ_{0,t} = (ÂΨ_tΨ_t^⊤Â^⊤)^{-1}ÂΨ_t Ỹ̂_t,

    Z̃_{0,t} = (ÂΨ_tΨ_t^⊤Â^⊤)^{-1}ÂΨ_t Ỹ_t,

    Z̃_{0,t} − Ẑ_{0,t} = (ÂΨ_tΨ_t^⊤Â^⊤)^{-1}ÂΨ_t {Ψ_t^⊤(β̂ − β*)U_t},

    Theorem 3.2 tells us that T^{-1} Σ_{t=1}^T ‖Ψ_t^⊤(β̂ − β*)U_t‖² is bounded by O(T^{-1}). From the definitions of ρ² and δ_K, we know that the first term is dominated by the second one. □
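    The third identity above is just linearity of the weighted least-squares map y ↦ (ÂΨ_tΨ_t^⊤Â^⊤)^{-1}ÂΨ_t y, which a short numerical sketch confirms (dimensions and matrices below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
L, K, J = 3, 6, 40                        # hypothetical dimensions
A = rng.standard_normal((L, K))           # plays the role of A_hat
Psi = rng.standard_normal((K, J))         # basis evaluations Psi_t

def z_estimate(y):
    """(A Psi Psi^T A^T)^{-1} A Psi y, the Z_{0,t}-type estimate."""
    G = A @ Psi @ Psi.T @ A.T
    return np.linalg.solve(G, A @ Psi @ y)

y = rng.standard_normal(J)
shift = rng.standard_normal(J)            # stands in for Psi_t^T (beta_hat - beta*) U_t

# The difference of the two estimates depends only on the shift
# between the two response vectors
diff = z_estimate(y + shift) - z_estimate(y)
assert np.allclose(diff, z_estimate(shift))
```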

    Proof of Theorem 3.4 The proof shares ideas with Park et al. (2009). We will prove the first equation of the theorem for h ≠ 0; the second equation follows from the first one. We start by proving that the matrix T^{-1} Σ_{t=1}^T Z_{0,t}Ẑ_{0,t}^⊤ is invertible. Suppose that the assertion is not true; then we can choose a random vector e such that ‖e‖ = 1 and e^⊤ Σ_{t=1}^T Z_{0,t}Ẑ_{0,t}^⊤ = 0. Note that

    ‖T^{-1} Σ_{t=1}^T Z_{0,t}Ẑ_{0,t}^⊤Â − T^{-1} Σ_{t=1}^T Z_{0,t}Z_{0,t}^⊤A*‖

    ≤ T^{-1} Σ_{t=1}^T ‖Z_{0,t}(Ẑ_{0,t}^⊤Â − Z_{0,t}^⊤A*)‖

    ≤ ( T^{-1} Σ_{t=1}^T ‖Z_{0,t}‖² )^{1/2} ( T^{-1} Σ_{t=1}^T ‖Ẑ_{0,t}^⊤Â − Z_{0,t}^⊤A*‖² )^{1/2} = O_P(ρ + δ_K), (43)

    because of Assumption (3.3.5) and Theorem 3.3. Thus with f = T^{-1} Σ_{t=1}^T Z_{0,t}Z_{0,t}^⊤ e, we obtain

    ‖f^⊤m‖ = ‖f^⊤(A*Ψ)‖ + O_P(δ_K)

    = ‖e^⊤ T^{-1} Σ_{t=1}^T Z_{0,t}Ẑ_{0,t}^⊤ÂΨ‖ + O_P(ρ + δ_K)

    = O_P(ρ + δ_K).

    This implies that m₁, …, m_L are linearly dependent, contradicting the construction that all space basis functions are independent.

    Notice that Z̃_{0,t} = B^⊤Ẑ_{0,t} and Ã = B^{-1}Â. With (43) this gives

    ‖Ã − A*‖ = ‖T^{-1} Σ_{t=1}^T Z_{0,t}Z_{0,t}^⊤(Ã − A*)‖ O_P(1)

    = ‖T^{-1} Σ_{t=1}^T Z_{0,t}Z̃_{0,t}^⊤Ã − T^{-1} Σ_{t=1}^T Z_{0,t}Z_{0,t}^⊤A*‖ O_P(1)

    = O_P(ρ + δ_K). (44)

    From Assumption (3.3.4), (44) and Theorem 3.3, we get

    T^{-1} Σ_{t=1}^T ‖Z̃_{0,t} − Z_{0,t}‖²

    = T^{-1} Σ_{t=1}^T ‖Z̃_{0,t}^⊤(m₁, …, m_L)^⊤ − Z_{0,t}^⊤(m₁, …, m_L)^⊤‖² O_P(1)

    = T^{-1} Σ_{t=1}^T ‖Z̃_{0,t}^⊤A* − Z̃_{0,t}^⊤Ã‖² O_P(1) + T^{-1} Σ_{t=1}^T ‖Z̃_{0,t}^⊤Ã − Z_{0,t}^⊤A*‖² O_P(1) + O_P(δ²_K)

    ≤ T^{-1} Σ_{t=1}^T ‖Z̃_{0,t} − Z_{0,t}‖² ‖Ã − A*‖² O_P(1) + T^{-1} Σ_{t=1}^T ‖Z_{0,t}‖² ‖Ã − A*‖² O_P(1) + T^{-1} Σ_{t=1}^T ‖Z̃_{0,t}^⊤Ã − Z_{0,t}^⊤A*‖² O_P(1) + O_P(δ²_K)

    = O_P(ρ² + δ²_K). (45)

    We will show that for h ≠ 0,

    T^{-1} Σ_{t=h+1}^T {(Z̃_{0,t+h} − Z_{0,t+h}) − (Z̃_{0,t} − Z_{0,t})} Z_{0,t}^⊤ = O_P(T^{-1/2}). (46)

    This implies the first statement of Theorem 3.4 because, by (45),

    T^{-1} Σ_{t=−h+1}^T (Z̃_{0,t} − Z_{0,t})(Z̃_{0,t+h} − Z_{0,t+h}) = O_P(b²) = O_P(T^{-1/2}).

    To prove (46), define

    S̃_{t,Z} = J^{-1} Σ_{j=1}^J ÃΨ(X_{t,j})Ψ(X_{t,j})^⊤Ã^⊤,

    S_{t,Z} = A* E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} A*^⊤,

    S̃_α = (JT)^{-1} Σ_{t=1}^T Σ_{j=1}^J {Ψ(X_{t,j}) ⊗ Z̃_{0,t}}{Ψ(X_{t,j}) ⊗ Z̃_{0,t}}^⊤,

    S_α = T^{-1} Σ_{t=1}^T E[ {Ψ(X_{t,j}) ⊗ Z_{0,t}}{Ψ(X_{t,j}) ⊗ Z_{0,t}}^⊤ | Z_{0,t} ],

    S = J^{-1} A* [ Ψ(X_{t,j})Ψ(X_{t,j})^⊤ e − E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤ e} ],

    where e ∈ R^K with ‖e‖ = 1. Let ã be the stacked form of Ã. It can be verified that

    Z̃_{0,t} = S̃_{t,Z}^{-1} J^{-1} Σ_{j=1}^J {Y_{t,j} ÃΨ(X_{t,j})}, (47)

    ã = S̃_α^{-1} (JT)^{-1} Σ_{t=1}^T Σ_{j=1}^J {Ψ(X_{t,j}) ⊗ Z̃_{0,t}} Y_{t,j}. (48)

    Let γ = T^{-1/2}/b. We argue that

    sup_{1≤t≤T} ‖S̃_{t,Z} − S_{t,Z}‖ = O_P(γ), ‖S̃_α − S_α‖ = O_P(γ). (49)

    We will show the first part of (49); the second part can be shown analogously. Since

    ÃΨ_tΨ_t^⊤Ã^⊤ = (Ã − A* + A*)(Ψ_tΨ_t^⊤ − EΨ_tΨ_t^⊤ + EΨ_tΨ_t^⊤)(Ã − A* + A*)^⊤,

    in order to prove the first part, it suffices to show that, uniformly for 1 ≤ t ≤ T,

    J^{-1} Σ_{j=1}^J A* [ Ψ(X_{t,j})Ψ(X_{t,j})^⊤ − E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} ] (Ã − A*)^⊤ = O_P(γ), (50)

    J^{-1} Σ_{j=1}^J (Ã − A*) [ Ψ(X_{t,j})Ψ(X_{t,j})^⊤ − E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} ] (Ã − A*)^⊤ = O_P(γ), (51)

    J^{-1} Σ_{j=1}^J A* [ Ψ(X_{t,j})Ψ(X_{t,j})^⊤ − E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} ] A*^⊤ = O_P(γ), (52)

    J^{-1} Σ_{j=1}^J A* E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} (Ã − A*)^⊤ = O_P(γ), (53)

    J^{-1} Σ_{j=1}^J (Ã − A*) E{Ψ(X_{t,j})Ψ(X_{t,j})^⊤} (Ã − A*)^⊤ = O_P(γ). (54)

    The proofs of (50)-(52) follow by simple arguments. We now show (53); claim (54) can be shown similarly. For the proof of (53), we use Bernstein's inequality for the following sum:

    P( | Σ_{j=1}^J W_j | > x ) ≤ 2 exp( −(1/2) x² / (V + Mx/3) ). (55)

    Here, for a value of t with 1 ≤ t ≤ T, the random variable W_j is an element of the L × 1 matrix S defined above. In (55), V is an upper bound for the variance of Σ_{j=1}^J W_j, and M is a bound for the absolute values of W_j, i.e. |W_j| ≤ M for 1 ≤ j ≤ J, a.s. With some constants C₁ and C₂ that do not depend on t and the row number, we get V ≤ C₁J^{-1} and M ≤ C₂K^{1/2}J^{-1}. Application of Bernstein's inequality gives that, uniformly for 1 ≤ t ≤ T and e ∈ R^K with ‖e‖ = 1, all L elements of S are of order O_P(γ). This completes the proof of claim (53).
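    Inequality (55) can be sanity-checked by simulation with bounded iid summands (the uniform distribution below is a hypothetical choice; only the variance bound V and the a.s. bound M enter the inequality):

```python
import numpy as np

rng = np.random.default_rng(0)
J, n_rep = 50, 100000                    # hypothetical sizes

# Bounded, centered summands W_j ~ Uniform(-1/2, 1/2)
W = rng.uniform(-0.5, 0.5, size=(n_rep, J))
sums = W.sum(axis=1)

V = J / 12                               # Var(sum_j W_j) = J * 1/12
M = 0.5                                  # |W_j| <= 1/2 a.s.

for x in (1.0, 2.0, 3.0):
    emp_tail = (np.abs(sums) > x).mean()
    bernstein = 2 * np.exp(-0.5 * x ** 2 / (V + M * x / 3))
    # The empirical tail probability sits below the Bernstein bound
    assert emp_tail <= bernstein
```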

    From (44), (45), (47), (48) and (49), it follows that, uniformly for 1 ≤ t ≤ T,

    Z̃_{0,t} − Z_{0,t} = S_{t,Z}^{-1} J^{-1} Σ_{j=1}^J ε'_{t,j} A*Ψ(X_{t,j}) + S_{t,Z}^{-1} J^{-1} Σ_{j=1}^J ε'_{t,j} (Ã − A*)Ψ(X_{t,j}) + O_P(T^{-1/2}) (56)

    =: Δ_{t,1,Z} + Δ_{t,2,Z} + O_P(T^{-1/2}).

    To prove the theorem, it remains to show that for j = 1, 2,

    T^{-1} Σ_{t=−h+1}^T (Δ_{t+h,j,Z} − Δ_{t,j,Z}) Z_{0,t}^⊤ = O_P(T^{-1/2}). (57)

    This can easily be checked for j = 1. For j = 2, it follows from ‖Ã − A*‖ = O_P(ρ + δ_K) and

    E{ ‖(JT)^{-1} Σ_{t=1}^T Σ_{j=1}^J ε'_{t,j} S_{t,Z}^{-1} M Ψ(X_{t,j})‖² } = O(K(JT)^{-1}),

    for any L × K matrix M with ‖M‖ = 1. □

    References

    Benth, F. and Benth, J. (2005). Stochastic modelling of temperature variations with a view towards

    weather derivatives. Applied Mathematical Finance, 12(1):53–85.

    Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732.

    Bowsher, C. G. and Meeks, R. (2006). High dimensional yield curves: Models and forecasting.

    Technical report.

    Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory

    and Applications. Heidelberg: Springer Verlag.

    Cai, Z., Li, Q., and Park, J. Y. (2009). Functional-coefficient models for nonstationary time series

    data. Journal of Econometrics, 148(2):101–113.


    Campbell, J. Y. and Yogo, M. (2006). Efficient tests of stock return predictability. Journal of Financial

    Economics, 81(1):27–60.

    Christiano, L., Eichenbaum, M., and Evans, C. (1999). Monetary policy shocks: What have we

    learned and to what end? Handbook of Macroeconomics, 1(1):65–148.

    Diebold, F. X. and Li, C. (2006). Forecasting the term structure of government bond yields. Journal

    of Econometrics, 130:337–364.

    Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika,

    81(3):425–455.

    Doukhan, P. (1994). Mixing: Properties and Examples. Heidelberg: Springer Verlag.

    Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle

    properties. Journal of the American Statistical Association, 96(456):1348–1360.

    Fengler, M. R., Härdle, W., and Mammen, E. (2007). A semiparametric factor model for implied

    volatility surface dynamics. Journal of Financial Econometrics, 5(2):189–218.

    Forni, M., Hallin, M., Lippi, M., and Reichlin, L. (2000). The generalized dynamic-factor model:

    Identification and estimation. The Review of Economics and Statistics, 82(4):540–554.

    Forni, M., Hallin, M., Lippi, M., and Reichlin, L. (2005). The generalized dynamic factor model:

    One-sided estimation and forecasting. Journal of the American Statistical Association, 100:830–

    840.

    Fu, W. J. (1998). Penalized regressions: The bridge versus the Lasso. Journal of Computational and

    Graphical Statistics, 7(3):397–416.

    Geweke, J. (1977). The dynamic factor analysis of economic time series. Latent Variables in

    Socio-Economic Models, eds. D. J. Aigner and A. S. Goldberger, Amsterdam: North-Holland, pages

    365–383.

    Giannone, D., Reichlin, L., and Sala, L. (2005). Monetary policy in real time. In NBER

    Macroeconomics Annual 2004, Volume 19, NBER Chapters, pages 161–224. National Bureau of

    Economic Research, Inc.


    Gleick, P. H., et al. (2010). Climate change and the integrity of science. Science, 328:689–691.

    Hall, A. and Hautsch, N. (2006). Order aggressiveness and order book dynamics. Empirical Economics,

    30(4):973–1005.

    Hall, P., Müller, H. G., and Wang, J. L. (2006). Properties of principal component methods for

    functional and longitudinal data analysis. Annals of Statistics, 34(3):1493–1517.

    Hallin, M. and Liska, R. (2007). Determining the number of factors in the general dynamic factor

    model. Journal of the American Statistical Association, 102:603–617.

    Hedin, A. E. (1991). Extension of the MSIS thermosphere model into the middle and lower atmosphere.

    Journal of Geophysical Research, 96:1159–1172.

    Horowitz, J. and Huang, J. (2012). Penalized estimation of high-dimensional models under a generalized sparsity condition. CeMMAP working papers CWP17/12, Centre for Microdata Methods and

    Practice, Institute for Fiscal Studies.

    Horowitz, J., Klemelä, J., and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli, 12(2):271–298.

    Horowitz, J. L. (2006). Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica, 74(2):521–538.

    Horowitz, J. L. and Lee, S. (2005). Nonparametric estimation of an additive quantile regression model.

    Journal of the American Statistical Association, 100(472):1238–1249.

    Huang, J., Horowitz, J. L., and Wei, F. (2010). Variable selection in nonparametric additive models.

    Annals of Statistics, 38(4):2282–2313.

    Janson, S. (2004). Large deviations for sums of partly dependent random variables. Random

    Structures Algorithms, 24(3):234–248.

    Lee, R. D. and Carter, L. (1992). Modeling and forecasting the time series of U.S. mortality. Journal

    of the American Statistical Association, 87(419):659–671.

    Leng, C., Lin, Y., and Wahba, G. (2004). A note on the lasso and related procedures in model selection. Statistica Sinica, 16:1273–1284.


    Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

    estimators. Electronic Journal of Statistics, 2:90–102.

    Lounici, K., Pontil, M., Tsybakov, A. B., and van de Geer, S. (2009). Taking advantage of sparsity

    in multi-task learning. Proceedings of Conference on Learning Theory (COLT) 2009.

    Mohr, P. N. C., Biele, G., Krugel, L. K., Li, S.-C., and Heekeren, H. R. (2010). Neural foundations

    of risk-return trade-off in investment decisions. NeuroImage, 49(3):2556–2563.

    Myšičková, A., Song, S., Mohr, P. N., Heekeren, H. R., and Härdle, W. K. (2013). Risk patterns and

    correlated brain activities. Forthcoming in Psychometrika.

    Nelson, C. R. and Siegel, A. F. (1987). Parsimonious modeling of yield curves. Journal of Business,

    60:473–489.

    Odening, M., Berg, E., and Turvey, C. (2008). Management of climate risk in agriculture. Special

    Issue of the Agricultural Finance Review, 68(1):83–97.

    Park, B. U., Mammen, E., Härdle, W., and Borak, S. (2009). Time series modelling with semiparametric factor dynamics. Journal of the American Statistical Association, 104(485):284–298.

    Parton, W. J. and Logan, J. A. (1981). A model for diurnal variation in soil and air temperature.

    Agricultural Meteorology, 23:205–216.

    Racsko, P., Szeidl, L., and Semenov, M. (1991). A serial approach to local stochastic weather models.

    Ecological Modelling, 57(1-2):27–41.

    Ramsay, J. and Silverman, B. (2005). Functional Data Analysis, 2nd Edition. Springer.

    Sargent, T. J. and Sims, C. A. (1977). Business cycle modeling without pretending to have too much

    a priori economic theory. Working Papers 55, Federal Reserve Bank of Minneapolis.

    Song, S. and Bickel, P. (2011). Large vector auto regressions. submitted.

    Stock, J. H. and Watson, M. W. (2002a). Forecasting using principal components from a large number

    of predictors. Journal of the American Statistical Association, 97:1167–1179.

    Stock, J. H. and Watson, M. W. (2002b). Macroeconomic forecasting using diffusion indexes. Journal

    of Business & Economic Statistics, 20(2):147–162.


    Stock, J. H. and Watson, M. W. (2005a). An empirical comparison of methods for forecasting using

    many predictors. Manuscript, Princeton University.

    Stock, J. H. and Watson, M. W. (2005b). Implications of dynamic factor models for VAR analysis. NBER Working Papers 11467, National Bureau of Economic Research, Inc. Available at

    http://ideas.repec.org/p/nbr/nberwo/11467.html.

    Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical

    Society, Series B, 58(1):267–288.

    Wang, H. and Leng, C. (2008). A note on adaptive group lasso. Computational Statistics & Data

    Analysis, 52(12):5277 – 5286.

    Wang, H., Li, G., and Tsai, C.-L. (2007a). Regression coefficient and autoregressive order shrinkage

    and selection via the lasso. Journal of The Royal Statistical Society Series B, 69(1):63–78.

    Wang, H., Li, R., and Tsai, C.-L. (2007b). Tuning parameter selectors for the smoothly clipped

    absolute deviation method. Biometrika, 94(3):553–568.

    Wang, Q. and Phillips, P. C. (2009a). Asymptotic theory for local time density estimation and

    nonparametric cointegrating regression. Econometric Theory, 25(03):710–738.

    Wang, Q. and Phillips, P. C. (2009b). Structural nonparametric cointegrating regression.

    Econometrica, 77(6):1901–1948.

    Worsley, K., Liao, C., Aston, J., Petre, V., Duncan, G., Morales, F., and Evans, A. (2002). A general statistical analysis for fMRI data. NeuroImage, 15:1–15.

    Xiao, Z. (2009). Functional-coefficient cointegration models. Journal of Econometrics, 152(2):81–92.

    Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables.

    Journal of the Royal Statistical Society, Series B, 68(1):49–67.
