High-dimensional predictive regression in the presence of cointegration*
Bonsoo Koo Heather Anderson
Monash University Monash University
Myung Hwan Seo† Wenying Yao
Seoul National University Deakin University
Abstract
We propose a Least Absolute Shrinkage and Selection Operator (LASSO) estimator of a
predictive regression in which stock returns are conditioned on a large set of lagged co-
variates, some of which are highly persistent and potentially cointegrated. We establish
the asymptotic properties of the proposed LASSO estimator and validate our theoreti-
cal findings using simulation studies. The application of this proposed LASSO approach
to forecasting stock returns suggests that a cointegrating relationship among the persis-
tent predictors leads to a significant improvement in the prediction of stock returns over
various competing forecasting methods with respect to mean squared error.
Keywords: Cointegration, High-dimensional predictive regression, LASSO, Return pre-
dictability.
JEL: C13, C22, G12, G17
*The authors acknowledge financial support from the Australian Research Council Discovery Grant DP150104292.
†Corresponding author. His work was supported by the Promising-Pioneering Researcher Program through Seoul National University (SNU) and from the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-0405-20180026). Author contact details are [email protected], [email protected], [email protected], [email protected]
1 Introduction
Return predictability has been of perennial interest to financial practitioners and scholars
within the economics and finance professions since the early work of Dow (1920). To this
end, empirical studies employ linear predictive regression models, as illustrated by Stam-
baugh (1999), Binsbergen et al. (2010) and the references therein. Predictive regression allows
forecasters to focus on the prediction of asset returns conditioned on a set of one-period
lagged predictors. In a typical example of a predictive regression, the dependent variable is
the rate of return on stocks, bonds or the spot rate of exchange, and the predictors usually
consist of financial ratios and macroeconomic variables.
The literature that has investigated the predictability of stock returns is very large, and
it has proposed many economically plausible predictive variables and documented their em-
pirical forecast performance. Koijen and Nieuwerburgh (2011) and Rapach and Zhou (2013)
provide recent surveys of this literature. However, despite the plethora of possible predictors
that have been suggested, there is no clear-cut evidence as to whether any of these variables
are successful predictors. Indeed, many studies have concluded that there is very little dif-
ference between predictions derived from a predictive regression and the historical mean of
stock returns. Welch and Goyal (2008) provided a detailed study of stock return predictability
based on many predictors suggested in the literature, and concluded that "a healthy skepticism
is appropriate when it comes to predicting the equity premium, at least as of 2006".
Least squares predictive regression has provided the main estimation tool for empirical
efforts to forecast stock returns, but this involves several econometric issues. See Phillips
(2015) for a summary of these issues, together with relevant econometric theory. Among
these issues, we restrict our attention to two specific problems, which we believe could be
tackled by using a different estimation approach. Firstly, the martingale difference features of
excess returns are not readily compatible with the high persistence in many of the predictors
suggested by the empirical literature. It is well-documented that excess returns are stationary
and that financial ratios are often highly persistent and potentially non-stationary processes.
The regression of a stationary variable on a non-stationary variable provides an example of
an unbalanced regression, and such a regression is difficult to reconcile with the assumptions
required for conventional estimation.1 This so-called unbalanced regression has the potential
to induce substantial small sample estimation bias, and it also renders invalid many of the usual
types of statistical inference conducted in standard regression contexts. See Campbell
and Yogo (2006), Phillips (2014) and Phillips (2015) for more details.

1Although the issues associated with unbalanced regression can be tackled by incorporating an accommodating error structure, such as (fractionally) integrated noise, this error structure is not readily compatible with making predictions.
Secondly, the literature does not provide guidance on which predictors are to be included
or excluded from predictive regressions. Given a large pool of covariates available for return
predictability, selecting an appropriate subset of relevant covariates can improve the precision
of prediction. However, the choice of the right predictors is extremely difficult in practice,
and precision is often sacrificed for parsimony, especially if there are many predictors but
the sample size used for estimation is small. Quite commonly, a single predictor or only a
handful of predictors is used.
Against this backdrop, our paper studies the use of the highly celebrated LASSO (orig-
inally proposed by Tibshirani, 1996) to estimate a predictive regression when some of the
predictors are persistent and it is possible that some of these persistent predictors are cointe-
grated. We are particularly interested in (i) the LASSO’s capability in model selection in this
context; and (ii) the properties of LASSO estimation when potentially cointegrated predictors
are involved, especially if such properties can circumvent issues associated with unbalanced
regression. As will become clear in our empirical study, there is at least one cointegrating
relationship between persistent predictors for stock returns, which lends strong support to
an estimation approach that allows for cointegration. In a setting in which the target vari-
able is stationary and there are many persistent predictors, the model selection
mechanism of the LASSO, which eliminates variables with small coefficients, plays a role with re-
spect to excluding persistent variables that do not contribute to a cointegrating relationship.
This leads to "balance" between the time series properties of the target variable and a linear
combination of the regressors, and tractable prediction errors. It is also possible that the in-
corporation of cointegration in this context of predicting returns will improve forecasts, as
Christoffersen and Diebold (1998) have noted in other settings.
Although predictive regression enables us to analyze the explanatory power of each indi-
vidual predictor, our primary objective is to improve the overall prediction of stock returns
given a set of predictors. To do this, we focus on the out-of-sample forecasting performance
of the set of predictors selected via LASSO estimation. In addition, we investigate the large
sample properties of the LASSO estimation of the linear predictive regression. We show the
consistency of the LASSO estimator of coefficients on both stationary and nonstationary pre-
dictors within the linear predictive regression framework. Furthermore, in the presence of
cointegration, we derive the limiting distribution of the cointegrating vector under certain
regularity conditions, in contrast with the usual LASSO estimation, in which case the limit-
ing distribution of the estimator is not readily available. To the best of our knowledge, this
paper is the first work that establishes the limiting distribution for the LASSO estimator of a
cointegrating vector.
The remainder of this paper is organized as follows. We briefly discuss related literature
in Section 2. Section 3 introduces predictive regression models, and Section 4 discusses the
LASSO selection of relevant predictors within the model framework. In particular, Section 4.1
demonstrates the consistency of the proposed LASSO estimators, and Section 4.2 derives the
limiting distribution of the LASSO estimator of a cointegrating vector. Simulation results are
discussed in Section 5. Also, a real application of our methodology based on the Welch and
Goyal (2008) data set is given in Section 6. Section 7 concludes. All proofs of theorems and
lemmas are found in Appendix A.
2 Relevant literature
Most of the literature on forecasting stock returns has focused on whether a certain variable
has predictive power. The present-value relationship between stock prices and their cash
flows suggests a wide array of financial ratios and other financial variables as possible pre-
dictors. Documented examples include the earnings-price ratio (Campbell and Shiller, 1988),
the yield spread (Fama and Schwert, 1977; Campbell, 1987), the book-to-market ratio (Fama
and French, 1993; Pontiff and Schall, 1998), the dividend-payout ratio (Lamont, 1998) and
the short interest rate (Rapach et al., 2016). Various macroeconomic variables such as the
consumption-wealth ratio (Lettau and Ludvigson, 2001) and the investment rate (Cochrane,
1991) have been considered as well. More recently, Neely et al. (2014) have suggested the use
of technical indicators.
An issue that arises when returns are conditioned on just one predictor is that prediction
can be biased if other relevant predictors are omitted. This issue might partially explain
the failure of single predictor models, although the use of several predictors in a standard
regression set-up has rarely improved out-of-sample forecasts. Welch and Goyal (2008) noted
that despite the poor forecast performance of individual predictors over long periods of time,
there are episodes during which an individual predictor has power. The latter authors used
their findings to claim that most (return forecasting) models are unstable.
Recent research on forecasting stock returns has moved away from single predictor mod-
els, placing more emphasis on using the large pool of predictors suggested in the literature
and drawing on modern econometric forecasting techniques. This research comes with an ac-
knowledgment that many predictors are potentially useful in forecasting stock returns in one
way or another, but the high dimension of the set of predictors can pose a significant challenge
due to the degrees-of-freedom problem (Ludvigson and Ng, 2007), and the temporal insta-
bility of the predictive power of any single predictor renders forecasting extremely difficult
(Rossi, 2013). Forecast combination methods (see Timmermann, 2006) are now popular, and
have been used by Rapach et al. (2010) to draw on information from a large set of predictors
while alleviating the effects of structural instability. Other forecasts that rely on a multivariate
approach include the dynamic factor approach (Ludvigson and Ng, 2007), Bayesian averag-
ing (Cremers, 2002), and use of a system of equations involving vector autoregressive models
(Pastor and Stambaugh, 2009).
Our LASSO approach in the presence of cointegration is an addition to this strand of lit-
erature that uses multiple predictors. Since the seminal paper of Tibshirani (1996), LASSO
estimation has gained popularity in various fields of study. See Fan and Lv (2010) and Tibshi-
rani (2011) for recent developments. The wide variety of predictors available for forecasting
stock returns necessitates a wise selection of predictors that capture the quintessential dy-
namics of future returns. Moreover, the unstable nature of their prediction power at different
time periods renders covariate selection even more crucial.
A few studies have investigated the use of LASSO in the presence of non-stationary vari-
ables. Caner and Knight (2013) proposed a unit root test involving Bridge estimators in order
to differentiate unit root from stationary alternatives. Liao and Phillips (2015) applied the
adaptive LASSO approach to a cointegrated system in order to estimate a vector error correction model (VECM) with an unknown cointegrating rank and unknown transient lag order.
Further, Wilms and Croux (2016) have explored the forecasting ability of a VECM using an approach based on an $\ell_1$-penalized log-likelihood. Quite recently, Kock (2016) further discussed
the oracle efficiency of the adaptive LASSO in stationary and non-stationary autoregressions.
These papers differ from ours in many aspects. First of all, instead of using autoregressive
models, our paper focuses on predictive regressions in which stationary and non-stationary
predictors co-exist. Secondly, we allow for the presence of cointegration and have a different
objective from the other studies in the literature, which leads to a different approach.
Although LASSO has been widely employed in various fields of study, the financial literature has not made extensive use of this powerful model selection and estimation tool to
address stock return predictability. Only two recent papers have emerged. Buncic and Tis-
chhauser (2016) and Li and Tsiakas (2016) suggest that LASSO contributes to better forecasts
of stock returns because it enables researchers to select the most relevant factors among a
broad set of predictors. Both papers are closely related to the work of Ludvigson and Ng
(2007) and Neely et al. (2014). However, each falls short in explaining the theoretical ground
that underlies their empirical results, and neither takes the impact of persistent regressors on
the predictions into account. Moreover, both Buncic and Tischhauser (2016) and Li and Tsi-
akas (2016) impose sign constraints on their estimated coefficients to improve their forecasts.
In contrast, we do not require any constraint on the sign of coefficients, and we explicitly ac-
count for persistence in some of our predictors. The main difference of our work is that we
provide theoretical credence to the use of the LASSO in a linear predictive setting
when persistent predictors are taken into consideration.
A fundamental problem associated with forecasting stock returns is model uncertainty,
with respect to which predictors are relevant, and when. Timmermann (2006) pointed out
that there are two schools of thought on how to improve forecasting when the underlying
model is uncertain. The first suggests the use of forecast combinations, which has been
actively investigated (see Elliott et al., 2013; Koo and Seo, 2015, among many others), and
used by Rapach et al. (2010) as mentioned above. The other school of thought considers
the entire set of possible predictors in a unified framework, allowing the data to select the
best combination of predictors in an attempt to search for the best forecasting model. Most
literature based on $\ell_1$-regularization, including the LASSO, falls under this second approach.2
The LASSO does not explicitly address the unstable nature of stock return predictability,
although it is possible to apply the automated selection of the LASSO at different points in time
by rolling an estimation window through the sample period, allowing the LASSO to
pick different predictors for each estimation window and to provide different coefficient
estimates for each set of selected predictors. We adopt this approach in our empirical
analysis of stock return predictability in Section 6.
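Concretely, the rolling-window procedure can be sketched as follows. This is a minimal illustration using scikit-learn's generic `Lasso` (not the authors' implementation); the window length, penalty level `alpha`, and variable names are arbitrary choices for exposition:

```python
import numpy as np
from sklearn.linear_model import Lasso

def rolling_lasso_forecasts(y, X, window, alpha=0.1):
    """One-step-ahead forecasts from a rolling estimation window.

    X is assumed to hold the (already lagged) predictors, so row t of X
    is what is known when forecasting y[t]. The LASSO is refitted at each
    forecast origin, so both the selected predictors and their estimated
    coefficients may change from window to window."""
    preds = []
    for t in range(window, len(y)):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[t - window:t], y[t - window:t])  # rolling estimation window
        preds.append(model.predict(X[t:t + 1])[0])   # forecast for period t
    return np.array(preds)

# tiny synthetic demonstration
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((120, 8))
y_demo = 0.5 * X_demo[:, 0] + 0.1 * rng.standard_normal(120)
forecasts = rolling_lasso_forecasts(y_demo, X_demo, window=60, alpha=0.05)
```

In an empirical application the columns of `X` would be the lagged financial and macroeconomic predictors, and the tuning parameter would typically be re-chosen within each window.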
2The former approach pools forecasts whereas the latter approach pools information. See Timmermann (2006) for the pros and cons of these two different approaches.
3 Econometric framework: Predictive regression
Our econometric framework is a linear predictive regression in which a large set of condition-
ing variables is available, but the prediction power of each variable could be unstable. Some
of the conditioning variables could be non-stationary, in particular integrated of order 1, i.e.
I(1), and might be cointegrated. Our model is framed in the context of stock return
predictability because forecasting stock returns epitomizes the issues we are attempting to
address: (i) a possibly large set of conditioning variables; (ii) some of them are quite persis-
tent; and (iii) the prediction power of each conditioning variable is unstable. Let us consider
the following linear predictive regression model:
$$ y_t = x_{t-1,T}'\, b_T + z_{t-1}'\, a_T + u_t, $$
where $b_T \in \mathbb{R}^k$ for $k \to \infty$ with $x_{t,T}$ being I(1), and $a_T \in \mathbb{R}^p$ for $p \to \infty$ with $z_t$ being I(0). I($q$) denotes an integrated process of order $q$. The above array structure allows the true values of the coefficients and the number of covariates to vary as the sample size $T$ changes. In what follows, however, we shall use the simpler notation that dispenses with the $T$ subscript,
$$ y_t = x_{t-1}'\, b + z_{t-1}'\, a + u_t. \qquad (1) $$
The high persistence of $x_t$ is compatible with the property of many commonly used stock
return predictors, such as the dividend-price ratio and the earnings-price ratio.3 We use $z_t$ in
model (1) to represent the other group of non-persistent return predictors, for example the
long term bond rate and inflation. See Neely et al. (2014) and Ludvigson and Ng (2007) for a
pool of plausible financial and macroeconomic predictors.
Model (1) is in line with the most recent literature on stock return predictability. It is a
multi-factor model employed in the empirical asset pricing literature. The primary objective
is to estimate the conditional mean of $y_t$, given a mixture of possibly high-dimensional stationary and non-stationary conditioning variables available at time $t-1$. In the context of the
present paper, the prediction objective yt is the rate of return on a broad market index over
the risk free rate, commonly referred to as the equity premium. It is commonly accepted that
the equity premium displays martingale difference features, whereas many return predictors
are quite persistent. The stationarity feature of the martingale difference sequence on the left-hand side of model (1) hardly matches with the highly persistent predictors on the right-hand side. Furthermore, a close examination of the historical data of various financial ratios reveals a strong co-movement among these persistent variables, which lends strong credence to the existence of at least one cointegrating relationship among them.

3Welch and Goyal (2008) showed that many other return predictors, including the treasury bill rate, the long term bond yield and the term spread, are also highly persistent, with a first order autocorrelation above 0.99.
Without any loss of generality in the forecasting context, we assume that there is at least
one cointegrating relationship within $x_t$. Specifically, the cointegrating vector $d$ and the corresponding constant $c$ are such that $x_t' b = x_t' d\, c$.4 For identification when it comes to estimating the true parameters $d^0$ and $c^0$, we define $b^0 = c^0 d^0$, where $d^0$ is the standardized cointegrating relation of unit $\ell_1$ norm and $c^0$ is the response of the outcome variable to the cointegrated term. $b^0 = 0$ if and only if $c^0 = 0$, under which $d^0$ is not identified.
4 LASSO selection of predictors
This section examines the automated selection of predictors given many covariates whose
cardinality can be larger than the sample size but sparsity is assumed. In particular, we
propose LASSO estimation in which an $\ell_1$-type penalty is applied. Note that in matrix form,
model (1) can be written as
$$ Y = Xb + Za + u. \qquad (2) $$
The LASSO estimator for $(b', a')'$ can be obtained by
$$ (\hat b', \hat a')' = \arg\min_{b,\, a}\; \frac{\|Y - Xb - Za\|_2^2}{T} + \lambda\left(\|b\|_1 + \|a\|_1\right), \qquad (3) $$
where $\|\cdot\|_m$ denotes an $\ell_m$ norm, $\lambda$ is a tuning parameter, and $T$ is the sample size.
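To make the objective concrete, here is a minimal numpy sketch (not the authors' code) of a coordinate-descent solver for (3); the function names and tolerance settings are illustrative, and the columns of $X$ and $Z$ are stacked into a single design matrix since both coefficient blocks receive the same $\ell_1$ penalty:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(V, y, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for (1/T)||y - V beta||_2^2 + lam * ||beta||_1,
    matching the scaling of the objective in equation (3); V stacks the
    columns of X and Z, beta stacks b and a."""
    T, p = V.shape
    beta = np.zeros(p)
    curv = (V ** 2).sum(axis=0) / T        # per-coordinate curvature v_j'v_j / T
    r = y.copy()                            # current residual y - V beta
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r += V[:, j] * beta[j]          # put coordinate j back into the residual
            c = V[:, j] @ r / T
            beta[j] = soft_threshold(c, lam / 2.0) / curv[j]
            r -= V[:, j] * beta[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

On an orthonormal design with $V'V/T = I$ this reduces to soft-thresholding the least-squares coefficients at $\lambda/2$, which provides a quick correctness check on the implementation.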
Equation (3) describes the usual LASSO objective function, which might, but need not, incorporate a cointegrating relationship. We reparameterize the first portion of this objective
to investigate how the presence of a cointegrating relationship affects the LASSO estimation
4The interpretation of $d$ is subtle in the case of multiple cointegrating relationships. In such a case, $d$ might be the optimal linear combination of all the cointegrating vectors for predicting the outcome. Importantly, this paper does not seek to identify the number of cointegrating relationships nor to estimate each cointegrating vector separately. For these, one needs to set up a cointegration system, for which we refer to Johansen (1995).
approach. With $c = \|b\|_1$, $d = b/c$, and the true parameters $b^0 = c^0 d^0$ and $a^0$,
\begin{align*}
Y - Xb - Za &= u - X(b - b^0) - Z(a - a^0) \\
&= u - X(d - d^0)c - X d^0 (c - c^0) - Z(a - a^0) \\
&= u - \frac{X}{\sqrt{T}}\, \sqrt{T}(d - d^0)c - W(c - c^0) - Z(a - a^0) \\
&= u - D B_c (\theta - \theta^0),
\end{align*}
where $\theta = (\sqrt{T}\, d', g')'$ with $g = (c, a')'$, $W = X d^0$, $D = (X/\sqrt{T}, G)$ with $G = (W, Z)$, and
$B_c$ denotes a diagonal matrix whose first $k$ diagonal elements are $c$ and the others are 1.
Likewise, $B_{\hat c}$ denotes the matrix replacing $c$ in $B_c$ with $\hat c$.
In addition, the $\ell_1$ penalty terms can be rewritten as
$$ \lambda\left(\|b\|_1 + \|a\|_1\right) = \lambda\left(c + \|a\|_1\right) = \lambda \|g\|_1, $$
with the constraint $\|d\|_1 = 1$ holding by construction.
Consequently, in the presence of a cointegrating relationship, the infeasible LASSO estimator for $\theta$ can be represented by
$$ \hat\theta = \arg\min_{\theta:\, \|d\|_1 = 1}\; \frac{\|u - D B_c(\theta - \theta^0)\|_2^2}{T} + \lambda \|g\|_1. \qquad (4) $$
Note that (4) is a constrained minimisation problem, given that the estimator for $\theta$ is obtained
under the constraint that $\|d\|_1 = 1$. Note, also, that $\lambda\|g\|_1$ can be replaced by $\lambda\|\theta\|_1$ without
changing the minimisation problem in this set-up. This formulation separates the estimation of the cointegrating relation from the effect of the error correction term, and shows that the
cointegrating relation is estimated without shrinkage. This is useful for inference.
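The algebra behind this reparameterization can be verified numerically. The following sketch, with arbitrary illustrative parameter values, checks that the residual $Y - Xb - Za$ coincides with $u - DB_c(\theta - \theta^0)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, p = 50, 3, 4
X = np.cumsum(rng.standard_normal((T, k)), axis=0)   # persistent (I(1)) regressors
Z = rng.standard_normal((T, p))                      # stationary regressors
u = rng.standard_normal(T)

b0 = np.array([0.6, -0.4, 0.0])                      # true b^0 = c^0 d^0
c0 = np.abs(b0).sum()
d0 = b0 / c0                                         # unit l1-norm cointegrating vector
a0 = np.array([0.5, 0.0, 0.0, 0.0])
Y = X @ b0 + Z @ a0 + u

b = np.array([0.5, -0.3, 0.1])                       # arbitrary candidate (b, a)
c = np.abs(b).sum()
d = b / c
a = np.array([0.4, 0.1, 0.0, 0.0])

lhs = Y - X @ b - Z @ a                              # residual, original parameterization

W = X @ d0                                           # error correction term W = X d^0
D = np.hstack([X / np.sqrt(T), W[:, None], Z])       # D = (X/sqrt(T), W, Z)
theta  = np.r_[np.sqrt(T) * d,  c,  a]               # theta  = (sqrt(T) d', c, a')'
theta0 = np.r_[np.sqrt(T) * d0, c0, a0]
Bc = np.diag(np.r_[np.full(k, c), np.ones(1 + p)])   # first k diagonal entries are c

rhs = u - D @ Bc @ (theta - theta0)                  # residual, reparameterized form
assert np.allclose(lhs, rhs)
```

The two residual vectors agree for any candidate $(b, a)$, which is exactly the display above written in matrix form.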
In the subsection that follows, we establish the large sample properties associated with the
proposed LASSO estimator, $\hat\theta = (\sqrt{T}\hat d', \hat g')'$. We start with the consistency of the proposed
estimator and its convergence rate, and proceed to the large sample distribution of the LASSO
estimator for the cointegrating vector $d$. All proofs for our theoretical results are provided
in Appendix A. We need the following set of regularity conditions to derive the statistical
properties of LASSO estimation of $\theta$ in (4).
Assumptions
A1 Each element of the vector $g$ belongs to a compact set and $\|d\|_1 = 1$ for all $d$.
A2 Let $s_0 = |S_0|$, let $\theta_S$ be the subvector of $\theta$ whose elements are in the index set $S$, and let $\Sigma = D'D/T$ be the properly scaled Gram matrix. Then, there exists a sequence of positive constants $\phi_T$ such that for all $\theta$ whose first $k$ elements do not lie in the linear span of $d^0$ and $\|\theta_{S_0^c}\|_1 \le 3\|\theta_{S_0}\|_1$, the following holds with probability approaching one:
$$ \|\theta_{S_0}\|_1^2 \le \left(\theta' \Sigma\, \theta\right) s_0 / \phi_T^2. $$
A3 Let the nonstationary predictors be $x_t = \sum_{k=1}^{[Tt]} v_k$. Then $\{v_t u_t\}$ and $\{z_{tj} u_t\}_{j=1,\dots,p}$ are centered $\alpha$-mixing sequences such that for some positive $\gamma_1$, $\gamma_2$, $a$, and $b$,
$$ \alpha(m) \le \exp\left(-a m^{\gamma_1}\right), \qquad P\left(|z_{tj}| > z\right) \le H(z) := \exp\left(1 - (z/b)^{\gamma_2}\right) \text{ for all } j, \qquad (5) $$
$$ \gamma := \left(\gamma_1^{-1} + \gamma_2^{-1}\right)^{-1} < 1, \qquad \ln p = O\left(T^{\gamma_1 \wedge \left(1 - 2\gamma_1(1-\gamma)/\gamma\right)}\right). \qquad (6) $$
Assumption A1 is concerned with restrictions on the parameter space and a normalization
restriction on the cointegrating vector. Compatibility conditions such as A2 are common in the
large sample theory for LASSO estimation, and their purpose is to place weak restrictions on
the design matrix of regressors so as to bound the prediction and parameter estimation errors
in terms of the $\ell_1$ norm. Here, the format of $D$ accounts for the different dimensions of the
persistent and stationary regressors, and given the asymptotic orthogonality between these
two sets of regressors, one can think of our compatibility condition as providing separate
conditions for each set. The Gram matrix of the persistent regressors does not converge to a
deterministic matrix, and therefore the compatibility factor $\phi_T$ is allowed to vary over $T$ and
converge to zero. In the special case in which $k$ is fixed, we may let it decay at a $\log T$ rate.
This is reflected in the deviation bound in Theorem 1.
Assumption A3 is a technical condition required for Lemmas 1 and 2. Notably, Assumption A3 implies Assumption 1 in Hansen (1992), which ensures weak convergence of the sum
of the product of $x_t$ and $u_t$ in (1) to a stochastic integral. We can relax the tail condition for
the stationary regressors $z_t$ at the expense of slower rates of convergence.
It is worth noting that these conditions are quite general, accounting for the time series
features of the data and the model in Section 3. For instance, we do not require that the error
process is a martingale difference sequence. It suffices that $\{v_t\}$ and $\{u_t\}$ are strong-mixing.
4.1 Consistency and rate of convergence
Theorem 1 Let $p = O(T^r)$ for some $r < \infty$ and $k = O(\sqrt{T})$. Then, under the regularity conditions A1-A3 specified above,
$$ \frac{\|D B_{\hat c}(\hat\theta - \theta^0)\|_2^2}{T} + \|B_{\hat c}(\hat\theta - \theta^0)\|_1 = O_p\left(\frac{s_0 \lambda}{\phi_T}\right), \qquad (7) $$
where $s_0$ is the number of non-zero elements in $\theta^0$ and $\lambda^{-1} = o\left(\left(\sqrt{T}/\ln p\right) \wedge \left(\sqrt{T}/\left(T^{\eta} \wedge k \ln k\right)\right)\right)$ for any $\eta > 0$. Furthermore, if $(\min_j |\theta^0_{S_0,j}|)^{-1} = o((s_0\lambda/\phi_T)^{-1})$, then the LASSO selects all relevant predictors with probability approaching one.
Proof. See Appendix A.
Theorem 1 establishes deviation bounds for the prediction error and the parameter estimation error in terms of the $\ell_1$-norm. It permits the numbers of both stationary and non-stationary variables to diverge. However, our bound is more sensitive to the number $k$ of
nonstationary variables, in the sense that $k$ increases the bound more quickly than does $p$.
Note that when $k$ is fixed or increases only at a logarithmic rate, the deviation bound becomes $s_0/(\sqrt{T}\phi_T)$ up to a logarithmic factor. Furthermore, when $k$ is allowed to increase to
infinity faster, the compatibility sequence $\phi_T$ may need to decay to zero at a faster rate.
Theorem 1 ensures that the LASSO can serve as a predictor-selection tool in a linear predictive
regression, in that $P\left(S_0 \subset \hat S\right) \to 1$, where $S_0 := \{j : \theta^0_j \ne 0\}$ and $\hat S$ is the corresponding
estimate. This is so even when there are two different types of covariates, either stationary
($z_t$) or nonstationary ($x_t$), and the dimension of $z_t$ is very large, as long as the dimension of $x_t$ increases at most at a logarithmic rate. This property is essential for stock return
predictability because the number of plausible return predictors is very large and they could be
nonstationary or stationary. LASSO is an automated method that selects relevant predictors
under a sparsity assumption so as to achieve parsimony without sacrificing precision.5 This
becomes important when the number of observations is relatively small compared to the
number of available predictors.
For the case where the dimension of $x_t$ increases to infinity, Theorem 1 shows that the
convergence rate for $\hat\theta$ is slower than $O_p\left(\sqrt{s_0 \ln p / T}\right)$, which is the usual rate of convergence
5Here, sparsity means that although there might be many potential predictors, the number of truly relevant predictors is small.
of the LASSO estimator for linear models when the data are independent (see Bühlmann and
van de Geer, 2011, page 106). The slower rate of convergence occurs in our case because we
allow for quite general dependence. However, it is worth noting that due to the presence
of a cointegrating relationship, the rate of convergence for $\hat d$ is faster than the usual rate of
convergence for LASSO estimation, and can achieve the oracle rate of $T^{-1}$ under some proper
conditions, as shown below.
4.2 Large sample distribution of the LASSO estimator for the cointegrating vector
This section discusses the large sample distribution of the proposed estimator for the cointegrating vector, $\hat d$, when its dimension $k$ is fixed. From (3), when $\hat c$ and $\hat a$ are given, we can
focus on the estimator for $d$, the coefficient associated with $X\hat c$. The objective function (3)
becomes
$$ \hat d = \arg\min_{d:\, \|d\|_1 = 1}\; \frac{\|\tilde Y - \tilde X d\|_2^2}{T}, \qquad (8) $$
where $\tilde Y = Y - Z\hat a$ and $\tilde X = X\hat c$.
It is worth noting that there is no shrinkage on the cointegrating vector, due to the restriction on the parameter space, $\|d\|_1 = 1$. This feature is helpful for better prediction and
forecasting because the collective explanatory power coming from the persistent predictors constituting the cointegrating vector is not compromised by LASSO estimation. This is relevant
for the return predictive regression, in which it is highly likely that there exists a cointegrating relationship within the persistent predictors, given its seemingly unbalanced nature. As can
be seen from (8), we can derive the large sample distribution of $\hat d$ based on a constrained
least squares with a nonstandard norm restriction using the $\|\cdot\|_1$ norm. The large sample
distribution of the appropriately scaled $\hat d - d^0$ can be obtained as the minimizer of a Gaussian
process.
Theorem 2 Let
$$ D(b) = -2 c^0 b' \left( \int_0^1 W(r)\, dU(r) + \Lambda \right) + |c^0|^2\, b' \int_0^1 W(r) W(r)'\, dr\; b, $$
with
$$ \Lambda = \lim_{T \to \infty} \frac{1}{T} \sum_{i=1}^{T-1} \sum_{j=1}^{i-1} E(v_{i-j} u_i), $$
where $v_t = x_t - x_{t-1}$ and $(W(\cdot), U(\cdot))'$ is the weak limit of the partial sum process of $(v_t, u_t)'$. Also
let $k$ be fixed, $p = O(T)$, $\lambda^{-1} = o\left(\sqrt{T}/\ln p\right)$, and $s_0 \lambda = O(T^{-q})$ for some positive $q$. Then, under
the regularity conditions A1-A3,
$$ T(\hat d - d^0) \xrightarrow{d} \arg\min_{b:\; \mathrm{sgn}(d^0)' b = -\sum_j |b_j| 1\{d^0_j = 0\}} D(b). \qquad (9) $$
Proof. See Appendix A.
Theorem 2 allows us to make inference on the cointegrating vector $d$. Theorem 2 is nonstandard because there is a restriction on the parameter space. If it were not for this restriction, we could obtain a closed-form expression by deriving the minimizer of the limit
objective function $D(b)$ in Theorem 2. Nevertheless, we can simulate the limit distribution
for the purpose of making inference.6 We do not elaborate on inference here because our primary objective is forecast improvement rather than inference on the cointegrating
vector.
We illustrate the finite sample performance of the asymptotic approximation in Theorem
2 by considering the following design:
$$ y_t = a + b_1 x_{1t-1} + b_2 x_{2t-1} + u_t, $$
where $a = 0.1$, $b = (b_1, b_2)' = (1, -1)'$, $\mathrm{var}(v_{it}) = \sigma^2_{v_i} = 1$ for $i = 1, 2$ where $v_{it} = x_{it} - x_{it-1}$,
and $\mathrm{var}(u_t) = \sigma^2_u = 1$. We consider two cases, in which $T = 1000$ and $T = 10000$, with 10000
repetitions. Without loss of generality, we assume there is no stationary regressor. Note that
the constrained limit objective function we consider is
$$ D(b) = -2 b' \left[ \int_0^1 W(r)\, dU(r) \right] + b' \int_0^1 W(r) W(r)'\, dr\; b $$
subject to $\mathrm{sgn}(b^0)' b = -\sum_j |b_j| 1\{b^0_j = 0\}$, where this $b$ is $c$ times the $b$ in Theorem 2. The minimizer
of the above reparameterized limit objective function with a constraint yields the limit distribution of $T(\hat b - b^0)$. Following our design, the constraint becomes $b_1 = b_2 = b$. This implies
6The limit distribution cannot be tabulated as it is not asymptotically pivotal, but it can be simulated. In particular, the restriction on $b$ can be imposed by replacing $d^0_j$ with $\tilde d_j = \hat d_j 1\{|\hat d_j| \ge c_T\}$ for some $c_T \to 0$ and $T c_T \to \infty$. Due to the super-consistent convergence rate of $\hat d$, this event is equivalent to a restriction on the parameter space with probability approaching one.
that
$$ D(b) = -2b \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{bmatrix} \int_0^1 W_1(r)\, dU(r) \\[4pt] \int_0^1 W_2(r)\, dU(r) \end{bmatrix} + b^2 \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{bmatrix} \int_0^1 W_1^2(r)\, dr & \int_0^1 W_1(r) W_2(r)\, dr \\[4pt] \int_0^1 W_1(r) W_2(r)\, dr & \int_0^1 W_2^2(r)\, dr \end{bmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix}. $$
The minimizer of the above can then be obtained in closed form, and is
$$ b^\ast = \frac{\int_0^1 \left[W_1(r) + W_2(r)\right] dU(r)}{\sum_{i=1}^2 \sum_{j=1}^2 \int_0^1 W_i(r) W_j(r)\, dr} \begin{pmatrix} 1 \\ 1 \end{pmatrix}. $$
Hence,
$$ T(\hat d - d^0) \xrightarrow{d} b^\ast / c^0, $$
where $\hat d = \hat b / \hat c$ with $\hat b$ obtained from a LASSO estimation, $\hat c = |\hat b_1| + |\hat b_2|$, and $d^0 = b^0/c^0$. To
simulate the sample distribution of $b^\ast$, note that
$$ T^{-2} \sum_{t=1}^{T} x_{it} x_{jt} \xrightarrow{d} \int_0^1 W_i(r) W_j(r)\, dr, \qquad T^{-1} \sum_{t=1}^{T} x_{it} u_t \xrightarrow{d} \int_0^1 W_i(r)\, dU(r). $$
Then,
$$ b^\ast \approx \frac{T^{-1} \sum_{t=1}^{T} (x_{1t} + x_{2t}) u_t}{T^{-2} \sum_{i=1}^2 \sum_{j=1}^2 \sum_{t=1}^{T} x_{it} x_{jt}}. $$
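The sample approximation above is straightforward to simulate. A minimal sketch follows, using i.i.d. standard normal innovations purely for illustration (the theory allows general mixing sequences; with $v_t$ and $u_t$ independent, the one-sided long-run covariance $\Lambda$ of Theorem 2 is zero, matching the design above). It draws from the distribution of the scalar factor in $b^\ast$:

```python
import numpy as np

def simulate_bstar(T=1000, n_rep=2000, seed=0):
    """Monte Carlo draws of the scalar factor in b* via its sample analogue:
    numerator   T^{-1} sum_t (x_{1t} + x_{2t}) u_t,
    denominator T^{-2} sum_i sum_j sum_t x_{it} x_{jt}
              = T^{-2} sum_t (x_{1t} + x_{2t})^2."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_rep)
    for r in range(n_rep):
        v = rng.standard_normal((T, 2))   # innovations of the two I(1) predictors
        u = rng.standard_normal(T)        # regression error, independent of v
        x = np.cumsum(v, axis=0)          # x_{it} = sum_{s<=t} v_{is}: random walks
        s = x[:, 0] + x[:, 1]
        draws[r] = (s @ u / T) / (s @ s / T**2)
    return draws
```

Histograms of these draws approximate the (mixed normal, hence non-pivotal) limit distribution that Figures 1 and 2 compare against the sampling distribution of the LASSO estimator.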
Meanwhile, for the sample distribution of $(\hat d - d^0)$, we employ the LASSO estimation with a
fixed tuning parameter, $\lambda = \frac{\ln p}{10\sqrt{T}}$. This choice reflects the rate of convergence of the estimated
cointegrating coefficients when selecting the set of predictors. Under this setting, Figures 1
and 2 provide histograms of the limiting distribution and the sample distributions of
(the elements of) $(\hat d - d^0)$ for 1,000 and 10,000 observations, where $\hat d$ is the LASSO estimator,
as well as the associated QQ-plots. We can see that the sampling distribution of $(\hat d - d^0)$ is
centered around 0, and although the QQ-plots show some differences between the sampling
and limiting distributions around the tails when $T = 1{,}000$, these differences are barely
apparent when $T = 10{,}000$.
Figure 1: Limiting distribution and sample distribution of (d - d^0) and their QQ-plot, T = 1000
a The first panel depicts the limiting distribution of (d_1 - d_1^0), whereas the second panel depicts its sample distribution.
b The limiting and sample distributions of (d_2 - d_2^0) are identical to those for (d_1 - d_1^0) by construction.

Figure 2: Limiting distribution and sample distribution of (d - d^0) and their QQ-plot, T = 10000
5 Simulation study
In this section, we use simulation to evaluate the ability of the LASSO estimator to select a set
of predictors that leads to out-of-sample forecasting gains. We start with a baseline predictive
regression that incorporates a mixture of just four persistent and six non-persistent predictors,
and then we expand the number of non-persistent predictors out to two hundred, which
allows us to consider high-dimensional predictive regressions as well. Our data generating
processes (DGPs) include a mixture of stationary and non-stationary cointegrated predictors
as in (1), and although (1) and the other equations in our text omit a constant to simplify our
exposition, we include a constant in all our simulated data and estimated equations.
We consider four estimation sample sizes (each generated sample follows a burn-in of
1000 observations) and ten different forecast methods. Given an estimation sample of size T,
we focus on forecasting yT+1 using different methods, including our proposed LASSO based
methods. We evaluate each forecast strategy in terms of the one-step-ahead forecast errors
associated with yT+1.
5.1 Simulation of our baseline DGP
The baseline DGP that we study incorporates several well known characteristics of financial
and macroeconomic data. These include high persistence in some (but not all) individual
series, interdependencies between many variables, and rich system-wide dynamics.
The specification associated with the persistent variables xt in this DGP is based on a four
variable vector error correction representation given by (10a), in which the first right-hand
side vector represents a 4 ⇥ 1 loading matrix (1 is the cointegrating rank in this setting) while
the second right-hand vector is the cointegrating vector. Hence, xt is a 4-dimensional I(1) pro-
cess with cointegrating rank 1 and from (10a), the cointegrating vector is d = (1, �0.5, 0, 0)0.
The non-persistent predictors zt in this DGP follow a 6-dimensional stationary VAR(2)
process with a complicated dependence structure. This structure, given in (10b), is partly
taken from an estimated VAR(2) model of US macroeconomic data (see Martin et al., 2012, p.
485, example 13.19), so it is representative of data obtained from a real economy.
The coefficients in the overall predictive regression model (1) are given in (10c), and these
are set to be b = cd = 0.3d and a = (�0.1, 0, 0.02, 0.4, 0, �0.2)0. All of the error terms in (10)
(i.e. (ut, #x,t, #z,t)) are i.i.d N (0, I). Putting these together, the full specification of our baseline
DGP is then
\Delta x_t = \begin{pmatrix} 0 \\ 0.3 \\ 0 \\ 0.2 \end{pmatrix} \begin{pmatrix} 1 & -0.5 & 0 & 0 \end{pmatrix} x_{t-1} + \varepsilon_{x,t},   (10a)

z_t = \begin{pmatrix}
0.4 & 0 & 0 & 0 & 0 & 0 \\
0 & 1.31 & 0.29 & 0.12 & 0.04 & 0 \\
0 & -0.21 & 1.25 & -0.24 & 0.04 & 0 \\
0 & 0.07 & 0.03 & 1.16 & 0.01 & 0 \\
0 & 0.08 & 0.27 & -0.07 & 1.25 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.4
\end{pmatrix} z_{t-1}
+ \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & -0.35 & -0.28 & -0.07 & -0.02 & 0 \\
0 & 0.19 & -0.26 & 0.24 & -0.05 & 0 \\
0 & -0.07 & -0.02 & -0.16 & 0.01 & 0 \\
0 & -0.13 & -0.23 & 0.03 & -0.31 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} z_{t-2} + \varepsilon_{z,t},   (10b)

y_t = 0.15 + x_{t-1}' \begin{pmatrix} 0.3 \\ -0.15 \\ 0 \\ 0 \end{pmatrix} + z_{t-1}' \begin{pmatrix} -0.1 \\ 0 \\ 0.02 \\ 0.4 \\ 0 \\ -0.2 \end{pmatrix} + u_t.   (10c)
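As a concrete illustration, the baseline DGP in (10) can be simulated in a few lines. The numpy sketch below is our own (the function name and the burn-in argument are ours, not the authors'); it follows (10a)-(10c) with i.i.d. standard normal errors and a burn-in, as described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Coefficients of the baseline DGP in (10): alpha_load and delta are the
# loading and cointegrating vectors in (10a); A1 and A2 are the VAR(2)
# matrices in (10b); b and a are the predictive coefficients in (10c).
alpha_load = np.array([0.0, 0.3, 0.0, 0.2])
delta = np.array([1.0, -0.5, 0.0, 0.0])
b = 0.3 * delta
a = np.array([-0.1, 0.0, 0.02, 0.4, 0.0, -0.2])
A1 = np.array([
    [0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.31, 0.29, 0.12, 0.04, 0.0],
    [0.0, -0.21, 1.25, -0.24, 0.04, 0.0],
    [0.0, 0.07, 0.03, 1.16, 0.01, 0.0],
    [0.0, 0.08, 0.27, -0.07, 1.25, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4],
])
A2 = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, -0.35, -0.28, -0.07, -0.02, 0.0],
    [0.0, 0.19, -0.26, 0.24, -0.05, 0.0],
    [0.0, -0.07, -0.02, -0.16, 0.01, 0.0],
    [0.0, -0.13, -0.23, 0.03, -0.31, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

def simulate_dgp(T, burn=1000):
    """Simulate (y, x, z) of length T from the baseline DGP (10)."""
    n = T + burn
    x = np.zeros((n, 4))
    z = np.zeros((n, 6))
    y = np.zeros(n)
    for t in range(2, n):
        # (10a): error-correction step, Delta x_t = alpha * (delta' x_{t-1}) + eps
        x[t] = x[t - 1] + alpha_load * (delta @ x[t - 1]) + rng.standard_normal(4)
        # (10b): stationary VAR(2) for the non-persistent predictors
        z[t] = A1 @ z[t - 1] + A2 @ z[t - 2] + rng.standard_normal(6)
        # (10c): predictive regression with intercept 0.15
        y[t] = 0.15 + b @ x[t - 1] + a @ z[t - 1] + rng.standard_normal()
    return y[burn:], x[burn:], z[burn:]
```

Note that the cointegrating combination delta' x_t follows an AR(1) with coefficient 1 + delta' alpha_load = 0.85, so it is mean-reverting while each component of x_t is I(1).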
5.2 Forecast methods
The first of our ten forecast strategies uses OLS and all predictors to estimate the standard linear predictive regression model in (1), given by

y_t = x_{t-1}' b + z_{t-1}' a + u_t,

to obtain \hat y_{T+1}, while our second strategy predicts y_{T+1} by using its historical average, i.e. \hat y_{T+1} = \frac{1}{T} \sum_{i=1}^{T} y_i. The third method predicts y_{T+1} via the estimation of an AR(1) model:

y_t = r\, y_{t-1} + u_t.   (11)

Then, since log-price series are often modelled as random walks, we also use a random walk without drift as the fourth forecast method for comparison. We expect this model to show inferior forecast performance when predicting returns because it restricts r = 0 in (11).
The fifth method is based on principal component analysis (PCA), as suggested by Stock
and Watson (2002) when there are many predictors. Due to the different statistical properties
of I(1) and I(0) series, we extract the first 4 principal components from the (standardised) I(0)
series, and then use them together with the 4 I(1) predictors in the forecasting exercise.
For the sixth method, we employ a forecast combination a la Timmermann (2006). More
specifically, we compute the combined forecast, yT+1, as an equally weighted average of all
forecasts that are separately obtained from distinctive linear predictive regression models that
each condition on just one predictor:
\hat y_{T+1} = \frac{1}{M} \sum_{m=1}^{M} \hat y_{T+1}^{m},   (12)

where M = k + p is the total number of predictors, and \hat y_{T+1}^{m} is a forecast based on a bivariate
linear predictive regression that conditions on just the m-th predictor. We include forecast
combination as one of our competing forecast strategies because it has been widely recognized
as a way to improve forecast performance relative to a forecast based on a single model
(Timmermann, 2006; Rapach et al., 2010). Similar to the use of LASSO, the combined forecast
approach alleviates the impact of model uncertainty and parameter instability.
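A minimal sketch of this combination scheme (our own function and conventions, not the authors' code; row t of X is assumed to hold the predictors dated t, so X[:-1] predicts y[1:] and X[-1] produces the forecast of y_{T+1}):

```python
import numpy as np

def combination_forecast(y, X):
    """Equal-weight forecast combination as in (12): fit one bivariate
    predictive regression per predictor (a constant plus a single lagged
    predictor) by OLS, forecast y_{T+1} from each, and average.
    """
    y = np.asarray(y, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    T, M = X.shape
    forecasts = np.empty(M)
    for m in range(M):
        # regress y_t on a constant and the m-th predictor lagged once
        Z = np.column_stack([np.ones(T - 1), X[:-1, m]])
        coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
        forecasts[m] = coef[0] + coef[1] * X[-1, m]
    return forecasts.mean()
```

Because each component model conditions on a single predictor, no individual regression can overfit badly, and averaging dampens the noise in any one model's estimates.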
The next three methods are different variants of LASSO. We apply a standard LASSO
method with the following objective function:
\arg\min_{b,a} \Big\{ \frac{1}{T} \sum_{t=1}^{T} \big( y_t - x_{t-1}' b - z_{t-1}' a \big)^2 + \lambda \|b\|_1 + \lambda \|a\|_1 \Big\},   (13)
where \|\cdot\|_1 is the \ell_1-norm. The penalty term \lambda can be estimated using the cross-validation algorithm of Qian et al. (2013), which minimizes the cross-validated error loss (LASSOmin); we call the resulting penalty \lambda_{min}. An alternative way to estimate \lambda is via forward chaining, which leads to \lambda_{fwd}. Here, we work with a grid of \lambda values and the first 0.9T observations in the generated series of T observations to estimate a set of coefficients for each \lambda. We then use each set of coefficients together with a rolling window of 0.9T observations to obtain one-step-ahead forecasts for the remaining 0.1T data points, and choose \lambda_{fwd} to minimise the associated forecast mean squared error (FMSE). We also consider setting \lambda = \ln p / (10\sqrt{T}) to obtain a penalty that we call \lambda_{fix}. This latter choice of \lambda is based on a metric associated with the set of non-persistent variables z_t (see Lemma 1), and its potential advantage is that it takes the rate of convergence of the estimated cointegrating coefficients into account when selecting the set of predictors.
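For concreteness, here is a numpy-only sketch of the objective in (13) and the forward-chaining selection. It is our own simplification, not the authors' implementation: coefficients for each lambda are fit once on the first 0.9T observations rather than re-estimated on a rolling window, and the intercept and standardization steps are omitted.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO for the objective in (13):
    (1/T) * RSS + lam * ||coef||_1 (no intercept; a minimal sketch)."""
    T, p = X.shape
    coef = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding predictor j
            r = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ r / T
            # soft-thresholding update for coordinate j
            coef[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return coef

def forward_chaining_lambda(X, y, grid):
    """Pick lambda_fwd: fit on the first 90% of the sample for each
    lambda on the grid, score one-step-ahead squared errors on the
    remaining 10%, and keep the lambda with the smallest FMSE."""
    split = int(0.9 * len(y))
    fmses = []
    for lam in grid:
        coef = lasso_cd(X[:split], y[:split], lam)
        errs = y[split:] - X[split:] @ coef
        fmses.append(np.mean(errs ** 2))
    return grid[int(np.argmin(fmses))]
```

The soft-threshold level lam / 2 comes from differentiating (1/T) * RSS, whose coordinate-wise gradient carries a factor of 2.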
Our final forecasting technique is based on the elastic net by Zou and Hastie (2005), which
is often used in macroeconomic or financial applications when regressors are highly correlated. The elastic net mixes the \ell_1 and \ell_2 norms in the penalty term,

\arg\min_{b,a} \Big\{ \frac{1}{T} \sum_{t=1}^{T} \big( y_t - x_{t-1}' b - z_{t-1}' a \big)^2 + \lambda \big[ \eta ( \|b\|_1 + \|a\|_1 ) + (1 - \eta) ( \|b\|_2^2 + \|a\|_2^2 ) \big] \Big\},   (14)

and in our case we set the mixing parameter \eta to 0.5.
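Relative to the LASSO, the coordinate-descent update for (14) changes in only two places: the soft-threshold level is scaled by eta, and the \ell_2 part enters the denominator. A numpy sketch under the same simplifying assumptions as before (our own code; no intercept, standardized regressors):

```python
import numpy as np

def enet_cd(X, y, lam, eta=0.5, n_iter=200):
    """Coordinate descent for the elastic-net objective in (14):
    (1/T) * RSS + lam * [eta * ||coef||_1 + (1 - eta) * ||coef||_2^2]."""
    T, p = X.shape
    coef = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ coef + X[:, j] * coef[j]   # partial residual
            rho = X[:, j] @ r / T
            # l1 part: soft threshold at lam*eta/2; l2 part: ridge shrinkage
            coef[j] = (np.sign(rho) * max(abs(rho) - lam * eta / 2.0, 0.0)
                       / (col_sq[j] + lam * (1.0 - eta)))
    return coef
```

With eta = 1 this reduces to the LASSO update, and with eta = 0 it reduces to ridge regression; the ridge term in the denominator is what stabilizes the estimates when regressors are highly correlated.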
We generate 5000 samples of T = 100 to estimate the parameters that are needed (together
with the sample data) to predict yT+1 for each of the ten forecast methods. This generates
50000 predictions for estimation samples of size 100, and we use these, together with yT+1
to calculate 5000 forecast errors for each of the ten forecast methods. We then repeat this for
T = 200, 400, and 1000, so that a total of four estimation sample sizes are considered. Note
that we do not encounter any estimation problems when we generate data from our baseline
DGP, but later, when we start generating high-dimensional predictive DGPs in Section 5.3 and
work with small estimation samples, we will not have sufficient observations to implement
the first of our forecast methods which is based on OLS.
We compare the forecast performance for different methods by considering two measures.
The first of these is the forecast mean squared error (FMSE), while the second is a measure of out-of-sample relative fit (RFIT) that is closely related to the out-of-sample R^2 measure used by Welch and Goyal (2008). The former is defined as the expected squared difference between the actual realization and the one-step-ahead forecast: FMSE := E[y_{T+1} - \hat y_{T+1}]^2. The latter is defined as RFIT := 1 - FMSE_i / FMSE_{OLS} for each forecasting method i. We use the OLS-based forecast method as our benchmark when calculating the out-of-sample RFIT because this method is typically used in empirical applications of predictive regressions.7
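Both measures are one-liners in code; a small sketch with our own function names:

```python
import numpy as np

def fmse(y_true, y_fcst):
    """Forecast mean squared error over a set of one-step-ahead forecasts."""
    y_true, y_fcst = np.asarray(y_true, float), np.asarray(y_fcst, float)
    return float(np.mean((y_true - y_fcst) ** 2))

def rfit(fmse_method, fmse_ols):
    """Out-of-sample relative fit against the OLS benchmark:
    positive values mean the method beats OLS."""
    return 1.0 - fmse_method / fmse_ols
```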
Table 1 presents FMSE and out-of-sample RFIT for the forecasting strategies based on
5000 repetitions. Overall, the LASSO methods dominate, with LASSOfix having lower FMSE
and higher RFIT than all other forecast methods when the sample size is small, and LASSOmin
playing this role when the sample size becomes larger. We can infer that the bias induced
by adopting the LASSO is small, whereas the reduction in the estimation variance domi-
nates. The LASSOmin, LASSOfix and elastic net methods produce smaller FMSEs than the
benchmark OLS forecasts across the four different sample sizes (with only one exception),
and hence they produce positive out-of-sample RFIT values. On the other hand, the other
methods always lead to inferior forecasting performance compared to the OLS benchmark.
The bias induced by using the historical average is particularly costly, and it is interesting to
note that despite considerable empirical evidence that forecast combinations are often helpful,
they do not appear to help in this setting. One possible reason for this is that the simulated
DGP used here has a stable dependence structure, whereas an important practical advantage
of forecast combination is that it can ameliorate the effects of structural change. The PCA4
and the forward chaining LASSO methods generate FMSEs that are of similar magnitudes as
the three best performing forecast strategies, but they do not show any advantage.
We conjecture that the LASSO approach renders the presence of a cointegrating relation-
ship conspicuous due to the shrinkage imposed on the irrelevant non-stationary predictors,
and this, together with the small but remaining nonzero coefficients on stationary predictors
leads to better forecast performance. Since OLS can also “pick up” cointegrating relation-
ships as the number of observations increases, the forecasting performance based on the OLS
estimation of the predictive regression becomes closer to that of the LASSO for large samples.
This is confirmed in Table 1. When we have a small sample size T = 100, the improvement
in the FMSE of LASSOmin over the OLS benchmark, as measured by the out-of-sample RFIT,
is 2.96%. However, as the sample size increases to T = 1000, the out-of-sample RFIT of LASSOmin reduces to 0.03%.

7 All code for our simulations in Section 5 and our forecasts in Section 6 can be found at https://sites.google.com/site/myunghseo/research
5.3 Alternative DGPs
One clear advantage of the LASSO approach is that the dimension of the stationary variables
zt can be arbitrarily large, and indeed can be larger than the sample size T. In the previous
DGP we set k = 4 and p = 6. In this section we investigate how the forecasting performance of
LASSO compares with alternative forecasting strategies as the dimension p of the stationary
predictors increases. Starting from p = 6, we expand the dimension of the non-persistent
regressors to p = 12, 20, 50, 100 and 200. Our DGP structure is similar to that in Section 5.1
in that our DGP for the nonstationary variables still follows (10a), and the stationary variables
are still generated by VAR(2) processes, but the sets of coefficients for these processes are now
randomly drawn (so that our analysis no longer focuses on just one VAR). The coefficients in
the associated loading vector a are also drawn randomly, in a way that ensures that the set of
elements of zt that actually contribute to the generation of yt is sparse. We provide more detail
on the stationary components of the DGPs below. We use the same forecast methodologies
as in the previous section (except when OLS is infeasible) and as before, we base our forecast
comparison on the FMSE associated with one-step-ahead forecasts.
The details relating to the stationary components of our DGPs are that we consider 100
different VAR(2)s for each sample size, and work with 1000 replications of each.8 The coeffi-
cients for each of these 100 VAR(2)s are drawn randomly from a normal distribution with a
zero mean and variance of less than 0.2, and we have made sure that each drawn VAR(2) pro-
cess is stationary, by checking that its maximum eigenvalue is less than 0.8 in absolute value.
We impose weak sparsity by randomly setting half of the a coefficients equal to zero and
drawing the remaining non-zero coefficients in a from a uniform distribution over [�0.3, 0.3].
The non-zero coefficients are small, making the dependence of yt on zt relatively weak. We
should emphasize that this design is not favorable to the LASSO due to the many smallish
non-zero coefficients. The average FMSE produced across 1000 replications of the 100 DGPs
are tabulated in Table 2, and the smallest FMSE in each row appears in bold letters with an
associated tick.
The results observed previously are largely preserved when we increase the dimension
of the stationary predictors zt. The three forecasting methods based on the LASSO and the
elastic net produce more accurate forecasts in terms of FMSE, with LASSOfix performing the
8These simulated VAR(2) processes are available upon request.
Table 1: Forecasting performance of different forecast strategies

FMSE
T     OLS     Average  AR(1)   r-walk  PCA4    F-comb  LASSOfwd  LASSOmin  LASSOfix  el-net
100   1.2693  7.1731   2.2369  2.3820  3.9902  4.4995  1.3382    1.2316    1.2132✓   1.2399
200   1.1051  7.9231   2.1883  2.3349  4.3106  5.2708  1.1568    1.0952    1.0923✓   1.0987
400   1.0183  7.8535   2.1278  2.3002  4.0072  5.4067  1.0305    1.0133✓   1.0136    1.0150
1000  0.9977  8.3772   2.1469  2.3593  4.1956  5.9519  1.0045    0.9974✓   1.0013    0.9975

Out-of-sample RFIT
100   --      -4.6514  -0.7624  -0.8767  -2.1437  -2.5450  -0.0544  0.0296   0.0442✓  0.0232
200   --      -6.1696  -0.9802  -1.1129  -2.9007  -3.7696  -0.0468  0.0089   0.0115✓  0.0057
400   --      -6.7126  -1.0896  -1.2590  -2.9353  -4.3098  -0.0120  0.0049✓  0.0046   0.0032
1000  --      -7.3969  -1.1519  -1.3649  -3.2055  -4.9659  -0.0069  0.0003✓  -0.0037  0.0002

a Each column denotes one forecasting strategy, from left to right: (1) OLS model; (2) Historical average; (3) AR(1) model; (4) Random walk model; (5) Principal component analysis (PCA); (6) Forecast combination; (7) Forward chaining LASSO; (8) Minimized cross-validated LASSO; (9) Fixed penalty LASSO; (10) Elastic net. A tick (✓) marks the best-performing method in each row.
b The out-of-sample RFIT is defined as RFIT = 1 - FMSE_i/FMSE_OLS for each model i.
Table 2: Comparison of FMSE for different forecast strategies as the number of stationary predictors (p) grows

p    T     OLS     Average  AR(1)   r-walk   PCA4    F-comb  LASSOfwd  LASSOmin  LASSOfix  el-net
6    100   1.1637  1.5259   1.4629  2.3513   1.1875  1.4714  1.2162    1.1473    1.1401✓   1.1542
     200   1.0749  1.5311   1.4586  2.3548   1.1245  1.4887  1.1275    1.0706    1.0696✓   1.0735
     400   1.0335  1.5221   1.4487  2.3667   1.0876  1.4880  1.0550    1.0325✓   1.0338    1.0335
     1000  1.0122  1.5294   1.4554  2.3775   1.0731  1.5011  1.0214    1.0120✓   1.0154    1.0122
12   100   1.2595  1.6997   1.6415  2.6931   1.3559  1.6387  1.3022    1.2241    1.2132✓   1.2355
     200   1.1139  1.6919   1.6247  2.6817   1.2803  1.6412  1.1637    1.1029✓   1.1035    1.1086
     400   1.0522  1.6998   1.6315  2.7119   1.2388  1.6542  1.0764    1.0501✓   1.0554    1.0523
     1000  1.0280  1.6877   1.6135  2.6796   1.2175  1.6466  1.0344    1.0273✓   1.0350    1.0279
20   100   1.3842  1.8769   1.8299  3.0623   1.5670  1.8196  1.4226    1.3117    1.2990✓   1.3288
     200   1.1598  1.8876   1.8278  3.0391   1.4609  1.8359  1.2093    1.1417✓   1.1447    1.1508
     400   1.0801  1.8829   1.8181  3.0380   1.4161  1.8360  1.0930    1.0752✓   1.0818    1.0789
     1000  1.0294  1.8742   1.8137  3.0688   1.3906  1.8301  1.0385    1.0283✓   1.0398    1.0293
50   100   2.3141  2.8287   2.8047  4.8699   2.6074  2.7609  1.9903    1.7607    1.7151✓   1.7767
     200   1.3949  2.8073   2.7709  4.8875   2.4013  2.7430  1.4050    1.3159✓   1.3243    1.3420
     400   1.1767  2.8061   2.7636  4.8898   2.3206  2.7434  1.1687    1.1553✓   1.1685    1.1667
     1000  1.0659  2.7956   2.7439  4.8790   2.3009  2.7347  1.0691    1.0621✓   1.0808    1.0651
100  100   N/A     3.9130   3.9167  7.0734   3.8034  3.8560  3.1791    2.9096    2.7334✓   2.8456
     200   2.1674  3.8956   3.8782  7.1138   3.6052  3.8407  1.8934    1.7428✓   1.7469    1.7892
     400   1.3697  3.8947   3.8688  7.0424   3.4771  3.8406  1.3182    1.2996✓   1.3277    1.3263
     1000  1.1208  3.9227   3.8854  7.0685   3.4284  3.8688  1.1180    1.1106✓   1.1409    1.1179
200  100   N/A     6.7914   6.8466  12.8705  7.0913  6.7299  6.0837    5.9320    5.8183    5.7287✓
     200   N/A     6.8392   6.8440  12.7566  6.6639  6.7768  3.7295    3.4627    3.3619✓   3.4517
     400   2.0915  6.8380   6.8446  12.8389  6.4734  6.7762  1.7807    1.7538✓   1.8223    1.8143
     1000  1.2713  6.7071   6.6963  12.8750  6.3050  6.6462  1.2351    1.2323✓   1.2770    1.2492

A tick (✓) marks the smallest FMSE in each row.
best when the sample size is small and LASSOmin performing the best when the sample size
becomes larger. Interestingly, for p = 200 and T = 100, the elastic net dominates the LASSO
methods in producing more accurate forecasts. The percentage improvement of the LASSOmin
specification relative to the OLS benchmark is always larger when the sample size is small.
Notably, as p increases, all forecast strategies considered here generate larger forecast errors.
The historical average, AR(1) model, random walk, and principal components model are all
mis-specified and hence lead to much larger forecast errors.
6 Stock return predictability
The predictability of the equity premium has been a controversial subject in the finance lit-
erature, generating considerable debate about whether predictive models of excess returns
can outperform the historical average of these returns, and if so, which predictors lead to
good out of sample performance, and over which forecast periods. Welch and Goyal (2008)
conducted a particularly influential study of this topic, providing an extensive review of the
usefulness and stability of many commonly used predictors. They concluded that return pre-
dictability is very weak, both in terms of in-sample fit and out-of-sample forecasting. Their
paper inspired a burst of more recent work on this topic, that has not only proposed new
predictors, but has also exploited innovative ways of extracting information from the existing
set of predictors (see, for example Rapach and Zhou, 2013; Neely et al., 2014; Buncic and
Tischhauser, 2016). In this section we use an extension of the data set examined in Welch and
Goyal (2008), and we demonstrate how LASSO estimation can improve the predictability of
stock returns by allowing for cointegration between persistent predictors.
6.1 Data
We follow the mainstream literature on forecasting the equity premium and use the same
fourteen predictors as Welch and Goyal (2008). An updated version of their data is posted
on the website (http://www.hec.unil.ch/agoyal/), and we use monthly data from January
1945 to December 2012 (816 observations) from this site in the work that follows. Table 3
presents a brief description of the predictors as well as their estimated first-order autocor-
relation coefficients over the entire sample period. The dependent variable of interest is the
equity premium yt, which is defined as the difference between the compounded return on
the S&P 500 index and the three month Treasury bill rate, and most of the predictors are fi-
23
Table 3: Variable definitions and estimated first order autocorrelation coefficients
Predictor Definition Auto.corr(1)d/p dividend price ratio: difference between the log of dividends
and the log of prices0.9940
d/y dividend yield: difference between the log of dividends and thelog of lagged prices
0.9940
e/p earning price ratio: difference between the log of earning andthe log of prices
0.9908
d/e dividend payout ratio: difference between the log of dividendsand the log of earnings
0.9854
b/m book-to-market ratio: ratio of book value to market value forthe Dow Jones Industrial Average
0.9926
ntis net equity expansion: ratio of 12-month moving sums of netissues by NYSE listed stocks over the total end-of-year marketcapitalization of NYSE stocks
0.9757
tbl Treasury bill rates: 3-month Treasury bill rates 0.9891lty long-term yield: long-term government bond yield 0.9935tms term spread: difference between the long term bond yield and
the Treasury bill rate0.9565
d f y default yield spread: difference between Moody’s BAA andAAA-rated corporate bond yields
0.9720
d f r default return spread: difference between the returns oflong-term corporate and government bonds
0.9726
svar stock variance: sum of squared daily returns on S&P500 index 0.4716in f l inflation: CPI inflation for all urban consumers 0.5268ltr long-term return: return of long term government bonds 0.0476
nancial or macroeconomic variables. As shown in Table 3, the majority of these predictors are
highly persistent, with eleven out of fourteen variables having a first order autocorrelation
coefficient that is higher than 0.95. The remaining three variables (the stock variance, inflation
and the long-term government bond return) have little persistence, as does the equity premium (EQ),
which has a first order autocorrelation coefficient of only 0.1494.
We use twenty year rolling windows to allow for possible parameter instability during
our sample, and since our data starts from 1945:01, our out-of-sample predictions of the
equity premium start from 1965:01. We treat the results for the twenty year windows as our
benchmark, but we also conduct some robustness analysis using ten and thirty year rolling
windows. Figure S1 in our online supplement plots the estimated AR(1) coefficients for each
predictor and the equity premium (EQ) using twenty year rolling windows, and it is not only
evident that many predictors exhibit a high persistence level over the entire sample, but also
that the time series dependence structure changes throughout the forecasting period.
We expect to observe cointegrating relationships between the persistent variables, given
the features of the predictors reported in Table 3 and illustrated in Figure S1. Further, fi-
nancial considerations suggest that the dividend price ratio (d/p) and the dividend yield
(d/y) should have a stable long-run relationship. If such cointegrating relationships exist,
they would demonstrate our theoretical framework in Section 3 very nicely, so we study this
aspect of our data next. Specifically, we conduct the tests proposed by Johansen (1991, 1995)
and use the model selection technique proposed by Poskitt (2000) to assess whether there is
cointegration between some of our persistent variables. We exclude the dividend payout ratio
(d/e) and long-term yield (lty) when estimating models to avoid multi-collinearity, making
our cointegration analysis relate to the nine remaining persistent predictors.
Figure 3: Test for cointegration using twenty year rolling windows
[Four panels plot the selected cointegrating rank over 1965-2012 for the Johansen and Poskitt procedures: (a) all 9 persistent variables; (b) exclude d/p and d/y; (c) d/p, d/y, e/p and tbl; (d) d/p and d/y.]
Figure 3 shows the number of cointegrating relationships (i.e. cointegrating ranks) that
are found in each of the twenty year windows in our sample.9 The blue solid line denotes the
cointegrating rank selected using the Johansen test (Johansen, 1991, 1995), and the red dashed
line denotes the selected cointegrating rank when the non-parametric method of Poskitt (2000)
is used. The Johansen test is based on estimating a vector error correction model and it is one
of the most widely used tests for cointegration among multiple time series. However, it is well
known that this test often suffers from the curse of dimensionality (Gonzalo and Pitarakis,
1999), and in particular this test tends to overestimate the cointegrating rank. This is reflected
in Figure 3. Here, we see that the Johansen test always finds at least three cointegrating
relationships between the nine persistent variables, whereas the more conservative Poskitt (2000) procedure only finds a cointegrating rank of one or two. It is interesting to note that when only d/p and d/y are included in the analysis, the Poskitt (2000) procedure always finds that these two variables are cointegrated.

9 The test outcomes based on ten and thirty year windows are qualitatively similar, and the relevant results are available upon request.
6.2 Forecast comparison
We use the same set of forecasting methods as in the simulation study in Section 5, excepting
the method based on principal components since we now have only three I(0) predictors. The
benchmark forecast is based on OLS estimation of (1). Note that all forecasts (including
the historical averages) use just the observations in the relevant rolling window instead of the
entire past history from 1945:01. One-step-ahead FMSEs are tabulated in Table 4, and out-of-sample RFIT measures are provided in Table S1 in our online supplement.
Following Welch and Goyal (2008), we first consider forecasts based on the generic “kitchen
sink” model, where all of the predictors listed in Table 3 are included except for d/e and lty.
In addition, we base predictions on a simplified model that contains the three non-persistent
predictors (svar, infl and ltr) and two persistent predictors, d/p and d/y, which appear from
Figure 3 (d) to be cointegrated.
Tables 4 and S1 show that in terms of overall performance, the forecasts produced by
forecast combination, the three LASSO methods and the elastic net produce the
smallest one-step-ahead FMSE and hence the highest out-of-sample RFIT, over the three
different window sizes. The improvement in out-of-sample RFIT can be as large as 13.7%
over the forecasts based on the kitchen sink model. However, the LASSOfix model sometimes
fails to generate better predictions than the OLS benchmark when the persistent predictors
other than d/p and d/y are excluded from the model. Forecast combination produces the
smallest FMSE using ten year and thirty year windows in the kitchen sink case but it fails to
improve forecast accuracy upon the OLS benchmark using twenty and thirty year windows in
the simplified case that includes only two persistent predictors. In contrast to our simulated
forecasts, the elastic net performs slightly better than the LASSOmin for this simple example
based on real data, perhaps reflecting higher correlations between the actual I(0) variables
than between the simulated I(0) variables.
We conduct pairwise Giacomini and White (2006) tests of equal conditional predictive
ability of OLS forecasts and alternative forecast methods, using the one-step-ahead FMSE as
our loss function. We indicate the significance levels of these tests in Table 4. For the kitchen
sink specification, forecast combinations, LASSOmin and the elastic net produce significantly
Table 4: Comparison of FMSE for different forecast methods

Kitchen sink variables (9 persistent variables as well as svar, infl and ltr)
window   OLS       Average    AR(1)     r-walk    F-comb       LASSOfwd    LASSOmin    LASSOfix   el-net
10-year  0.002089  0.001884†  0.001900  0.003344  0.001803✓**  0.001913**  0.001831**  0.001839*  0.001849**
20-year  0.002046  0.002050   0.002036  0.003598  0.001929     0.002003    0.001928✓†  0.001951   0.001928†
30-year  0.002035  0.002051   0.002042  0.003651  0.001956     0.001928✓   0.001968    0.001961   0.001972

Two persistent variables (d/p and d/y) as well as svar, infl and ltr
10-year  0.001909  0.001884   0.001900  0.003344  0.001808     0.001818    0.001799✓†  0.001843   0.001819†
20-year  0.001914  0.002050   0.002036  0.003598  0.001921     0.001924    0.001866    0.001944   0.001866✓†
30-year  0.001895  0.002051   0.002042  0.003651  0.001931     0.001895    0.001892    0.001929   0.001888✓

a A tick (✓) marks the forecast method with the smallest forecast mean squared error in each row.
b Significance levels of pairwise Giacomini and White (2006) tests of equal conditional predictive ability of OLS forecasts and an alternative forecast method in the corresponding column: †, * and ** denote 10%, 5% and 1% levels of significance.
c Coloured blocks represent superior forecast methods compared to the others using the Hansen (2005) test for superior predictive ability.
better one-step-ahead forecasts (at the 1% level) than the corresponding OLS alternative using
the ten year window, and LASSOmin out-performs OLS at the 10% level using the twenty year
window. For the parsimonious two persistent predictors case, the LASSOmin model and elastic
net perform significantly better than the OLS benchmark at the 10% level over the ten year
rolling window samples.
The Hansen (2005) test for superior predictive ability leads us to a similar conclusion. For
each forecasting method i, we test the null that
H_0 : FMSE_i \le FMSE_j for all j \ne i, j \in \mathcal{J},
where J contains all methods considered here. We use purple blocks in Table 4 to highlight
those cases for which H_0 cannot be rejected at the 5% significance level. Regardless of the
size of the estimation rolling window, we find no statistically significant evidence that the
forecasts generated by combinations, LASSOmin or elastic net are inferior to any of the other
forecast strategies, for either the kitchen sink or for the more parsimonious model. For the
thirty year estimation samples we find no evidence that the LASSOfix and the OLS forecasts
are dominated by other forecasts either.
6.3 Robustness analysis
The tests for cointegration rank shown in Figure 3 suggest that there is usually more than one
cointegration relationship between the nine persistent predictors. We point out that although
our modelling framework assumes just a single cointegrating vector, it is easy to accommo-
date multiple cointegrating relationships. To see this, suppose that there are two normalised
cointegrating vectors are d1 and d2 that have normalisation constants c1 and c2. The fact that
we are considering one-equation predictive regression implies that b = c1d1 + c2d2 and the
LASSO is simply estimating a linear combination of the cointegrating vectors. In other words,
instead of estimating the two cointegrating relationships separately, the LASSO identifies a
linear combination of them in b. The same intuition applies when we are working with sets
of predictors that have higher cointegrating rank.
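A small numerical illustration of why a single coefficient vector suffices (the vectors and constants below are hypothetical numbers of our choosing, not estimates from the paper):

```python
import numpy as np

# Two hypothetical normalised cointegrating vectors and constants.
d1 = np.array([1.0, -0.5, 0.0, 0.0])
d2 = np.array([0.0, 0.0, 1.0, -1.0])
c1, c2 = 0.3, 0.2

# The predictive coefficient is a single vector in the span of (d1, d2),
# so the LASSO only needs to recover this one combination, not d1 and
# d2 separately.
b = c1 * d1 + c2 * d2
D = np.column_stack([d1, d2])
coefs, *_ = np.linalg.lstsq(D, b, rcond=None)  # recovers (c1, c2) exactly
```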
This is exactly the case for our data set relating to stock return prediction. Simple algebra
shows that the LASSO estimates of the coefficients of the linear combination of independent
cointegrating vectors are still consistent. Furthermore, all the asymptotic properties of the
LASSOmin estimates presented in Section 4 still hold. We plot the twenty year rolling window
LASSO estimates of some of the coefficients in a kitchen sink model in Figure 4, and provide
the full set of estimated coefficients in Figures S2 and S3 in our online supplement.
An estimated coefficient of 0 indicates that the LASSOmin model drops the associated
variable from the predictive model. Interestingly, the first two panels of Figure 4 show that
although the coefficients of d/p and d/y often move together (in opposite directions), they
do not always reflect a (1, �1) cointegrating relationship. This is due to the presence of other
persistent predictors whose predictive ability is sporadic, and as these variables enter and
leave the estimated cointegrating relationship, the coefficients of d/p and d/y change. There
is an apparent structural break in early 1996, after which more than half of the persistent
variables are discarded by the LASSO. Only the earnings price ratio (e/p), stock variance (svar) and inflation (infl) appear relevant in most windows after early 1996.
Figure 4, together with Figures S2 and S3 also shows that the signs of some of the esti-
mated coefficients change in different parts of the sample. For instance, the earnings price
ratio (e/p) is positively related to the equity premium prior to 1975 or after 1995, but it is
negatively correlated with the equity premium from 1975 to 1995. This reflects the changing
nature of stock return predictability through time. However, the signs of some coefficients
stay the same throughout the sample. For instance, a higher Treasury bill rate (tbl) or inflation rate (infl) always leads to a lower equity premium, and a higher dividend price ratio (d/p), default yield spread (dfy) or long-term return (ltr) always leads to a higher premium.
We use colour to illustrate the NBER recession dates in Figures 4, S2 and S3 to investigate
whether expansions or recessions have any systematic impact on the predictive ability of
the regressors. There does not appear to be a systematic relationship between the relevance
of any predictor and phases of the business cycle. However, several predictor coefficients
(e.g. d/p, d/y and e/p) change very sharply immediately after the 1981-1982 recession, and
then again just before the 1990-1991 recession, and several (e/p, ntis, svar and infl) take
on different values during the 2007-2009 recession. The results demonstrate the sporadic
nature of the relationship between equity premium and the predictors commonly used in the
literature, which highlights the usefulness of the LASSO to “auto-select” the most relevant
and important predictors over successive estimation windows to achieve forecasting gains.
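The rolling-window "auto-selection" just described can be mimicked with a minimal coordinate-descent LASSO. The code below is a stylised sketch under assumptions of our own (stationary simulated predictors, a single relevant regressor, hypothetical window length and penalty), not the paper's data or tuning:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/(2T))*||y - X b||^2 + lam*||b||_1."""
    T, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)             # per-column sum of squares
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual excluding j
            rho = X[:, j] @ r_j
            b[j] = np.sign(rho) * max(abs(rho) - lam * T, 0.0) / col_ss[j]
    return b

# Hypothetical design: 600 monthly observations, 10 predictors, only the
# first one relevant; 240-month (20-year) windows rolled forward annually.
rng = np.random.default_rng(1)
T, p, window = 600, 10, 240
X = rng.standard_normal((T, p))
y = 0.5 * X[:, 0] + rng.standard_normal(T)

paths = []
for start in range(0, T - window + 1, 12):
    Xw, yw = X[start:start + window], y[start:start + window]
    paths.append(lasso_cd(Xw - Xw.mean(0), yw - yw.mean(), lam=0.1))
paths = np.array(paths)                       # one coefficient vector per window
# Column 0 stays active in every window; most other columns are set to 0,
# mirroring how the LASSO drops irrelevant predictors window by window.
```

In each window the relevant coefficient survives (shrunk towards zero by the penalty) while the noise predictors are mostly zeroed out, which is the mechanism behind the sporadic coefficient paths in Figure 4.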
Finally, we examine the robustness of the forecasting results using the Welch and Goyal
(2008) data, but starting from January 1927. The FMSEs of the competing forecast methods
under consideration are tabulated in Table S2 in our online supplement. Compared with
Table 4, most of the results are qualitatively similar.

Figure 4: LASSOmin estimated coefficients over twenty year windows for eight key predictors (with recession periods shaded in colour). [Figure omitted: eight panels, one each for d/p, d/y, e/p, tbl, dfy, dfr, svar and infl, plotted over 1965-2012.]

Overall, the LASSOmin forecasts perform the best, and pair-wise Giacomini and White (2006) tests in the kitchen sink set-up
always find that forecasts based on LASSOmin estimation are statistically better than those
based on the OLS regression. The Hansen (2005) test for superior predictive ability always ranks the LASSOmin model among the three best specifications for forecasting the equity
premium. In general, we can conclude that LASSO has been successful in dealing with the
sporadic relationship between the predictors and the prediction objective. This, together with the LASSO's ability to capture cointegration among persistent predictors, leads to superior forecasting performance.
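For readers who want the flavour of the pairwise comparisons above: the unconditional version of the Giacomini and White (2006) test with a constant instrument reduces to a Diebold-Mariano-type t-statistic on the loss differential. Below is a minimal sketch of our own simplified implementation (the full conditional test uses additional instruments); the function name and the simulated losses are ours, and a Bartlett (Newey-West) long-run variance is assumed:

```python
import numpy as np

def gw_test_stat(loss_a, loss_b, lags=0):
    """Unconditional Giacomini-White / Diebold-Mariano style t-statistic on the
    loss differential d_t = loss_a_t - loss_b_t, with a Bartlett (Newey-West)
    long-run variance. Positive values favour method b (smaller losses)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    T = d.size
    d_c = d - d.mean()
    lrv = d_c @ d_c / T
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1)              # Bartlett kernel weight
        lrv += 2.0 * w * (d_c[:-k] @ d_c[k:]) / T
    return d.mean() / np.sqrt(lrv / T)

# Simulated example: method b has clearly smaller forecast errors than method a.
rng = np.random.default_rng(4)
e_a = 2.0 * rng.standard_normal(500)          # method a: noisier forecasts
e_b = rng.standard_normal(500)                # method b: more accurate forecasts
stat = gw_test_stat(e_a ** 2, e_b ** 2, lags=4)
# stat is large and positive here, rejecting equal predictive ability in favour of b.
```

Under the null of equal predictive ability the statistic is asymptotically standard normal, so values beyond roughly ±2 indicate a significant accuracy difference at conventional levels.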
7 Conclusion
This paper considers the use of the LASSO in a predictive regression in which the depen-
dent variable is stationary, there are many predictors, and some of these predictors are nonstationary and potentially cointegrated. We show that the LASSO yields a consistent estimator of the coefficients, including those of the cointegrating vector, even when the numbers of both stationary and nonstationary predictors are allowed to diverge, although the dimension of the nonstationary predictors cannot grow as fast as that of the stationary ones. We also
provide the asymptotic distribution of the cointegrating coefficients, which can be simulated
and used for statistical inference. The set of assumptions required for the consistency of the
LASSO estimator allows the dimensions of both the stationary and nonstationary predictors
to be arbitrarily large, making it feasible to implement the LASSO even when the number
of predictors is higher than the sample size. Our simulation studies show that the proposed
LASSO approach leads to superior forecasting performance relative to the commonly used
OLS kitchen sink model, an historical average, and several other widely used forecast procedures.
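As an illustration of the claim that the asymptotic distribution of the cointegrating coefficients "can be simulated and used for statistical inference", the sketch below discretises the scalar special case of such a limit, the ratio of the stochastic integral of W against U to the quadratic variation of W, with W and U independent (i.e. setting the one-sided long-run covariance term to zero). This is our own simplified illustration; the Theorem 2 limit in the paper is the multivariate analogue with a drift correction:

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_rep = 1000, 2000
draws = np.empty(n_rep)
for r in range(n_rep):
    dW = rng.standard_normal(n_steps) / np.sqrt(n_steps)  # increments of W on [0, 1]
    dU = rng.standard_normal(n_steps) / np.sqrt(n_steps)  # increments of U, independent
    W = np.concatenate(([0.0], np.cumsum(dW)))[:-1]       # left-endpoint values of W
    # Discretised version of  int_0^1 W dU  /  int_0^1 W^2 dr :
    draws[r] = (W @ dU) / (W @ W / n_steps)
quantiles = np.quantile(draws, [0.025, 0.975])            # simulated critical values
```

The empirical quantiles of `draws` then serve as simulated critical values for the (here mixed-normal, so symmetric around zero) limiting distribution.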
The application of the LASSO to the problem of predicting the equity premium provides
an interesting illustrative example. Given that there are numerous predictors for the equity
premium, the variable selection function of the LASSO tackles the challenge of selecting the
most relevant variables very nicely. The LASSO also helps to deal with the non-stationarity
of some of the predictors by retaining those in a cointegrating relationship and estimating
the associated cointegrating coefficients. When we compare the performance of forecasts
of the equity premium based on models estimated via the LASSO with forecasts based on
other forecasting methods, we find that the former are usually superior, and we attribute
this finding, at least in part, to the LASSO’s ability to capture cointegration among persistent
predictors.
To the best of our knowledge, our paper is the first to study the statistical properties of
LASSO estimates of predictive regression coefficients in the presence of cointegration. We
should emphasize that our methodology does not exclude other ways of improving the predictability of stock returns, for example, bringing in other macroeconomic variables (Buncic and Tischhauser, 2016) and technical indicators (Neely et al., 2014), or imposing constraints on the return forecasts and/or estimated coefficients (Campbell and Thompson, 2008). As
documented by Buncic and Tischhauser (2016) and Li and Tsiakas (2016), combining addi-
tional information and parameter constraints with LASSO might further improve the pre-
dictability of the equity premium.
Note: We refer readers to our online supplement for empirical results not contained in this
paper.
References
van Binsbergen, J. H. and Koijen, R. S. (2010), ‘Predictive regressions: A present-value approach’, The Journal of Finance 65(4), 1439–1471.

Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data, New York: Springer.

Buncic, D. and Tischhauser, M. (2016), Macroeconomic factors and equity premium predictability, Working paper.

Campbell, J. Y. (1987), ‘Stock returns and the term structure’, Journal of Financial Economics 18(2), 373–399.

Campbell, J. Y. and Shiller, R. J. (1988), ‘The dividend-price ratio and expectations of future dividends and discount factors’, Review of Financial Studies 1(3), 195–228.

Campbell, J. Y. and Thompson, S. B. (2008), ‘Predicting excess stock returns out of sample: Can anything beat the historical average?’, Review of Financial Studies 21(4), 1509–1531.

Campbell, J. Y. and Yogo, M. (2006), ‘Efficient tests of stock return predictability’, Journal of Financial Economics 81(1), 27–60.

Caner, M. and Knight, K. (2013), ‘An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag’, Journal of Statistical Planning and Inference 143(4), 691–715.
Christoffersen, P. and Diebold, F. (1998), ‘Cointegration and long-horizon forecasting’, Journal of Business and Economic Statistics 16, 450–458.

Cochrane, J. H. (1991), ‘Production-based asset pricing and the link between stock returns and economic fluctuations’, The Journal of Finance 46(1), 209–237.

Cremers, K. M. (2002), ‘Stock return predictability: A Bayesian model selection perspective’, Review of Financial Studies 15(4), 1223–1249.

Dow, C. H. (1920), Scientific Stock Speculation, Magazine of Wall Street.

Elliott, G., Gargano, A. and Timmermann, A. (2013), ‘Complete subset regressions’, Journal of Econometrics 177(2), 357–373.

Fama, E. F. and French, K. R. (1993), ‘Common risk factors in the returns on stocks and bonds’, Journal of Financial Economics 33(1), 3–56.

Fama, E. F. and Schwert, G. W. (1977), ‘Asset returns and inflation’, Journal of Financial Economics 5(2), 115–146.

Fan, J. and Lv, J. (2010), ‘A selective overview of variable selection in high dimensional feature space’, Statistica Sinica 20(1), 101.

Giacomini, R. and White, H. (2006), ‘Tests of conditional predictive ability’, Econometrica 74(6), 1545–1578.

Gonzalo, J. and Pitarakis, J.-Y. (1999), Dimensionality effect in cointegration analysis, in R. Engle and H. White, eds, ‘Festschrift in Honour of Clive Granger’, Oxford University Press, pp. 212–229.

Hansen, B. E. (1992), ‘Convergence to stochastic integrals for dependent heterogeneous processes’, Econometric Theory 8(4), 489–500.

Hansen, P. R. (2005), ‘A test for superior predictive ability’, Journal of Business & Economic Statistics 23, 365–380.

Johansen, S. (1991), ‘Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models’, Econometrica 59, 1551–1580.

Johansen, S. (1995), Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.

Kock, A. B. (2016), ‘Consistent and conservative model selection with the adaptive lasso in stationary and nonstationary autoregressions’, Econometric Theory 32(1), 243–259.

Koijen, R. S. and Nieuwerburgh, S. V. (2011), ‘Predictability of returns and cash flows’, Annual Review of Financial Economics 3(1), 467–491.

Koo, B. and Seo, M. H. (2015), ‘Structural-break models under mis-specification: implications for forecasting’, Journal of Econometrics 188(1), 166–181.

Lamont, O. (1998), ‘Earnings and expected returns’, The Journal of Finance 53(5), 1563–1587.

Lettau, M. and Ludvigson, S. (2001), ‘Consumption, aggregate wealth, and expected stock returns’, The Journal of Finance 56(3), 815–849.
Li, J. and Tsiakas, I. (2016), A new approach to equity premium prediction: Economic fundamentals matter!, Working paper.

Liao, Z. and Phillips, P. C. (2015), ‘Automated estimation of vector error correction models’, Econometric Theory 31(3), 581–646.

Ludvigson, S. C. and Ng, S. (2007), ‘The empirical risk–return relation: A factor analysis approach’, Journal of Financial Economics 83(1), 171–222.

Martin, V., Hurn, S. and Harris, D. (2012), Econometric Modelling with Time Series, Cambridge University Press.

Neely, C., Rapach, D., Tu, J. and Zhou, G. (2014), ‘Forecasting the equity risk premium: The role of technical indicators’, Management Science 60(7), 1772–1791.

Pastor, L. and Stambaugh, R. F. (2009), ‘Predictive systems: Living with imperfect predictors’, The Journal of Finance 64(4), 1583–1628.

Phillips, P. C. (2014), ‘On confidence intervals for autoregressive roots and predictive regression’, Econometrica 82(3), 1177–1195.

Phillips, P. C. (2015), ‘Halbert White Jr. memorial JFEC lecture: Pitfalls and possibilities in predictive regression’, Journal of Financial Econometrics 13(3), 521–555.

Pontiff, J. and Schall, L. D. (1998), ‘Book-to-market ratios as predictors of market returns’, Journal of Financial Economics 49(2), 141–160.

Poskitt, D. S. (2000), ‘Strongly consistent determination of cointegrating rank via canonical correlations’, Journal of Business and Economic Statistics 18, 77–90.

Qian, J., Hastie, T., Friedman, J., Tibshirani, R. and Simon, N. (2013), Glmnet for Matlab, http://www.stanford.edu/~hastie/glmnetmatlab/.

Rapach, D. E., Ringgenberg, M. C. and Zhou, G. (2016), ‘Short interest and aggregate stock returns’, Journal of Financial Economics 121(1), 46–65.

Rapach, D. E., Strauss, J. K. and Zhou, G. (2010), ‘Out-of-sample equity premium prediction: Combination forecasts and links to the real economy’, Review of Financial Studies 23(2), 821–862.

Rapach, D. and Zhou, G. (2013), Forecasting stock returns, in ‘Handbook of Economic Forecasting’, Vol. 2, Elsevier.

Rossi, B. (2013), ‘Advances in forecasting under instability’, Handbook of Economic Forecasting 2, 1203–1324.

Stambaugh, R. F. (1999), ‘Predictive regressions’, Journal of Financial Economics 54(3), 375–421.

Stock, J. H. and Watson, M. W. (2002), ‘Forecasting using principal components from a large number of predictors’, Journal of the American Statistical Association 97(460), 1167–1179.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.

Tibshirani, R. (2011), ‘Regression shrinkage and selection via the lasso: a retrospective’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(3), 273–282.
Timmermann, A. (2006), ‘Forecast combinations’, Handbook of Economic Forecasting 1, 135–196.

Welch, I. and Goyal, A. (2008), ‘A comprehensive look at the empirical performance of equity premium prediction’, Review of Financial Studies 21(4), 1455–1508.

Wilms, I. and Croux, C. (2016), ‘Forecasting using sparse cointegration’, International Journal of Forecasting 32, 1256–1267.

Zou, H. and Hastie, T. (2005), ‘Regularization and variable selection via the elastic net’, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67(2), 301–320.
Appendix A
A.1 Proofs of Theorems
Proof of Theorem 1. Since by definition
\[
\|u - DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma\|_1 \le \|u\|_2^2/T + \lambda\|\gamma^0\|_1,
\]
rearranging the above inequality yields
\[
\frac{(\theta-\theta^0)'B_cD'DB_c(\theta-\theta^0)}{T} + \lambda\|\gamma\|_1 \le \frac{2u'DB_c(\theta-\theta^0)}{T} + \lambda\|\gamma^0\|_1. \tag{15}
\]

We start with the first term on the right hand side of (15). Given A1 (i.e. $c$ belongs to a compact parameter space), $c = O(1)$ and thus for any $h > 0$,
\[
\Pr\left\{ 2c\|u'X\|_\infty/T \le (T^h \wedge k\ln k) \right\} \to 1
\]
by Lemma 2. Further, Lemma 1 implies that there exists a $\bar\lambda$ such that
\[
\Pr\left\{ 2\|u'G\|_\infty/T \le \bar\lambda \right\} \to 1
\]
and $\bar\lambda^{-1} = o(\sqrt{T/\ln p})$. Hence, with probability approaching one,
\[
2|u'DB_c(\theta-\theta^0)|/T \le \bar\lambda\|\gamma-\gamma^0\|_1 + cO_p(T^h \wedge k\ln k)\|\delta-\delta^0\|_1
\]
due to Lemma 1 and the Hölder inequality. Therefore, regarding (15), for $\lambda \ge \left(2\bar\lambda \vee \left((T^h \wedge k\ln k)/3\sqrt{T}\right)\right)$,
\[
2\|DB_c(\theta-\theta^0)\|_2^2/T + 2\lambda\|\gamma\|_1 \le \lambda\|\gamma-\gamma^0\|_1 + 2\lambda\|\gamma^0\|_1 + cO_p(T^h \wedge k\ln k)\|\delta-\delta^0\|_1.
\]
Note that due to the triangle inequality,
\[
\|\gamma\|_1 = \|\gamma_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1 \ge \|\gamma^0_{S_0}\|_1 - \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1,
\]
where $S_0$ is the support of the parameter $\gamma^0$, i.e. $S_0 := \{j : \gamma^0_j \ne 0\}$, and by construction
\[
\|\gamma-\gamma^0\|_1 = \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1.
\]
Then, as $k/\sqrt{T} \le 3\lambda$,
\begin{align*}
2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma_{S_0^c}\|_1
&\le 3\lambda\|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + O_p\left((T^h \wedge k\ln k)/\sqrt{T}\right) c\|\sqrt{T}(\delta-\delta^0)\|_1 \\
&\le 3\lambda\left( \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + c\|\sqrt{T}(\delta-\delta^0)\|_1 \right) \\
&= 3\lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1, \tag{16}
\end{align*}
where $S_*$ is the index set combining $S_0$ and the indices for $\delta$, and $B_*$ is a diagonal matrix of size $|S_*|$ whose first $k$ elements are $c$ and the others are 1. Also, the first $k$ elements of $\theta-\theta^0$ do not lie within the span of $\delta^0$ due to the norm normalization. That is, if $\delta - \delta^0 = a\delta^0$ for some $a$, then $\delta = (1+a)\delta^0$, which in turn implies that $a = 0$ due to the norm normalization. This enables us to apply the compatibility condition A2, implying that there exists a sequence of positive constants $\phi_T$ such that
\[
\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1^2 \le \frac{\|DB_c(\theta-\theta^0)\|_2^2\, s_0(\lambda,\theta)}{T\phi_T^2} = \frac{(\theta-\theta^0)'B_cD'DB_c(\theta-\theta^0)}{T}\cdot\frac{s_0(\lambda,\theta)}{\phi_T^2}, \tag{17}
\]
where $s_0(\lambda,\theta)$ denotes the cardinality of the index set $S_*$ for $\theta$, i.e. $s_0 = |S_*|$. Then
\begin{align*}
2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\left\|B_c(\theta-\theta^0)\right\|_1
&= 2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma_{S_0^c}\|_1 + \lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1 \\
&\le 4\lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1 \\
&\le 4\lambda\sqrt{s_0(\lambda,\theta)}\,\|DB_c(\theta-\theta^0)\|_2/\left(\sqrt{T}\phi_T\right) \\
&\le \|DB_c(\theta-\theta^0)\|_2^2/T + 4\lambda^2 s_0(\lambda,\theta)/\phi_T^2.
\end{align*}
The first inequality comes from (16), the second inequality is due to (17) and the last inequality uses $4uv \le u^2 + 4v^2$. Rearranging the last inequality,
\[
\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\left\|B_c(\theta-\theta^0)\right\|_1 \le 4\lambda^2 s_0(\lambda,\delta)/\phi_T^2.
\]
When $\beta^0 = 0$, $\hat c = \|\hat\beta\|_1 = O_p\left(\lambda/\sqrt{T}\right)$, since
\[
\left\|B_c(\theta-\theta^0)\right\|_1 = c\sqrt{T}\|\delta-\delta^0\|_1 + \|\gamma-\gamma^0\|_1
\]
and $\hat\delta$ is not consistent due to the lack of identification. $\blacksquare$
Proof of Theorem 2. First, we refine the convergence rate of $\hat\delta$ under the additional assumption that $k$ is fixed. Recall that
\[
\frac{D'D}{T} = \begin{pmatrix} X'X/T^2 & X'G/(T\sqrt{T}) \\ G'X/(T\sqrt{T}) & G'G/T \end{pmatrix}
\]
and note that
\[
\left\| X'G/(T\sqrt{T}) \right\| = O_p\left(T^{-1/2+q}\right),
\]
where $q$ is any positive number, due to Lemma 2. Removing the positive term involving $G'G$ and substituting this and the convergence rate of $\theta$ that we just obtained into the first inequality in (16), we can deduce that
\[
c_1 T\|\delta-\delta^0\|_1^2 \le O_p(1)\|\delta-\delta^0\|_1 + O_p\left(s_0\lambda^2 T^{-1+q}\right).
\]
This yields the desired rate for $\hat\delta$. Let
\[
S_T(\gamma,\delta) = \frac{1}{T}\sum_{t=1}^T \left(y_t - g_t'\gamma - x_t'\delta\right)^2.
\]
It is clear from Theorem 1 that
\[
\sup_\delta \left| S_T(\hat\gamma,\delta) - S_T(\gamma^0,\delta) \right| = o_p\left(T^{-1}\right),
\]
where the supremum is taken over any $T^{-1}$ neighborhood of $\delta^0$. Therefore, let $S_T(\delta) = S_T(\gamma^0,\delta)$ and consider weak convergence of the following stochastic process:
\begin{align*}
\mathbb{D}_T(b) &= T\left( S_T\left(\delta^0 + T^{-1}b\right) - S_T\left(\delta^0\right) \right) \\
&= -\frac{2c^0}{T}\sum_{t=1}^T u_t x_t'b + c^0 b'\left(\frac{1}{T^2}\sum_{t=1}^T x_t x_t'\right) b\, c^0 \\
&\Rightarrow -2c^0 b'\left( \int_0^1 W(r)\,dU(r) + \Lambda \right) + c^0 b' \int_0^1 W(r)W(r)'\,dr\; b\, c^0 := \mathbb{D}(b)
\end{align*}
over any given compact space for $b$.

Furthermore, the estimator $\hat\delta$ is the minimizer of $S_T(\hat\gamma,\delta)$ under the constraint that $\|\delta\|_1 = 1$. This makes the parameter space for $b$ orthogonal to $\delta^0$, which prevents the quadratic term in $\mathbb{D}_T$ from degenerating. The precise formulation takes some elaboration. Consistency of $\hat\delta$ implies the sign-consistency of its nonzero elements as well. Thus,
\begin{align*}
0 &= T\left(\|\hat\delta\|_1 - \|\delta^0\|_1\right) \\
&= \sum_j \mathrm{sgn}\left(\delta^0_j\right) T\left(\hat\delta_j - \delta^0_j\right) + \sum_j T|\hat\delta_j|\, 1\{\delta^0_j = 0\} \\
&= \mathrm{sgn}\left(\delta^0\right)'b + \sum_j |b_j|\, 1\{\delta^0_j = 0\}
\end{align*}
with probability approaching 1. Therefore, we can conclude that
\[
T\left(\hat\delta - \delta^0\right) \xrightarrow{d} \arg\min_{b:\ \mathrm{sgn}(\delta^0)'b = -\sum_j |b_j|1\{\delta^0_j=0\}} \mathbb{D}(b)
\]
by the argmax continuous mapping theorem, since the limit process is quadratic over the parameter space. $\blacksquare$
A.2 Lemmata
Lemma 1 Let $\{z_{tj}\}$, $j = 1,\dots,p$, be centered $\alpha$-mixing sequences satisfying the following: for some positive $\gamma_1$, $\gamma_2$, $a$, and $b$,
\begin{align*}
\alpha(m) &\le \exp\left(-a m^{\gamma_1}\right), \\
P\left(|z_{tj}| > z\right) &\le H(z) := \exp\left(1 - (z/b)^{\gamma_2}\right) \quad \text{for all } j, \tag{18} \\
\gamma &= \left(\gamma_1^{-1} + \gamma_2^{-1}\right)^{-1} < 1, \\
\ln p &= O\left( T^{\gamma_1 \wedge \left(1 - 2\gamma_1(1-\gamma)/\gamma\right)} \right). \tag{19}
\end{align*}
Then
\[
\max_{j\le p} \left| \sum_{t=1}^T z_{tj} \right| = O_p\left( \sqrt{T}\ln p \right).
\]

Proof of Lemma 1. Consider the sums of truncated variables first. Let
\[
z^M_{tj} = z_{tj}\, 1\left\{|z_{tj}| \le M\right\} + M\, 1\left\{|z_{tj}| > M\right\}, \qquad M = \left(1 + T^{\gamma_1}\right)^{1/\gamma_2}.
\]
Due to Proposition 2 in Merlevède et al. (2011), we have
\[
E\exp\left( T^{-1/2}\sum_{t=1}^T \left(z^M_{tj} - Ez^M_{tj}\right) \right) = O(1)
\]
uniformly in $j$, and thus
\[
E\max_{j\le p} \left| T^{-1/2}\sum_{t=1}^T \left(z^M_{tj} - Ez^M_{tj}\right) \right| = O(\ln p). \tag{20}
\]
For the remainder term, note that
\begin{align*}
E\max_{j\le p} \sum_{t=1}^T \left| z_{tj} - z^M_{tj} + Ez^M_{tj} \right|
&\le Tp\, E\left| z_{tj} - z^M_{tj} + Ez^M_{tj} \right| \\
&\le 2Tp \int_M^\infty H(x)\,dx \\
&= O\left( Tp\exp\left(-T^{\gamma_1}\right) \right),
\end{align*}
where the last equality is obtained by bounding the integral by $MH(M) = \exp\left(-T^{\gamma_1}\right)$ in general, since $H(x)$ decays at an exponential rate.

Thus,
\begin{align*}
E\max_{j\le p} \left| \sum_{t=1}^T z_{tj} \right|
&\le E\max_{j\le p} \left| \sum_{t=1}^T \left( z^M_{tj} - Ez^M_{tj} \right) \right| + E\max_{j\le p} \left| \sum_{t=1}^T \left[ z_{tj} - \left( z^M_{tj} - Ez^M_{tj} \right) \right] \right| \\
&= O\left( \sqrt{T}\ln p \right) + O\left( Tp\exp\left(-T^{\gamma_1}\right) \right).
\end{align*}
Finally, recall that $\ln p = o\left(T^{\gamma_1}\right)$ by (19). $\blacksquare$

Next, we derive another maximal inequality involving integrated processes by means of martingale approximation (e.g. Hansen 1992).
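Before moving on, a quick numerical sanity check of the $\sqrt{T}\ln p$ rate in Lemma 1 (our own illustration, using i.i.d. Gaussian variables as a trivial special case of the mixing and tail assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 200
ratios = []
for T in (400, 1600, 6400):
    z = rng.standard_normal((T, p))          # iid Gaussian: a special mixing case
    max_sum = np.abs(z.sum(axis=0)).max()    # max_j |sum_t z_tj|
    ratios.append(max_sum / (np.sqrt(T) * np.log(p)))
# The ratios stay bounded as T grows, consistent with the O_p(sqrt(T) ln p) bound.
```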
Lemma 2 Let $\{x_{ti}\}$ be an integrated process such that $\Delta x_{ti}$ is centered and $x_{0i} = O_p(1)$. Let $\{\Delta x_{ti}\}$ and $\{z_{tj}\}$, $i = 1,\dots,k$, $j = 1,\dots,p$, be $\alpha$-mixing sequences satisfying the conditions in Lemma 1. Also let $k = O\left(\sqrt{T}\right)$ and $p = O\left(T^r\right)$ for some $r < \infty$. Then
\[
\max_{1\le i\le k}\max_{1\le j\le p} \left| \sum_{t=1}^T z_{tj}x_{ti} \right| = O_p\left( T^{1+d} \wedge Tkp \right)
\]
for any $d > 0$.

Proof of Lemma 2. When $k$ and $p$ are of order $O\left(\ln^c T\right)$ for some finite $c$, bounding the maximum by the summation yields the bound $Tkp$, since $\sum_{t=1}^T z_{tj}x_{ti} = O_p(T)$, which is standard; see e.g. Hansen (1992). Thus, we turn to the case where $k$ and $p$ grow at a polynomial rate of $T$ and tighten this bound.

We apply the martingale difference approximation to $z_{tj}$ following Hansen (1992). Let $\mathcal{F}_t$ denote the sigma field generated by all $z_{t-s,j}$ and $x_{t+1-s,i}$, $s, i, j = 1, 2, \dots$, and let $E_t(\cdot)$ be the conditional expectation based on $\mathcal{F}_t$. Define
\[
e_{tj} = \sum_{s=0}^\infty \left( E_t z_{t+s,j} - E_{t-1} z_{t+s,j} \right) \quad \text{and} \quad \zeta_{tj} = \sum_{s=1}^\infty E_t z_{t+s,j}.
\]
Then it is clear that
\[
z_{tj} = e_{tj} + \zeta_{t-1,j} - \zeta_{t,j}
\]
and $e_{tj}$ is a martingale difference sequence.

We begin with deriving a tail probability bound for the sum of a martingale difference array $\left\{\varepsilon_t = e_{tj}x_{ti}/\sqrt{T}\right\}$ for each $i$ and $j$, where we suppress the dependence of $\varepsilon_t$ on $i$ and $j$ to ease notation. Consider a truncated sequence
\[
\varepsilon^C_t = \varepsilon_t\, 1\left\{|\varepsilon_t| < C\right\} + C\, 1\left\{\varepsilon_t \ge C\right\} - C\, 1\left\{\varepsilon_t \le -C\right\}
\]
and its centered version
\[
\bar\varepsilon^C_t = \varepsilon^C_t - E\left( \varepsilon^C_t \mid \mathcal{F}_{t-1} \right).
\]
Then, by construction, $\bar\varepsilon^C_t$ is a martingale difference sequence. Furthermore,
\[
\left|\bar\varepsilon^C_t\right| \le \left|\varepsilon^C_t\right| + E\left( \left|\varepsilon^C_t\right| \mid \mathcal{F}_{t-1} \right) \le 2C
\]
by the triangle and Jensen's inequalities, and by the fact that $\left|\varepsilon^C_t\right| \le C$. Furthermore, let
\[
r^C_t = \varepsilon_t - \bar\varepsilon^C_t = \left( \varepsilon_t - \varepsilon^C_t \right) - E\left( \left( \varepsilon_t - \varepsilon^C_t \right) \mid \mathcal{F}_{t-1} \right),
\]
where the last equality follows from the fact that $\{\varepsilon_t\}$ is an MDS, and thus
\[
\varepsilon_t = \bar\varepsilon^C_t + r^C_t.
\]
By Azuma's (1967) inequality,
\[
\Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T \bar\varepsilon^C_t \right| \ge c \right\} \le 2\exp\left( -\frac{c^2}{4C^2} \right).
\]
On the other hand, since $r^C_t = 0$ if $|\varepsilon_t| \le C$, we have for any $c > 0$,
\[
\Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T r^C_t \right| > c \right\} \le \Pr\left\{ \max_t |\varepsilon_t| > C \right\}. \tag{21}
\]
However,
\begin{align*}
\Pr\left\{ \max_t |\varepsilon_t| > C \right\}
&\le \Pr\left\{ \max_t |e_{tj}| \max_t \left| \frac{x_{ti}}{\sqrt{T}} \right| > C \right\} \\
&\le \sum_t \Pr\left\{ |e_{tj}| > \sqrt{C} \right\} + \Pr\left\{ \max_t \left| \frac{x_{ti}}{\sqrt{T}} \right| > \sqrt{C} \right\} \\
&\le C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right),
\end{align*}
where the first term comes from the Markov inequality and the second and third terms are given by Merlevède et al.'s (2011) equation (1.11).

Next, we derive some moment bounds for the sums of $\zeta_{tj}$ and $e_{tj}$. By (A.1) of Hansen (1992), if $z_{tj}$ is $\alpha$-mixing of size $-qb/(q-b)$ with $E|z_{tj}|^q < \infty$,
\[
\left( E|\zeta_{tj}|^b \right)^{1/b} \le 6\sum_{k=1}^\infty \alpha_k^{1/b - 1/q} \left( E|z_{tj}|^q \right)^{1/q}.
\]
Further, by the triangle inequality,
\[
\left( E|e_{tj}|^q \right)^{1/q} = \left( E\left|z_{tj} + \zeta_{tj} - \zeta_{t-1,j}\right|^q \right)^{1/q} \le \left( E|z_{tj}|^q \right)^{1/q} + 2\left( E|\zeta_{tj}|^q \right)^{1/q}.
\]
Under our moment conditions and the mixing decay rate in Lemma 1, these quantities are bounded for any $b$ and $q$ such that $q > b$. Then, combining the preceding arguments, we conclude that for any $a_T$,
\begin{align*}
\Pr\left\{ \left| \frac{1}{a_T\sqrt{T}}\sum_{t=1}^T \varepsilon_t \right| \ge c \right\}
&\le \Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T \bar\varepsilon^C_t \right| \ge \frac{a_T c}{2} \right\} + \Pr\left\{ \max_t |\varepsilon_t| > C \right\} \\
&\le 2\exp\left( -\frac{a_T^2 c^2}{4C^2} \right) + C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right).
\end{align*}
Given this bound, we can deduce from the union bound that
\begin{align*}
\Pr\left\{ \max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{a_T T}\sum_{t=1}^T e_{tj}x_{ti} \right| > c_1 \right\}
&\le kp\,\Pr\left\{ \left| \frac{1}{a_T T}\sum_{t=1}^T e_{tj}x_{ti} \right| > c_1 \right\} \\
&\le kp\left( 2\exp\left( -\frac{a_T^2 c_1^2}{4C^2} \right) + C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right) \right).
\end{align*}
Since $E|e_{tj}|^{2q} < \infty$, we begin with choosing $C$ such that $kpTC^{-q} \to 0$. This choice of $C$ makes the last two terms degenerate, and for any $c_1 > 0$ the first term also degenerates as long as $a_T \ge b_T\left(kpT\right)^{\frac{1}{2q}}\ln 2kp$ for some $b_T \to \infty$. Since $q$ can be chosen arbitrarily big, we can set $a_T$ at any polynomial rate of $T$ to conclude that
\[
\max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=1}^T e_{tj}x_{ti} \right| = O_p\left( T^d \right) \tag{22}
\]
for any $d > 0$.

Next, we note that by the union bound,
\begin{align*}
\max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=1}^T \left( \zeta_{t-1,j} - \zeta_{t,j} \right)x_{ti} \right| \tag{23}
&\le \max_{1\le i\le k}\max_{1\le j\le p} \frac{\left|\zeta_{0,j}\right| |x_{1i}|}{T} + \max_{1\le i\le k}\max_{1\le j\le p} \frac{\left|\zeta_{T,j}\right| |x_{Ti}|}{T} \\
&\quad + \max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=2}^T \left( \zeta_{t-1,j}\Delta x_{ti} - E\zeta_{t-1,j}\Delta x_{ti} \right) \right| + \max_{1\le i\le k}\max_{1\le j\le p} \left| E\zeta_{t-1,j}\Delta x_{ti} \right| \\
&= O_p\left( \frac{k\log p}{\sqrt{T}} \right) + O_p\left( \frac{\log kp}{\sqrt{T}} \right) + O(1),
\end{align*}
since $\sup_t |x_t|/\sqrt{T} = O_p(1)$, $\max_j \left|\zeta_{0,j}\right| = O_p\left(\sqrt{\log p}\right)$, and we can use Lemma 1 with $\{|xy| > c\} \subset \left\{|x| > \sqrt{c}\right\} \cup \left\{|y| > \sqrt{c}\right\}$. Note that the bound in (22) dominates the one in (23) for $k = O\left(\sqrt{T}\right)$ and any $d > 0$. $\blacksquare$