High-dimensional predictive regression in the presence of cointegration*
Bonsoo Koo Heather Anderson
Monash University Monash University
Myung Hwan Seo† Wenying Yao
Seoul National University Deakin University
Abstract
We propose a Least Absolute Shrinkage and Selection Operator (LASSO) estimator of a
predictive regression in which stock returns are conditioned on a large set of lagged co-
variates, some of which are highly persistent and potentially cointegrated. We establish
the asymptotic properties of the proposed LASSO estimator and validate our theoreti-
cal findings using simulation studies. The application of this proposed LASSO approach
to forecasting stock returns suggests that a cointegrating relationship among the persis-
tent predictors leads to a significant improvement in the prediction of stock returns over
various competing forecasting methods with respect to mean squared error.
Keywords: Cointegration, High-dimensional predictive regression, LASSO, Return pre-
dictability.
JEL: C13, C22, G12, G17
*The authors acknowledge financial support from the Australian Research Council Discovery Grant DP150104292.
†Corresponding author. His work was supported by the Promising-Pioneering Researcher Program through Seoul National University (SNU) and from the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-0405-20180026). Author contact details are [email protected], [email protected], [email protected], [email protected]
1 Introduction
Return predictability has been of perennial interest to financial practitioners and scholars
within the economics and finance professions since the early work of Dow (1920). To this
end, empirical studies employ linear predictive regression models, as illustrated by Stam-
baugh (1999), Binsbergen et al. (2010) and the references therein. Predictive regression allows
forecasters to focus on the prediction of asset returns conditioned on a set of one-period
lagged predictors. In a typical example of a predictive regression, the dependent variable is
the rate of return on stocks, bonds or the spot rate of exchange, and the predictors usually
consist of financial ratios and macroeconomic variables.
The literature that has investigated the predictability of stock returns is very large, and
it has proposed many economically plausible predictive variables and documented their em-
pirical forecast performance. Koijen and Nieuwerburgh (2011) and Rapach and Zhou (2013)
provide recent surveys of this literature. However, despite the plethora of possible predictors
that have been suggested, there is no clear-cut evidence as to whether any of these variables
are successful predictors. Indeed, many studies have concluded that there is very little dif-
ference between predictions derived from a predictive regression and the historical mean of
stock returns. Welch and Goyal (2008) provided a detailed study of stock return predictability
based on many predictors suggested in the literature, and concluded that "a healthy skepticism
is appropriate when it comes to predicting the equity premium, at least as of 2006".
Least squares predictive regression has provided the main estimation tool for empirical
efforts to forecast stock returns, but this involves several econometric issues. See Phillips
(2015) for a summary of these issues, together with relevant econometric theory. Among
these issues, we restrict our attention to two specific problems, which we believe could be
tackled by using a different estimation approach. Firstly, the martingale difference features of
excess returns are not readily compatible with the high persistence in many of the predictors
suggested by the empirical literature. It is well-documented that excess returns are stationary
and that financial ratios are often highly persistent and potentially non-stationary processes.
The regression of a stationary variable on a non-stationary variable provides an example of
an unbalanced regression, and such a regression is difficult to reconcile with the assumptions
required for conventional estimation.1 This so-called unbalanced regression has the potential
to induce substantial small sample estimation bias, and it also renders invalid many of the usual
types of statistical inference conducted in standard regression contexts. See Campbell
and Yogo (2006), Phillips (2014) and Phillips (2015) for more details.

1Although the issues associated with unbalanced regression can be tackled by incorporating an accommodating error structure, such as (fractionally) integrated noise, this error structure is not readily compatible with making predictions.
Secondly, the literature does not provide guidance on which predictors are to be included
or excluded from predictive regressions. Given a large pool of covariates available for return
predictability, selecting an appropriate subset of relevant covariates can improve the precision
of prediction. However, the choice of the right predictors is extremely difficult in practice,
and precision is often sacrificed for parsimony, especially if there are many predictors but
the sample size used for estimation is small. Quite commonly, a single predictor or only a
handful of predictors is used.
Against this backdrop, our paper studies the use of the highly celebrated LASSO (orig-
inally proposed by Tibshirani, 1996) to estimate a predictive regression when some of the
predictors are persistent and it is possible that some of these persistent predictors are cointe-
grated. We are particularly interested in (i) the LASSO’s capability in model selection in this
context; and (ii) the properties of LASSO estimation when potentially cointegrated predictors
are involved, especially if such properties can circumvent issues associated with unbalanced
regression. As will become clear in our empirical study, there is at least one cointegrating
relationship between persistent predictors for stock returns, which lends strong support to
an estimation approach that allows for cointegration. In a setting in which the target vari-
able is stationary and there are many persistent predictors, the model selection
mechanism of the LASSO, which eliminates variables with small coefficients, plays a role with re-
spect to excluding persistent variables that do not contribute to a cointegrating relationship.
This leads to "balance" between the time series properties of the target variable and a linear
combination of the regressors, and tractable prediction errors. It is also possible that the in-
corporation of cointegration in this context of predicting returns will improve forecasts, as
Christoffersen and Diebold (1998) have noted in other settings.
Although predictive regression enables us to analyze the explanatory power of each indi-
vidual predictor, our primary objective is to improve the overall prediction of stock returns
given a set of predictors. To do this, we focus on the out-of-sample forecasting performance
of the set of predictors selected via LASSO estimation. In addition, we investigate the large
sample properties of the LASSO estimation of the linear predictive regression. We show the
consistency of the LASSO estimator of coefficients on both stationary and nonstationary pre-
dictors within the linear predictive regression framework. Furthermore, in the presence of
cointegration, we derive the limiting distribution of the cointegrating vector under certain
regularity conditions, in contrast with the usual LASSO estimation, in which case the limit-
ing distribution of the estimator is not readily available. To the best of our knowledge, this
paper is the first work that establishes the limiting distribution for the LASSO estimator of a
cointegrating vector.
The remainder of this paper is organized as follows. We briefly discuss related literature
in Section 2. Section 3 introduces predictive regression models, and Section 4 discusses the
LASSO selection of relevant predictors within the model framework. In particular, Section 4.1
demonstrates the consistency of the proposed LASSO estimators, and Section 4.2 derives the
limiting distribution of the LASSO estimator of a cointegrating vector. Simulation results are
discussed in Section 5. Also, a real application of our methodology based on the Welch and
Goyal (2008) data set is given in Section 6. Section 7 concludes. All proofs of theorems and
lemmas are found in Appendix A.
2 Relevant literature
Most of the literature on forecasting stock returns has focused on whether a certain variable
has predictive power. The present-value relationship between stock prices and their cash
flows suggests a wide array of financial ratios and other financial variables as possible pre-
dictors. Documented examples include the earnings-price ratio (Campbell and Shiller, 1988),
the yield spread (Fama and Schwert, 1977; Campbell, 1987), the book-to-market ratio (Fama
and French, 1993; Pontiff and Schall, 1998), the dividend-payout ratio (Lamont, 1998) and
the short interest rate (Rapach et al., 2016). Various macroeconomic variables such as the
consumption-wealth ratio (Lettau and Ludvigson, 2001) and the investment rate (Cochrane,
1991) have been considered as well. More recently, Neely et al. (2014) have suggested the use
of technical indicators.
An issue that arises when returns are conditioned on just one predictor is that prediction
can be biased if other relevant predictors are omitted. This issue might partially explain
the failure of single predictor models, although the use of several predictors in a standard
regression set-up has rarely improved out-of-sample forecasts. Welch and Goyal (2008) noted
that despite the poor forecast performance of individual predictors over long periods of time,
there are episodes during which an individual predictor has power. The latter authors used
their findings to claim that most (return forecasting) models are unstable.
Recent research on forecasting stock returns has moved away from single predictor mod-
els, placing more emphasis on using the large pool of predictors suggested in the literature
and drawing on modern econometric forecasting techniques. This research comes with an ac-
knowledgment that many predictors are potentially useful in forecasting stock returns in one
way or another, but the high dimension of the set of predictors can pose a significant challenge
due to the degrees-of-freedom problem (Ludvigson and Ng, 2007), and the temporal insta-
bility of the predictive power of any single predictor renders forecasting extremely difficult
(Rossi, 2013). Forecast combination methods (see Timmermann, 2006) are now popular, and
have been used by Rapach et al. (2010) to draw on information from a large set of predictors
while alleviating the effects of structural instability. Other forecasts that rely on a multivariate
approach include the dynamic factor approach (Ludvigson and Ng, 2007), Bayesian averag-
ing (Cremers, 2002), and use of a system of equations involving vector autoregressive models
(Pastor and Stambaugh, 2009).
Our LASSO approach in the presence of cointegration is an addition to this strand of lit-
erature that uses multiple predictors. Since the seminal paper of Tibshirani (1996), LASSO
estimation has gained popularity in various fields of study. See Fan and Lv (2010) and Tibshi-
rani (2011) for recent developments. The wide variety of predictors available for forecasting
stock returns necessitates a wise selection of predictors that capture the quintessential dy-
namics of future returns. Moreover, the unstable nature of their prediction power at different
time periods renders covariate selection even more crucial.
A few studies have investigated the use of LASSO in the presence of non-stationary vari-
ables. Caner and Knight (2013) proposed a unit root test involving Bridge estimators in order
to differentiate unit root from stationary alternatives. Liao and Phillips (2015) applied the
adaptive LASSO approach to a cointegrated system in order to estimate a vector error correction model (VECM) with an unknown cointegrating rank and unknown transient lag order.
Further, Wilms and Croux (2016) have explored the forecasting ability of a VECM using an approach based on an $\ell_1$-penalized log-likelihood. Quite recently, Kock (2016) further discussed
the oracle efficiency of the adaptive LASSO in stationary and non-stationary autoregressions.
These papers differ from ours in many aspects. First of all, instead of using autoregressive
models, our paper focuses on predictive regressions in which stationary and non-stationary
predictors co-exist. Secondly, we allow for the presence of cointegration and have a different
objective from the other studies in the literature, which leads to a different approach.
Although LASSO has been widely employed in various fields of study, the financial literature has not made extensive use of this powerful model selection and estimation tool to
address stock return predictability. Only two recent papers have emerged. Buncic and Tis-
chhauser (2016) and Li and Tsiakas (2016) suggest that LASSO contributes to better forecasts
of stock returns because it enables researchers to select the most relevant factors among a
broad set of predictors. Both papers are closely related to the work of Ludvigson and Ng
(2007) and Neely et al. (2014). However, each falls short in explaining the theoretical ground
that underlies their empirical results, and neither takes the impact of persistent regressors on
the predictions into account. Moreover, both Buncic and Tischhauser (2016) and Li and Tsi-
akas (2016) impose sign constraints on their estimated coefficients to improve their forecasts.
In contrast, we do not require any constraint on the sign of coefficients, and we explicitly ac-
count for persistence in some of our predictors. The main difference of our work is that we
provide theoretical credence to the use of the LASSO in a linear predictive setting
when persistent predictors are taken into consideration.
A fundamental problem associated with forecasting stock returns is model uncertainty,
with respect to which predictors are relevant, and when. Timmermann (2006) pointed out
that there are two schools of thought on how to improve forecasting when the underlying
model is uncertain. The first suggests the use of forecast combinations, which has been
actively investigated (see Elliott et al., 2013; Koo and Seo, 2015, among many others), and
used by Rapach et al. (2010) as mentioned above. The other school of thought considers
the entire set of possible predictors in a unified framework, allowing the data to select the
best combination of predictors in an attempt to search for the best forecasting model. Most
literature based on $\ell_1$-regularization, including the LASSO, falls under this second approach.2
The LASSO does not explicitly address the unstable nature of stock return predictability,
although it is possible to apply the automated selection of the LASSO at different points in time
by rolling an estimation window through the sample period, allowing the LASSO to
pick different predictors for each estimation window and to provide different coefficient
estimates for each set of selected predictors. We adopt this approach in our empirical
analysis of stock return predictability in Section 6.
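Concretely, the rolling-window procedure can be sketched as follows. This is a minimal illustration using scikit-learn's generic `Lasso` (not the authors' implementation); the window length, penalty level `alpha`, and variable names are arbitrary choices for exposition:

```python
import numpy as np
from sklearn.linear_model import Lasso

def rolling_lasso_forecasts(y, X, window, alpha=0.1):
    """One-step-ahead forecasts from a rolling estimation window.

    X is assumed to hold the (already lagged) predictors, so row t of X
    is what is known when forecasting y[t]. The LASSO is refitted at each
    forecast origin, so both the selected predictors and their estimated
    coefficients may change from window to window."""
    preds = []
    for t in range(window, len(y)):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[t - window:t], y[t - window:t])  # rolling estimation window
        preds.append(model.predict(X[t:t + 1])[0])   # forecast for period t
    return np.array(preds)

# tiny synthetic demonstration
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((120, 8))
y_demo = 0.5 * X_demo[:, 0] + 0.1 * rng.standard_normal(120)
forecasts = rolling_lasso_forecasts(y_demo, X_demo, window=60, alpha=0.05)
```

In an empirical application the columns of `X` would be the lagged financial and macroeconomic predictors, and the tuning parameter would typically be re-chosen within each window.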
2The former approach pools forecasts whereas the latter approach pools information. See Timmermann (2006) for the pros and cons of these two different approaches.
3 Econometric framework: Predictive regression
Our econometric framework is a linear predictive regression in which a large set of condition-
ing variables is available, but the prediction power of each variable could be unstable. Some
of the conditioning variables could be non-stationary, in particular integrated of order 1, i.e.
I(1), and might be cointegrated. Our model is framed in the context of stock return
predictability because forecasting stock returns epitomizes the issues we are attempting to
address: (i) a possibly large set of conditioning variables; (ii) some of them are quite persis-
tent; and (iii) the prediction power of each conditioning variable is unstable. Let us consider
the following linear predictive regression model:
$$ y_t = x_{t-1,T}'\, b_T + z_{t-1}'\, a_T + u_t, $$
where $b_T \in \mathbb{R}^k$ for $k \to \infty$ with $x_{t,T}$ being I(1), and $a_T \in \mathbb{R}^p$ for $p \to \infty$ with $z_t$ being I(0). I($q$) denotes an integrated process of order $q$. The above array structure allows the true values of the coefficients and the number of covariates to vary as the sample size $T$ changes. In what follows, however, we shall use the simpler notation that dispenses with the $T$ subscript,
$$ y_t = x_{t-1}'\, b + z_{t-1}'\, a + u_t. \qquad (1) $$
The high persistence of $x_t$ is compatible with the property of many commonly used stock
return predictors, such as the dividend-price ratio and the earnings-price ratio.3 We use $z_t$ in
model (1) to represent the other group of non-persistent return predictors, for example the
long term bond rate and inflation. See Neely et al. (2014) and Ludvigson and Ng (2007) for a
pool of plausible financial and macroeconomic predictors.
Model (1) is in line with the most recent literature on stock return predictability. It is a
multi-factor model employed in the empirical asset pricing literature. The primary objective
is to estimate the conditional mean of $y_t$, given a mixture of possibly high-dimensional stationary and non-stationary conditioning variables available at time $t-1$. In the context of the
present paper, the prediction objective yt is the rate of return on a broad market index over
the risk free rate, commonly referred to as the equity premium. It is commonly accepted that
the equity premium displays martingale difference features, whereas many return predictors
are quite persistent. The stationarity feature of the martingale difference sequence on the left-hand side of model (1) hardly matches with the highly persistent predictors on the right-hand side. Furthermore, a close examination of the historical data of various financial ratios reveals a strong co-movement among these persistent variables, which lends strong credence to the existence of at least one cointegrating relationship among them.

3Welch and Goyal (2008) showed that many other return predictors, including the treasury bill rate, the long term bond yield and the term spread, are also highly persistent, with a first order autocorrelation above 0.99.
Without any loss of generality in the forecasting context, we assume that there is at least
one cointegrating relationship within $x_t$. Specifically, the cointegrating vector $d$ and the corresponding constant $c$ are such that $x_t' b = x_t' d\, c$.4 For identification when it comes to estimating the true parameters $d^0$ and $c^0$, we define $b^0 = c^0 d^0$, where $d^0$ is the standardized cointegrating relation of unit $\ell_1$ norm and $c^0$ is the response of the outcome variable to the cointegrated term. $b^0 = 0$ if and only if $c^0 = 0$, under which $d^0$ is not identified.
4 LASSO selection of predictors
This section examines the automated selection of predictors given many covariates whose
cardinality can be larger than the sample size but sparsity is assumed. In particular, we
propose LASSO estimation in which an $\ell_1$-type penalty is applied. Note that in matrix form,
model (1) can be written as
$$ Y = Xb + Za + u. \qquad (2) $$
The LASSO estimator for $(b', a')'$ can be obtained by
$$ (\hat b', \hat a')' = \arg\min_{b,\, a}\; \frac{\|Y - Xb - Za\|_2^2}{T} + \lambda\left(\|b\|_1 + \|a\|_1\right), \qquad (3) $$
where $\|\cdot\|_m$ denotes an $\ell_m$ norm, $\lambda$ is a tuning parameter, and $T$ is the sample size.
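To make the objective concrete, here is a minimal numpy sketch (not the authors' code) of a coordinate-descent solver for (3); the function names and tolerance settings are illustrative, and the columns of $X$ and $Z$ are stacked into a single design matrix since both coefficient blocks receive the same $\ell_1$ penalty:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(V, y, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for (1/T)||y - V beta||_2^2 + lam * ||beta||_1,
    matching the scaling of the objective in equation (3); V stacks the
    columns of X and Z, beta stacks b and a."""
    T, p = V.shape
    beta = np.zeros(p)
    curv = (V ** 2).sum(axis=0) / T        # per-coordinate curvature v_j'v_j / T
    r = y.copy()                            # current residual y - V beta
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r += V[:, j] * beta[j]          # put coordinate j back into the residual
            c = V[:, j] @ r / T
            beta[j] = soft_threshold(c, lam / 2.0) / curv[j]
            r -= V[:, j] * beta[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

On an orthonormal design with $V'V/T = I$ this reduces to soft-thresholding the least-squares coefficients at $\lambda/2$, which provides a quick correctness check on the implementation.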
Equation (3) describes the usual LASSO objective function, which might, but need not, incorporate a cointegrating relationship. We reparameterize the first portion of this objective
to investigate how the presence of a cointegrating relationship affects the LASSO estimation
4The interpretation of $d$ is subtle in the case of multiple cointegrating relationships. In such a case, $d$ might be the optimal linear combination of all the cointegrating vectors for predicting the outcome. Importantly, this paper does not seek to identify the number of cointegrating relationships nor to estimate each cointegrating vector separately. For these, one needs to set up a cointegration system, for which we refer to Johansen (1995).
approach. With $c = \|b\|_1$, $d = b/c$, and the true parameters $b^0 = c^0 d^0$ and $a^0$,
\begin{align*}
Y - Xb - Za &= u - X(b - b^0) - Z(a - a^0) \\
&= u - X(d - d^0)c - X d^0 (c - c^0) - Z(a - a^0) \\
&= u - \frac{X}{\sqrt{T}}\, \sqrt{T}(d - d^0)c - W(c - c^0) - Z(a - a^0) \\
&= u - D B_c (\theta - \theta^0),
\end{align*}
where $\theta = (\sqrt{T}\, d', g')'$ with $g = (c, a')'$, $W = X d^0$, $D = (X/\sqrt{T}, G)$ with $G = (W, Z)$, and
$B_c$ denotes a diagonal matrix whose first $k$ diagonal elements are $c$ and the others are 1.
Likewise, $B_{\hat c}$ denotes the matrix replacing $c$ in $B_c$ with $\hat c$.
In addition, the $\ell_1$ penalty terms can be rewritten as
$$ \lambda\left(\|b\|_1 + \|a\|_1\right) = \lambda\left(c + \|a\|_1\right) = \lambda \|g\|_1, $$
with the constraint $\|d\|_1 = 1$ holding by construction.
Consequently, in the presence of a cointegrating relationship, the infeasible LASSO estimator for $\theta$ can be represented by
$$ \hat\theta = \arg\min_{\theta:\, \|d\|_1 = 1}\; \frac{\|u - D B_c(\theta - \theta^0)\|_2^2}{T} + \lambda \|g\|_1. \qquad (4) $$
Note that (4) is a constrained minimisation problem, given that the estimator for $\theta$ is obtained
under the constraint that $\|d\|_1 = 1$. Note, also, that $\lambda\|g\|_1$ can be replaced by $\lambda\|\theta\|_1$ without
changing the minimisation problem in this set-up. This formulation separates the estimation of the cointegrating relation from the effect of the error correction term, and shows that the
cointegrating relation is estimated without shrinkage. This is useful for inference.
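The algebra behind this reparameterization can be verified numerically. The following sketch, with arbitrary illustrative parameter values, checks that the residual $Y - Xb - Za$ coincides with $u - DB_c(\theta - \theta^0)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, p = 50, 3, 4
X = np.cumsum(rng.standard_normal((T, k)), axis=0)   # persistent (I(1)) regressors
Z = rng.standard_normal((T, p))                      # stationary regressors
u = rng.standard_normal(T)

b0 = np.array([0.6, -0.4, 0.0])                      # true b^0 = c^0 d^0
c0 = np.abs(b0).sum()
d0 = b0 / c0                                         # unit l1-norm cointegrating vector
a0 = np.array([0.5, 0.0, 0.0, 0.0])
Y = X @ b0 + Z @ a0 + u

b = np.array([0.5, -0.3, 0.1])                       # arbitrary candidate (b, a)
c = np.abs(b).sum()
d = b / c
a = np.array([0.4, 0.1, 0.0, 0.0])

lhs = Y - X @ b - Z @ a                              # residual, original parameterization

W = X @ d0                                           # error correction term W = X d^0
D = np.hstack([X / np.sqrt(T), W[:, None], Z])       # D = (X/sqrt(T), W, Z)
theta  = np.r_[np.sqrt(T) * d,  c,  a]               # theta  = (sqrt(T) d', c, a')'
theta0 = np.r_[np.sqrt(T) * d0, c0, a0]
Bc = np.diag(np.r_[np.full(k, c), np.ones(1 + p)])   # first k diagonal entries are c

rhs = u - D @ Bc @ (theta - theta0)                  # residual, reparameterized form
assert np.allclose(lhs, rhs)
```

The two residual vectors agree for any candidate $(b, a)$, which is exactly the display above written in matrix form.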
In the subsection that follows, we establish the large sample properties associated with the
proposed LASSO estimator, $\hat\theta = (\sqrt{T}\hat d', \hat g')'$. We start with the consistency of the proposed
estimator and its convergence rate, and proceed to the large sample distribution of the LASSO
estimator for the cointegrating vector $d$. All proofs for our theoretical results are provided
in Appendix A. We need the following set of regularity conditions to derive the statistical
properties of LASSO estimation of $\theta$ in (4).
Assumptions
A1 Each element of the vector $g$ belongs to a compact set and $\|d\|_1 = 1$ for all $d$.
A2 Let $s_0 = |S_0|$, let $\theta_S$ be the subvector of $\theta$ whose elements are in the index set $S$, and let $\Sigma = D'D/T$ be the properly scaled Gram matrix. Then, there exists a sequence of positive constants $\phi_T$ such that for all $\theta$ whose first $k$ elements do not lie in the linear span of $d^0$ and $\|\theta_{S_0^c}\|_1 \le 3\|\theta_{S_0}\|_1$, the following holds with probability approaching one:
$$ \|\theta_{S_0}\|_1^2 \le \left(\theta' \Sigma\, \theta\right) s_0 / \phi_T^2. $$
A3 Let the nonstationary predictors be $x_t = \sum_{k=1}^{[Tt]} v_k$. Then $\{v_t u_t\}$ and $\{z_{tj} u_t\}_{j=1,\dots,p}$ are centered $\alpha$-mixing sequences such that for some positive $\gamma_1$, $\gamma_2$, $a$, and $b$,
$$ \alpha(m) \le \exp\left(-a m^{\gamma_1}\right), \qquad P\left(|z_{tj}| > z\right) \le H(z) := \exp\left(1 - (z/b)^{\gamma_2}\right) \text{ for all } j, \qquad (5) $$
$$ \gamma := \left(\gamma_1^{-1} + \gamma_2^{-1}\right)^{-1} < 1, \qquad \ln p = O\left(T^{\gamma_1 \wedge \left(1 - 2\gamma_1(1-\gamma)/\gamma\right)}\right). \qquad (6) $$
Assumption A1 is concerned with restrictions on the parameter space and a normalization
restriction on the cointegrating vector. Compatibility conditions such as A2 are common in the
large sample theory for LASSO estimation, and their purpose is to place weak restrictions on
the design matrix of regressors so as to bound the prediction and parameter estimation errors
in terms of the $\ell_1$ norm. Here, the format of $D$ accounts for the different dimensions of the
persistent and stationary regressors, and given the asymptotic orthogonality between these
two sets of regressors, one can think of our compatibility condition as providing separate
conditions for each set. The Gram matrix of the persistent regressors does not converge to a
deterministic matrix, and therefore the compatibility factor $\phi_T$ is allowed to vary over $T$ and
converge to zero. In the special case in which $k$ is fixed, we may let it decay at a $\log T$ rate.
This is reflected in the deviation bound in Theorem 1.
Assumption A3 is a technical condition required for Lemmas 1 and 2. Notably, Assumption A3 implies Assumption 1 in Hansen (1992), which ensures weak convergence of the sum
of the product of $x_t$ and $u_t$ in (1) to a stochastic integral. We can relax the tail condition for
the stationary regressors $z_t$ at the expense of slower rates of convergence.
It is worth noting that these conditions are quite general, accounting for the time series
features of the data and the model in Section 3. For instance, we do not require that the error
process is a martingale difference sequence. It suffices that $\{v_t\}$ and $\{u_t\}$ are strong-mixing.
4.1 Consistency and rate of convergence
Theorem 1 Let $p = O(T^r)$ for some $r < \infty$ and $k = O(\sqrt{T})$. Then, under the regularity conditions A1-A3 specified above,
$$ \frac{\|D B_{\hat c}(\hat\theta - \theta^0)\|_2^2}{T} + \|B_{\hat c}(\hat\theta - \theta^0)\|_1 = O_p\left(\frac{s_0 \lambda}{\phi_T}\right), \qquad (7) $$
where $s_0$ is the number of non-zero elements in $\theta^0$ and $\lambda^{-1} = o\left(\left(\sqrt{T}/\ln p\right) \wedge \left(\sqrt{T}/\left(T^{\eta} \wedge k \ln k\right)\right)\right)$ for any $\eta > 0$. Furthermore, if $(\min_j |\theta^0_{S_0,j}|)^{-1} = o((s_0\lambda/\phi_T)^{-1})$, then the LASSO selects all relevant predictors with probability approaching one.
Proof. See Appendix A.
Theorem 1 establishes deviation bounds for the prediction error and the parameter estimation error in terms of the $\ell_1$-norm. It permits the numbers of both stationary and non-stationary variables to diverge. However, our bound is more sensitive to the number $k$ of
nonstationary variables, in the sense that $k$ increases the bound more quickly than does $p$.
Note that when $k$ is fixed or increases only at a logarithmic rate, the deviation bound becomes $s_0/(\sqrt{T}\phi_T)$ up to a logarithmic factor. Furthermore, when $k$ is allowed to increase to
infinity faster, the compatibility sequence $\phi_T$ may need to decay to zero at a faster rate.
Theorem 1 ensures that the LASSO can serve as a predictor-selection tool in a linear predictive
regression, in that $P\left(S_0 \subset \hat S\right) \to 1$, where $S_0 := \{j : \theta^0_j \ne 0\}$ and $\hat S$ is the corresponding
estimate. This is so even when there are two different types of covariates, either stationary
($z_t$) or nonstationary ($x_t$), and the dimension of $z_t$ is very large, as long as the dimension of $x_t$ increases at most at a logarithmic rate. This property is essential for stock return
predictability because the number of plausible return predictors is very large and they could be
nonstationary or stationary. LASSO is an automated method that selects relevant predictors
under a sparsity assumption so as to achieve parsimony without sacrificing precision.5 This
becomes important when the number of observations is relatively small compared to the
number of available predictors.
For the case where the dimension of $x_t$ increases to infinity, Theorem 1 shows that the
convergence rate for $\hat\theta$ is slower than $O_p\left(\sqrt{s_0 \ln p / T}\right)$, which is the usual rate of convergence
5Here, sparsity means that although there might be many potential predictors, the number of truly relevant predictors is small.
of the LASSO estimator for linear models when the data are independent (see Bühlmann and
van de Geer, 2011, page 106). The slower rate of convergence occurs in our case because we
allow for quite general dependence. However, it is worth noting that due to the presence
of a cointegrating relationship, the rate of convergence for $\hat d$ is faster than the usual rate of
convergence for LASSO estimation, and can achieve the oracle rate of $T^{-1}$ under some proper
conditions, as shown below.
4.2 Large sample distribution of the LASSO estimator for the cointegrating vector
This section discusses the large sample distribution of the proposed estimator for the cointegrating vector, $\hat d$, when its dimension $k$ is fixed. From (3), when $\hat c$ and $\hat a$ are given, we can
focus on the estimator for $d$, the coefficient associated with $X\hat c$. The objective function (3)
becomes
$$ \hat d = \arg\min_{d:\, \|d\|_1 = 1}\; \frac{\|\tilde Y - \tilde X d\|_2^2}{T}, \qquad (8) $$
where $\tilde Y = Y - Z\hat a$ and $\tilde X = X\hat c$.
It is worth noting that there is no shrinkage on the cointegrating vector, due to the restriction on the parameter space, $\|d\|_1 = 1$. This feature is helpful for better prediction and
forecasting because the collective explanatory power coming from the persistent predictors constituting the cointegrating vector is not compromised by LASSO estimation. This is relevant
for the return predictive regression, in which it is highly likely that there exists a cointegrating relationship within the persistent predictors, given its seemingly unbalanced nature. As can
be seen from (8), we can derive the large sample distribution of $\hat d$ based on a constrained
least squares with a nonstandard norm restriction using the $\|\cdot\|_1$ norm. The large sample
distribution of the appropriately scaled $\hat d - d^0$ can be obtained as the minimizer of a Gaussian
process.
Theorem 2 Let
$$ D(b) = -2 c^0 b' \left( \int_0^1 W(r)\, dU(r) + \Lambda \right) + |c^0|^2\, b' \int_0^1 W(r) W(r)'\, dr\; b, $$
with
$$ \Lambda = \lim_{T \to \infty} \frac{1}{T} \sum_{i=1}^{T-1} \sum_{j=1}^{i-1} E(v_{i-j} u_i), $$
where $v_t = x_t - x_{t-1}$ and $(W(\cdot), U(\cdot))'$ is the weak limit of the partial sum process of $(v_t, u_t)'$. Also
let $k$ be fixed, $p = O(T)$, $\lambda^{-1} = o\left(\sqrt{T}/\ln p\right)$, and $s_0 \lambda = O(T^{-q})$ for some positive $q$. Then, under
the regularity conditions A1-A3,
$$ T(\hat d - d^0) \xrightarrow{d} \arg\min_{b:\; \mathrm{sgn}(d^0)' b = -\sum_j |b_j| 1\{d^0_j = 0\}} D(b). \qquad (9) $$
Proof. See Appendix A.
Theorem 2 allows us to make inference on the cointegrating vector $d$. Theorem 2 is nonstandard because there is a restriction on the parameter space. If it were not for this restriction, we could obtain a closed-form expression by deriving the minimizer of the limit
objective function $D(b)$ in Theorem 2. Nevertheless, we can simulate the limit distribution
for the purpose of making inference.6 We do not elaborate on inference here because our primary objective is forecast improvement rather than inference on the cointegrating
vector.
We illustrate the finite sample performance of the asymptotic approximation in Theorem
2 by considering the following design:
$$ y_t = a + b_1 x_{1t-1} + b_2 x_{2t-1} + u_t, $$
where $a = 0.1$, $b = (b_1, b_2)' = (1, -1)'$, $\mathrm{var}(v_{it}) = \sigma^2_{v_i} = 1$ for $i = 1, 2$ where $v_{it} = x_{it} - x_{it-1}$,
and $\mathrm{var}(u_t) = \sigma^2_u = 1$. We consider two cases, in which $T = 1000$ and $T = 10000$, with 10000
repetitions. Without loss of generality, we assume there is no stationary regressor. Note that
the constrained limit objective function we consider is
$$ D(b) = -2 b' \left[ \int_0^1 W(r)\, dU(r) \right] + b' \int_0^1 W(r) W(r)'\, dr\; b $$
subject to $\mathrm{sgn}(b^0)' b = -\sum_j |b_j| 1\{b^0_j = 0\}$, where this $b$ is $c$ times the $b$ in Theorem 2. The minimizer
of the above reparameterized limit objective function with a constraint yields the limit distribution of $T(\hat b - b^0)$. Following our design, the constraint becomes $b_1 = b_2 = b$. This implies
6The limit distribution cannot be tabulated as it is not asymptotically pivotal, but it can be simulated. In particular, the restriction on $b$ can be imposed by replacing $d^0_j$ with $\tilde d_j = \hat d_j 1\{|\hat d_j| \ge c_T\}$ for some $c_T \to 0$ and $T c_T \to \infty$. Due to the super-consistent convergence rate of $\hat d$, this event is equivalent to a restriction on the parameter space with probability approaching one.
that
$$ D(b) = -2b \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{bmatrix} \int_0^1 W_1(r)\, dU(r) \\[4pt] \int_0^1 W_2(r)\, dU(r) \end{bmatrix} + b^2 \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{bmatrix} \int_0^1 W_1^2(r)\, dr & \int_0^1 W_1(r) W_2(r)\, dr \\[4pt] \int_0^1 W_1(r) W_2(r)\, dr & \int_0^1 W_2^2(r)\, dr \end{bmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix}. $$
The minimizer of the above can then be obtained in closed form, and is
$$ b^\ast = \frac{\int_0^1 \left[W_1(r) + W_2(r)\right] dU(r)}{\sum_{i=1}^2 \sum_{j=1}^2 \int_0^1 W_i(r) W_j(r)\, dr} \begin{pmatrix} 1 \\ 1 \end{pmatrix}. $$
Hence,
$$ T(\hat d - d^0) \xrightarrow{d} b^\ast / c^0, $$
where $\hat d = \hat b / \hat c$ with $\hat b$ obtained from a LASSO estimation, $\hat c = |\hat b_1| + |\hat b_2|$, and $d^0 = b^0/c^0$. To
simulate the sample distribution of $b^\ast$, note that
$$ T^{-2} \sum_{t=1}^{T} x_{it} x_{jt} \xrightarrow{d} \int_0^1 W_i(r) W_j(r)\, dr, \qquad T^{-1} \sum_{t=1}^{T} x_{it} u_t \xrightarrow{d} \int_0^1 W_i(r)\, dU(r). $$
Then,
$$ b^\ast \approx \frac{T^{-1} \sum_{t=1}^{T} (x_{1t} + x_{2t}) u_t}{T^{-2} \sum_{i=1}^2 \sum_{j=1}^2 \sum_{t=1}^{T} x_{it} x_{jt}}. $$
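The sample approximation above is straightforward to simulate. A minimal sketch follows, using i.i.d. standard normal innovations purely for illustration (the theory allows general mixing sequences; with $v_t$ and $u_t$ independent, the one-sided long-run covariance $\Lambda$ of Theorem 2 is zero, matching the design above). It draws from the distribution of the scalar factor in $b^\ast$:

```python
import numpy as np

def simulate_bstar(T=1000, n_rep=2000, seed=0):
    """Monte Carlo draws of the scalar factor in b* via its sample analogue:
    numerator   T^{-1} sum_t (x_{1t} + x_{2t}) u_t,
    denominator T^{-2} sum_i sum_j sum_t x_{it} x_{jt}
              = T^{-2} sum_t (x_{1t} + x_{2t})^2."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_rep)
    for r in range(n_rep):
        v = rng.standard_normal((T, 2))   # innovations of the two I(1) predictors
        u = rng.standard_normal(T)        # regression error, independent of v
        x = np.cumsum(v, axis=0)          # x_{it} = sum_{s<=t} v_{is}: random walks
        s = x[:, 0] + x[:, 1]
        draws[r] = (s @ u / T) / (s @ s / T**2)
    return draws
```

Histograms of these draws approximate the (mixed normal, hence non-pivotal) limit distribution that Figures 1 and 2 compare against the sampling distribution of the LASSO estimator.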
Meanwhile, for the sample distribution of $(\hat d - d^0)$, we employ the LASSO estimation with a
fixed tuning parameter, $\lambda = \frac{\ln p}{10\sqrt{T}}$. This choice reflects the rate of convergence of the estimated
cointegrating coefficients when selecting the set of predictors. Under this setting, Figures 1
and 2 provide histograms of the limiting distribution and the sample distributions of
(the elements of) $(\hat d - d^0)$ for 1,000 and 10,000 observations, where $\hat d$ is the LASSO estimator,
as well as the associated QQ-plots. We can see that the sampling distribution of $(\hat d - d^0)$ is
centered around 0, and although the QQ-plots show some differences between the sampling
and limiting distributions around the tails when $T = 1{,}000$, these differences are barely
apparent when $T = 10{,}000$.
Figure 1: Limiting distribution and sample distribution of (d - d^0) and their QQ-plot, T = 1000
a The first panel depicts the limiting distribution of (d_1 - d_1^0), whereas the second panel depicts its sample distribution.
b The limiting and sample distributions of (d_2 - d_2^0) are identical to those for (d_1 - d_1^0) by construction.

Figure 2: Limiting distribution and sample distribution of (d - d^0) and their QQ-plot, T = 10000
5 Simulation study
In this section, we use simulation to evaluate the ability of the LASSO estimator to select a set
of predictors that leads to out-of-sample forecasting gains. We start with a baseline predictive
regression that incorporates a mixture of just four persistent and six non-persistent predictors,
and then we expand the number of non-persistent predictors out to two hundred, which
allows us to consider high-dimensional predictive regressions as well. Our data generating
processes (DGPs) include a mixture of stationary and non-stationary cointegrated predictors
as in (1), and although (1) and the other equations in our text omit a constant to simplify our
exposition, we include a constant in all our simulated data and estimated equations.
We consider four estimation sample sizes (each generated sample follows a burn-in of
1000 observations) and ten different forecast methods. Given an estimation sample of size T,
we focus on forecasting yT+1 using different methods, including our proposed LASSO based
methods. We evaluate each forecast strategy in terms of the one-step-ahead forecast errors
associated with yT+1.
5.1 Simulation of our baseline DGP
The baseline DGP that we study incorporates several well known characteristics of financial
and macroeconomic data. These include high persistence in some (but not all) individual
series, interdependencies between many variables, and rich system-wide dynamics.
The specification associated with the persistent variables xt in this DGP is based on a four
variable vector error correction representation given by (10a), in which the first right-hand
side vector represents a 4 ⇥ 1 loading matrix (1 is the cointegrating rank in this setting) while
the second right-hand vector is the cointegrating vector. Hence, xt is a 4-dimensional I(1) pro-
cess with cointegrating rank 1 and from (10a), the cointegrating vector is d = (1, �0.5, 0, 0)0.
The non-persistent predictors zt in this DGP follow a 6-dimensional stationary VAR(2)
process with a complicated dependence structure. This structure, given in (10b), is partly
taken from an estimated VAR(2) model of US macroeconomic data (see Martin et al., 2012, p.
485, example 13.19), so it is representative of data obtained from a real economy.
The coefficients in the overall predictive regression model (1) are given in (10c), and these
are set to be b = cd = 0.3d and a = (�0.1, 0, 0.02, 0.4, 0, �0.2)0. All of the error terms in (10)
(i.e. (ut, #x,t, #z,t)) are i.i.d N (0, I). Putting these together, the full specification of our baseline
DGP is then
\Delta x_t = \begin{pmatrix} 0 \\ 0.3 \\ 0 \\ 0.2 \end{pmatrix} \begin{pmatrix} 1 & -0.5 & 0 & 0 \end{pmatrix} x_{t-1} + \varepsilon_{x,t},   (10a)

z_t = \begin{pmatrix}
0.4 & 0 & 0 & 0 & 0 & 0 \\
0 & 1.31 & 0.29 & 0.12 & 0.04 & 0 \\
0 & -0.21 & 1.25 & -0.24 & 0.04 & 0 \\
0 & 0.07 & 0.03 & 1.16 & 0.01 & 0 \\
0 & 0.08 & 0.27 & -0.07 & 1.25 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.4
\end{pmatrix} z_{t-1}
+ \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & -0.35 & -0.28 & -0.07 & -0.02 & 0 \\
0 & 0.19 & -0.26 & 0.24 & -0.05 & 0 \\
0 & -0.07 & -0.02 & -0.16 & 0.01 & 0 \\
0 & -0.13 & -0.23 & 0.03 & -0.31 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} z_{t-2} + \varepsilon_{z,t},   (10b)

y_t = 0.15 + x_{t-1}' \begin{pmatrix} 0.3 \\ -0.15 \\ 0 \\ 0 \end{pmatrix} + z_{t-1}' \begin{pmatrix} -0.1 \\ 0 \\ 0.02 \\ 0.4 \\ 0 \\ -0.2 \end{pmatrix} + u_t.   (10c)
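As a concrete illustration, the baseline DGP in (10) can be simulated in a few lines. The numpy sketch below is our own (the function name and the burn-in argument are ours, not the authors'); it follows (10a)-(10c) with i.i.d. standard normal errors and a burn-in, as described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Coefficients of the baseline DGP in (10): alpha_load and delta are the
# loading and cointegrating vectors in (10a); A1 and A2 are the VAR(2)
# matrices in (10b); b and a are the predictive coefficients in (10c).
alpha_load = np.array([0.0, 0.3, 0.0, 0.2])
delta = np.array([1.0, -0.5, 0.0, 0.0])
b = 0.3 * delta
a = np.array([-0.1, 0.0, 0.02, 0.4, 0.0, -0.2])
A1 = np.array([
    [0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.31, 0.29, 0.12, 0.04, 0.0],
    [0.0, -0.21, 1.25, -0.24, 0.04, 0.0],
    [0.0, 0.07, 0.03, 1.16, 0.01, 0.0],
    [0.0, 0.08, 0.27, -0.07, 1.25, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4],
])
A2 = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, -0.35, -0.28, -0.07, -0.02, 0.0],
    [0.0, 0.19, -0.26, 0.24, -0.05, 0.0],
    [0.0, -0.07, -0.02, -0.16, 0.01, 0.0],
    [0.0, -0.13, -0.23, 0.03, -0.31, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

def simulate_dgp(T, burn=1000):
    """Simulate (y, x, z) of length T from the baseline DGP (10)."""
    n = T + burn
    x = np.zeros((n, 4))
    z = np.zeros((n, 6))
    y = np.zeros(n)
    for t in range(2, n):
        # (10a): error-correction step, Delta x_t = alpha * (delta' x_{t-1}) + eps
        x[t] = x[t - 1] + alpha_load * (delta @ x[t - 1]) + rng.standard_normal(4)
        # (10b): stationary VAR(2) for the non-persistent predictors
        z[t] = A1 @ z[t - 1] + A2 @ z[t - 2] + rng.standard_normal(6)
        # (10c): predictive regression with intercept 0.15
        y[t] = 0.15 + b @ x[t - 1] + a @ z[t - 1] + rng.standard_normal()
    return y[burn:], x[burn:], z[burn:]
```

Note that the cointegrating combination delta' x_t follows an AR(1) with coefficient 1 + delta' alpha_load = 0.85, so it is mean-reverting while each component of x_t is I(1).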
5.2 Forecast methods
The first of our ten forecast strategies uses OLS and all predictors to estimate the standard linear predictive regression model in (1), given by

y_t = x_{t-1}' b + z_{t-1}' a + u_t,

to obtain \hat y_{T+1}, while our second strategy predicts y_{T+1} by using its historical average, i.e. \hat y_{T+1} = \frac{1}{T} \sum_{i=1}^{T} y_i. The third method predicts y_{T+1} via the estimation of an AR(1) model:

y_t = r\, y_{t-1} + u_t.   (11)

Then, since log-price series are often modelled as random walks, we also use a random walk without drift as the fourth forecast method for comparison. We expect this model to show inferior forecast performance when predicting returns because it restricts r = 0 in (11).
The fifth method is based on principal component analysis (PCA), as suggested by Stock
and Watson (2002) when there are many predictors. Due to the different statistical properties
of I(1) and I(0) series, we extract the first 4 principal components from the (standardised) I(0)
series, and then use them together with the 4 I(1) predictors in the forecasting exercise.
For the sixth method, we employ a forecast combination a la Timmermann (2006). More
specifically, we compute the combined forecast, yT+1, as an equally weighted average of all
forecasts that are separately obtained from distinctive linear predictive regression models that
each condition on just one predictor:
\hat y_{T+1} = \frac{1}{M} \sum_{m=1}^{M} \hat y_{T+1}^{m},   (12)

where M = k + p is the total number of predictors, and \hat y_{T+1}^{m} is a forecast based on a bivariate
linear predictive regression that conditions on just the m-th predictor. We include forecast
combination as one of our competing forecast strategies because it has been widely recognized
as a way to improve forecast performance relative to a forecast based on a single model
(Timmermann, 2006; Rapach et al., 2010). Similar to the use of LASSO, the combined forecast
approach alleviates the impact of model uncertainty and parameter instability.
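A minimal sketch of this combination scheme (our own function and conventions, not the authors' code; row t of X is assumed to hold the predictors dated t, so X[:-1] predicts y[1:] and X[-1] produces the forecast of y_{T+1}):

```python
import numpy as np

def combination_forecast(y, X):
    """Equal-weight forecast combination as in (12): fit one bivariate
    predictive regression per predictor (a constant plus a single lagged
    predictor) by OLS, forecast y_{T+1} from each, and average.
    """
    y = np.asarray(y, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    T, M = X.shape
    forecasts = np.empty(M)
    for m in range(M):
        # regress y_t on a constant and the m-th predictor lagged once
        Z = np.column_stack([np.ones(T - 1), X[:-1, m]])
        coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
        forecasts[m] = coef[0] + coef[1] * X[-1, m]
    return forecasts.mean()
```

Because each component model conditions on a single predictor, no individual regression can overfit badly, and averaging dampens the noise in any one model's estimates.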
The next three methods are different variants of LASSO. We apply a standard LASSO
method with the following objective function:
\arg\min_{b,a} \Big\{ \frac{1}{T} \sum_{t=1}^{T} \big( y_t - x_{t-1}' b - z_{t-1}' a \big)^2 + \lambda \|b\|_1 + \lambda \|a\|_1 \Big\},   (13)
where \|\cdot\|_1 is the \ell_1-norm. The penalty term \lambda can be estimated using the cross-validation algorithm of Qian et al. (2013), which minimizes the cross-validated error loss (LASSOmin); we call the resulting penalty \lambda_{min}. An alternative way to estimate \lambda is via forward chaining, which leads to \lambda_{fwd}. Here, we work with a grid of \lambda values and the first 0.9T observations in the generated series of T observations to estimate a set of coefficients for each \lambda. We then use each set of coefficients together with a rolling window of 0.9T observations to obtain one-step-ahead forecasts for the remaining 0.1T data points, and choose \lambda_{fwd} to minimise the associated forecast mean squared error (FMSE). We also consider setting \lambda = \ln p / (10\sqrt{T}) to obtain a penalty that we call \lambda_{fix}. This latter choice of \lambda is based on a metric associated with the set of non-persistent variables z_t (see Lemma 1), and its potential advantage is that it takes the rate of convergence of the estimated cointegrating coefficients into account when selecting the set of predictors.
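For concreteness, here is a numpy-only sketch of the objective in (13) and the forward-chaining selection. It is our own simplification, not the authors' implementation: coefficients for each lambda are fit once on the first 0.9T observations rather than re-estimated on a rolling window, and the intercept and standardization steps are omitted.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO for the objective in (13):
    (1/T) * RSS + lam * ||coef||_1 (no intercept; a minimal sketch)."""
    T, p = X.shape
    coef = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding predictor j
            r = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ r / T
            # soft-thresholding update for coordinate j
            coef[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return coef

def forward_chaining_lambda(X, y, grid):
    """Pick lambda_fwd: fit on the first 90% of the sample for each
    lambda on the grid, score one-step-ahead squared errors on the
    remaining 10%, and keep the lambda with the smallest FMSE."""
    split = int(0.9 * len(y))
    fmses = []
    for lam in grid:
        coef = lasso_cd(X[:split], y[:split], lam)
        errs = y[split:] - X[split:] @ coef
        fmses.append(np.mean(errs ** 2))
    return grid[int(np.argmin(fmses))]
```

The soft-threshold level lam / 2 comes from differentiating (1/T) * RSS, whose coordinate-wise gradient carries a factor of 2.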
Our final forecasting technique is based on the elastic net by Zou and Hastie (2005), which
is often used in macroeconomic or financial applications when regressors are highly correlated. The elastic net mixes the \ell_1 and \ell_2 norms in the penalty term,

\arg\min_{b,a} \Big\{ \frac{1}{T} \sum_{t=1}^{T} \big( y_t - x_{t-1}' b - z_{t-1}' a \big)^2 + \lambda \big[ \eta ( \|b\|_1 + \|a\|_1 ) + (1 - \eta) ( \|b\|_2^2 + \|a\|_2^2 ) \big] \Big\},   (14)

and in our case we set the mixing parameter \eta to 0.5.
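Relative to the LASSO, the coordinate-descent update for (14) changes in only two places: the soft-threshold level is scaled by eta, and the \ell_2 part enters the denominator. A numpy sketch under the same simplifying assumptions as before (our own code; no intercept, standardized regressors):

```python
import numpy as np

def enet_cd(X, y, lam, eta=0.5, n_iter=200):
    """Coordinate descent for the elastic-net objective in (14):
    (1/T) * RSS + lam * [eta * ||coef||_1 + (1 - eta) * ||coef||_2^2]."""
    T, p = X.shape
    coef = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ coef + X[:, j] * coef[j]   # partial residual
            rho = X[:, j] @ r / T
            # l1 part: soft threshold at lam*eta/2; l2 part: ridge shrinkage
            coef[j] = (np.sign(rho) * max(abs(rho) - lam * eta / 2.0, 0.0)
                       / (col_sq[j] + lam * (1.0 - eta)))
    return coef
```

With eta = 1 this reduces to the LASSO update, and with eta = 0 it reduces to ridge regression; the ridge term in the denominator is what stabilizes the estimates when regressors are highly correlated.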
We generate 5000 samples of T = 100 to estimate the parameters that are needed (together
with the sample data) to predict yT+1 for each of the ten forecast methods. This generates
50000 predictions for estimation samples of size 100, and we use these, together with yT+1
to calculate 5000 forecast errors for each of the ten forecast methods. We then repeat this for
T = 200, 400, and 1000, so that a total of four estimation sample sizes are considered. Note
that we do not encounter any estimation problems when we generate data from our baseline
DGP, but later, when we start generating high-dimensional predictive DGPs in Section 5.3 and
work with small estimation samples, we will not have sufficient observations to implement
the first of our forecast methods which is based on OLS.
We compare the forecast performance for different methods by considering two measures.
The first of these is the forecast mean squared error (FMSE), while the second is a measure of out-of-sample relative fit (RFIT) that is closely related to the out-of-sample R^2 measure used by Welch and Goyal (2008). The former is defined as the expected squared difference between the actual realization and the one-step-ahead forecast: FMSE := E[y_{T+1} - \hat y_{T+1}]^2. The latter is defined as RFIT := 1 - FMSE_i / FMSE_{OLS} for each forecasting method i. We use the OLS-based forecast method as our benchmark when calculating the out-of-sample RFIT because this method is typically used in empirical applications of predictive regressions.7
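Both measures are one-liners in code; a small sketch with our own function names:

```python
import numpy as np

def fmse(y_true, y_fcst):
    """Forecast mean squared error over a set of one-step-ahead forecasts."""
    y_true, y_fcst = np.asarray(y_true, float), np.asarray(y_fcst, float)
    return float(np.mean((y_true - y_fcst) ** 2))

def rfit(fmse_method, fmse_ols):
    """Out-of-sample relative fit against the OLS benchmark:
    positive values mean the method beats OLS."""
    return 1.0 - fmse_method / fmse_ols
```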
Table 1 presents FMSE and out-of-sample RFIT for the forecasting strategies based on
5000 repetitions. Overall, the LASSO methods dominate, with LASSOfix having lower FMSE
and higher RFIT than all other forecast methods when the sample size is small, and LASSOmin
playing this role when the sample size becomes larger. We can infer that the bias induced
by adopting the LASSO is small, whereas the reduction in the estimation variance domi-
nates. The LASSOmin, LASSOfix and elastic net methods produce smaller FMSEs than the
benchmark OLS forecasts across the four different sample sizes (with only one exception),
and hence they produce positive out-of-sample RFIT values. On the other hand, the other
methods always lead to inferior forecasting performance compared to the OLS benchmark.
The bias induced by using the historical average is particularly costly, and it is interesting to
note that despite considerable empirical evidence that forecast combinations are often helpful,
they do not appear to help in this setting. One possible reason for this is that the simulated
DGP used here has a stable dependence structure, whereas an important practical advantage
of forecast combination is that it can ameliorate the effects of structural change. The PCA4
and the forward chaining LASSO methods generate FMSEs that are of similar magnitudes as
the three best performing forecast strategies, but they do not show any advantage.
We conjecture that the LASSO approach renders the presence of a cointegrating relation-
ship conspicuous due to the shrinkage imposed on the irrelevant non-stationary predictors,
and this, together with the small but remaining nonzero coefficients on stationary predictors
leads to better forecast performance. Since OLS can also “pick up” cointegrating relation-
ships as the number of observations increases, the forecasting performance based on the OLS
estimation of the predictive regression becomes closer to that of the LASSO for large samples.
This is confirmed in Table 1. When we have a small sample size T = 100, the improvement
in the FMSE of LASSOmin over the OLS benchmark, as measured by the out-of-sample RFIT,
is 2.96%. However, as the sample size increases to T = 1000, the out-of-sample RFIT of LASSOmin reduces to 0.03%.

7 All code for our simulations in Section 5 and our forecasts in Section 6 can be found at https://sites.google.com/site/myunghseo/research
5.3 Alternative DGPs
One clear advantage of the LASSO approach is that the dimension of the stationary variables
zt can be arbitrarily large, and indeed can be larger than the sample size T. In the previous
DGP we set k = 4 and p = 6. In this section we investigate how the forecasting performance of
LASSO compares with alternative forecasting strategies as the dimension p of the stationary
predictors increases. Starting from p = 6, we expand the dimension of the non-persistent
regressors to p = 12, 20, 50, 100 and 200. Our DGP structure is similar to that in Section 5.1
in that our DGP for the nonstationary variables still follows (10a), and the stationary variables
are still generated by VAR(2) processes, but the sets of coefficients for these processes are now
randomly drawn (so that our analysis no longer focuses on just one VAR). The coefficients in
the associated loading vector a are also drawn randomly, in a way that ensures that the set of
elements of zt that actually contribute to the generation of yt is sparse. We provide more detail
on the stationary components of the DGPs below. We use the same forecast methodologies
as in the previous section (except when OLS is infeasible) and as before, we base our forecast
comparison on the FMSE associated with one-step-ahead forecasts.
The details relating to the stationary components of our DGPs are that we consider 100
different VAR(2)s for each sample size, and work with 1000 replications of each.8 The coeffi-
cients for each of these 100 VAR(2)s are drawn randomly from a normal distribution with a
zero mean and variance of less than 0.2, and we have made sure that each drawn VAR(2) pro-
cess is stationary, by checking that its maximum eigenvalue is less than 0.8 in absolute value.
We impose weak sparsity by randomly setting half of the a coefficients equal to zero and
drawing the remaining non-zero coefficients in a from a uniform distribution over [�0.3, 0.3].
The non-zero coefficients are small, making the dependence of yt on zt relatively weak. We
should emphasize that this design is not favorable to the LASSO due to the many smallish
non-zero coefficients. The average FMSE produced across 1000 replications of the 100 DGPs
are tabulated in Table 2, and the smallest FMSE in each row appears in bold letters with an
associated tick.
The results observed previously are largely preserved when we increase the dimension
of the stationary predictors zt. The three forecasting methods based on the LASSO and the
elastic net produce more accurate forecasts in terms of FMSE, with LASSOfix performing the
8These simulated VAR(2) processes are available upon request.
Table 1: Forecasting performance of different forecast strategies

FMSE
T     OLS     Average  AR(1)   r-walk  PCA4    F-comb  LASSOfwd  LASSOmin  LASSOfix  el-net
100   1.2693  7.1731   2.2369  2.3820  3.9902  4.4995  1.3382    1.2316    1.2132✓   1.2399
200   1.1051  7.9231   2.1883  2.3349  4.3106  5.2708  1.1568    1.0952    1.0923✓   1.0987
400   1.0183  7.8535   2.1278  2.3002  4.0072  5.4067  1.0305    1.0133✓   1.0136    1.0150
1000  0.9977  8.3772   2.1469  2.3593  4.1956  5.9519  1.0045    0.9974✓   1.0013    0.9975

Out-of-sample RFIT
100   --      -4.6514  -0.7624  -0.8767  -2.1437  -2.5450  -0.0544  0.0296   0.0442✓  0.0232
200   --      -6.1696  -0.9802  -1.1129  -2.9007  -3.7696  -0.0468  0.0089   0.0115✓  0.0057
400   --      -6.7126  -1.0896  -1.2590  -2.9353  -4.3098  -0.0120  0.0049✓  0.0046   0.0032
1000  --      -7.3969  -1.1519  -1.3649  -3.2055  -4.9659  -0.0069  0.0003✓  -0.0037  0.0002

a Each column denotes one forecasting strategy, from left to right: (1) OLS model; (2) Historical average; (3) AR(1) model; (4) Random walk model; (5) Principal component analysis (PCA); (6) Forecast combination; (7) Forward chaining LASSO; (8) Minimized cross-validated LASSO; (9) Fixed penalty LASSO; (10) Elastic net. A tick (✓) marks the best-performing method in each row.
b The out-of-sample RFIT is defined as RFIT = 1 - FMSE_i/FMSE_OLS for each model i.
Table 2: Comparison of FMSE for different forecast strategies as the number of stationary predictors (p) grows

p    T     OLS     Average  AR(1)   r-walk   PCA4    F-comb  LASSOfwd  LASSOmin  LASSOfix  el-net
6    100   1.1637  1.5259   1.4629  2.3513   1.1875  1.4714  1.2162    1.1473    1.1401✓   1.1542
     200   1.0749  1.5311   1.4586  2.3548   1.1245  1.4887  1.1275    1.0706    1.0696✓   1.0735
     400   1.0335  1.5221   1.4487  2.3667   1.0876  1.4880  1.0550    1.0325✓   1.0338    1.0335
     1000  1.0122  1.5294   1.4554  2.3775   1.0731  1.5011  1.0214    1.0120✓   1.0154    1.0122
12   100   1.2595  1.6997   1.6415  2.6931   1.3559  1.6387  1.3022    1.2241    1.2132✓   1.2355
     200   1.1139  1.6919   1.6247  2.6817   1.2803  1.6412  1.1637    1.1029✓   1.1035    1.1086
     400   1.0522  1.6998   1.6315  2.7119   1.2388  1.6542  1.0764    1.0501✓   1.0554    1.0523
     1000  1.0280  1.6877   1.6135  2.6796   1.2175  1.6466  1.0344    1.0273✓   1.0350    1.0279
20   100   1.3842  1.8769   1.8299  3.0623   1.5670  1.8196  1.4226    1.3117    1.2990✓   1.3288
     200   1.1598  1.8876   1.8278  3.0391   1.4609  1.8359  1.2093    1.1417✓   1.1447    1.1508
     400   1.0801  1.8829   1.8181  3.0380   1.4161  1.8360  1.0930    1.0752✓   1.0818    1.0789
     1000  1.0294  1.8742   1.8137  3.0688   1.3906  1.8301  1.0385    1.0283✓   1.0398    1.0293
50   100   2.3141  2.8287   2.8047  4.8699   2.6074  2.7609  1.9903    1.7607    1.7151✓   1.7767
     200   1.3949  2.8073   2.7709  4.8875   2.4013  2.7430  1.4050    1.3159✓   1.3243    1.3420
     400   1.1767  2.8061   2.7636  4.8898   2.3206  2.7434  1.1687    1.1553✓   1.1685    1.1667
     1000  1.0659  2.7956   2.7439  4.8790   2.3009  2.7347  1.0691    1.0621✓   1.0808    1.0651
100  100   N/A     3.9130   3.9167  7.0734   3.8034  3.8560  3.1791    2.9096    2.7334✓   2.8456
     200   2.1674  3.8956   3.8782  7.1138   3.6052  3.8407  1.8934    1.7428✓   1.7469    1.7892
     400   1.3697  3.8947   3.8688  7.0424   3.4771  3.8406  1.3182    1.2996✓   1.3277    1.3263
     1000  1.1208  3.9227   3.8854  7.0685   3.4284  3.8688  1.1180    1.1106✓   1.1409    1.1179
200  100   N/A     6.7914   6.8466  12.8705  7.0913  6.7299  6.0837    5.9320    5.8183    5.7287✓
     200   N/A     6.8392   6.8440  12.7566  6.6639  6.7768  3.7295    3.4627    3.3619✓   3.4517
     400   2.0915  6.8380   6.8446  12.8389  6.4734  6.7762  1.7807    1.7538✓   1.8223    1.8143
     1000  1.2713  6.7071   6.6963  12.8750  6.3050  6.6462  1.2351    1.2323✓   1.2770    1.2492

A tick (✓) marks the smallest FMSE in each row.
best when the sample size is small and LASSOmin performing the best when the sample size
becomes larger. Interestingly, for p = 200 and T = 100, the elastic net dominates the LASSO
methods in producing more accurate forecasts. The percentage improvement of the LASSOmin
specification relative to the OLS benchmark is always larger when the sample size is small.
Notably, as p increases, all forecast strategies considered here generate larger forecast errors.
The historical average, AR(1) model, random walk, and principal components model are all
mis-specified and hence lead to much larger forecast errors.
6 Stock return predictability
The predictability of the equity premium has been a controversial subject in the finance lit-
erature, generating considerable debate about whether predictive models of excess returns
can outperform the historical average of these returns, and if so, which predictors lead to
good out of sample performance, and over which forecast periods. Welch and Goyal (2008)
conducted a particularly influential study of this topic, providing an extensive review of the
usefulness and stability of many commonly used predictors. They concluded that return pre-
dictability is very weak, both in terms of in-sample fit and out-of-sample forecasting. Their
paper inspired a burst of more recent work on this topic, that has not only proposed new
predictors, but has also exploited innovative ways of extracting information from the existing
set of predictors (see, for example Rapach and Zhou, 2013; Neely et al., 2014; Buncic and
Tischhauser, 2016). In this section we use an extension of the data set examined in Welch and
Goyal (2008), and we demonstrate how LASSO estimation can improve the predictability of
stock returns by allowing for cointegration between persistent predictors.
6.1 Data
We follow the mainstream literature on forecasting the equity premium and use the same
fourteen predictors as Welch and Goyal (2008). An updated version of their data is posted
on the website (http://www.hec.unil.ch/agoyal/), and we use monthly data from January
1945 to December 2012 (816 observations) from this site in the work that follows. Table 3
presents a brief description of the predictors as well as their estimated first-order autocor-
relation coefficients over the entire sample period. The dependent variable of interest is the
equity premium yt, which is defined as the difference between the compounded return on
the S&P 500 index and the three month Treasury bill rate, and most of the predictors are fi-
23
Table 3: Variable definitions and estimated first order autocorrelation coefficients
Predictor Definition Auto.corr(1)d/p dividend price ratio: difference between the log of dividends
and the log of prices0.9940
d/y dividend yield: difference between the log of dividends and thelog of lagged prices
0.9940
e/p earning price ratio: difference between the log of earning andthe log of prices
0.9908
d/e dividend payout ratio: difference between the log of dividendsand the log of earnings
0.9854
b/m book-to-market ratio: ratio of book value to market value forthe Dow Jones Industrial Average
0.9926
ntis net equity expansion: ratio of 12-month moving sums of netissues by NYSE listed stocks over the total end-of-year marketcapitalization of NYSE stocks
0.9757
tbl Treasury bill rates: 3-month Treasury bill rates 0.9891lty long-term yield: long-term government bond yield 0.9935tms term spread: difference between the long term bond yield and
the Treasury bill rate0.9565
d f y default yield spread: difference between Moody’s BAA andAAA-rated corporate bond yields
0.9720
d f r default return spread: difference between the returns oflong-term corporate and government bonds
0.9726
svar stock variance: sum of squared daily returns on S&P500 index 0.4716in f l inflation: CPI inflation for all urban consumers 0.5268ltr long-term return: return of long term government bonds 0.0476
nancial or macroeconomic variables. As shown in Table 3, the majority of these predictors are
highly persistent, with eleven out of fourteen variables having a first order autocorrelation
coefficient that is higher than 0.95. The remaining three variables (the stock variance, inflation
and the long-term government bond return) have little persistence, as does the equity premium (EQ),
which has a first order autocorrelation coefficient of only 0.1494.
We use twenty year rolling windows to allow for possible parameter instability during
our sample, and since our data starts from 1945:01, our out-of-sample predictions of the
equity premium start from 1965:01. We treat the results for the twenty year windows as our
benchmark, but we also conduct some robustness analysis using ten and thirty year rolling
windows. Figure S1 in our online supplement plots the estimated AR(1) coefficients for each
predictor and the equity premium (EQ) using twenty year rolling windows, and it is not only
evident that many predictors exhibit a high persistence level over the entire sample, but also
that the time series dependence structure changes throughout the forecasting period.
We expect to observe cointegrating relationships between the persistent variables, given
the features of the predictors reported in Table 3 and illustrated in Figure S1. Further, fi-
nancial considerations suggest that the dividend price ratio (d/p) and the dividend yield
(d/y) should have a stable long-run relationship. If such cointegrating relationships exist,
they would demonstrate our theoretical framework in Section 3 very nicely, so we study this
aspect of our data next. Specifically, we conduct the tests proposed by Johansen (1991, 1995)
and use the model selection technique proposed by Poskitt (2000) to assess whether there is
cointegration between some of our persistent variables. We exclude the dividend payout ratio
(d/e) and long-term yield (lty) when estimating models to avoid multi-collinearity, making
our cointegration analysis relate to the nine remaining persistent predictors.
Figure 3: Test for cointegration using twenty year rolling windows
[Four panels plot the selected cointegrating rank over 1965-2012 for the Johansen and Poskitt procedures: (a) all 9 persistent variables; (b) exclude d/p and d/y; (c) d/p, d/y, e/p and tbl; (d) d/p and d/y.]
Figure 3 shows the number of cointegrating relationships (i.e. cointegrating ranks) that
are found in each of the twenty year windows in our sample.9 The blue solid line denotes the
cointegrating rank selected using the Johansen test (Johansen, 1991, 1995), and the red dashed
line denotes the selected cointegrating rank when the non-parametric method of Poskitt (2000)
is used. The Johansen test is based on estimating a vector error correction model and it is one
of the most widely used tests for cointegration among multiple time series. However, it is well
known that this test often suffers from the curse of dimensionality (Gonzalo and Pitarakis,
1999), and in particular this test tends to overestimate the cointegrating rank. This is reflected
in Figure 3. Here, we see that the Johansen test always finds at least three cointegrating
relationships between the nine persistent variables, whereas the more conservative Poskitt (2000) procedure only finds a cointegrating rank of one or two. It is interesting to note that when only d/p and d/y are included in the analysis, the Poskitt (2000) procedure always finds that these two variables are cointegrated.

9 The test outcomes based on ten and thirty year windows are qualitatively similar, and the relevant results are available upon request.
6.2 Forecast comparison
We use the same set of forecasting methods as in the simulation study in Section 5, excepting
the method based on principal components since we now have only three I(0) predictors. The
benchmark forecast is based on OLS estimation of (1). Note that all forecasts (including
the historical averages) use just the observations in the relevant rolling window instead of the
entire past history from 1945:01. One-step-ahead FMSEs are tabulated in Table 4, and out-of-sample RFIT measures are provided in Table S1 in our online supplement.
Following Welch and Goyal (2008), we first consider forecasts based on the generic “kitchen
sink” model, where all of the predictors listed in Table 3 are included except for d/e and lty.
In addition, we base predictions on a simplified model that contains the three non-persistent
predictors (svar, infl and ltr) and two persistent predictors, d/p and d/y, which appear from
Figure 3 (d) to be cointegrated.
Tables 4 and S1 show that in terms of overall performance, the forecasts produced by
forecast combination, the three LASSO methods and the elastic net produce the
smallest one-step-ahead FMSE and hence the highest out-of-sample RFIT, over the three
different window sizes. The improvement in out-of-sample RFIT can be as large as 13.7%
over the forecasts based on the kitchen sink model. However, the LASSOfix model sometimes
fails to generate better predictions than the OLS benchmark when the persistent predictors
other than d/p and d/y are excluded from the model. Forecast combination produces the
smallest FMSE using ten year and thirty year windows in the kitchen sink case but it fails to
improve forecast accuracy upon the OLS benchmark using twenty and thirty year windows in
the simplified case that includes only two persistent predictors. In contrast to our simulated
forecasts, the elastic net performs slightly better than the LASSOmin for this simple example
based on real data, perhaps reflecting higher correlations between the actual I(0) variables
than between the simulated I(0) variables.
We conduct pairwise Giacomini and White (2006) tests of equal conditional predictive
ability of OLS forecasts and alternative forecast methods, using the one-step-ahead FMSE as
our loss function. We indicate the significance levels of these tests in Table 4. For the kitchen
sink specification, forecast combinations, LASSOmin and the elastic net produce significantly
Table 4: Comparison of FMSE for different forecast methods

Kitchen sink variables (9 persistent variables as well as svar, infl and ltr)
window   OLS       Average    AR(1)     r-walk    F-comb       LASSOfwd    LASSOmin    LASSOfix   el-net
10-year  0.002089  0.001884†  0.001900  0.003344  0.001803✓**  0.001913**  0.001831**  0.001839*  0.001849**
20-year  0.002046  0.002050   0.002036  0.003598  0.001929     0.002003    0.001928✓†  0.001951   0.001928†
30-year  0.002035  0.002051   0.002042  0.003651  0.001956     0.001928✓   0.001968    0.001961   0.001972

Two persistent variables (d/p and d/y) as well as svar, infl and ltr
10-year  0.001909  0.001884   0.001900  0.003344  0.001808     0.001818    0.001799✓†  0.001843   0.001819†
20-year  0.001914  0.002050   0.002036  0.003598  0.001921     0.001924    0.001866    0.001944   0.001866✓†
30-year  0.001895  0.002051   0.002042  0.003651  0.001931     0.001895    0.001892    0.001929   0.001888✓

a A tick (✓) marks the forecast method with the smallest forecast mean squared error in each row.
b Significance levels of pairwise Giacomini and White (2006) tests of equal conditional predictive ability of OLS forecasts and an alternative forecast method in the corresponding column: †, * and ** denote 10%, 5% and 1% levels of significance.
c Coloured blocks represent superior forecast methods compared to the others using the Hansen (2005) test for superior predictive ability.
better one-step-ahead forecasts (at the 1% level) than the corresponding OLS alternative using
the ten year window, and LASSOmin out-performs OLS at the 10% level using the twenty year
window. For the parsimonious two persistent predictors case, the LASSOmin model and elastic
net perform significantly better than the OLS benchmark at the 10% level over the ten year
rolling window samples.
The Hansen (2005) test for superior predictive ability leads us to a similar conclusion. For
each forecasting method i, we test the null that
H_0 : FMSE_i \le FMSE_j for all j \ne i, j \in \mathcal{J},
where J contains all methods considered here. We use purple blocks in Table 4 to highlight
those cases for which H_0 cannot be rejected at the 5% significance level. Regardless of the
size of the estimation rolling window, we find no statistically significant evidence that the
forecasts generated by combinations, LASSOmin or elastic net are inferior to any of the other
forecast strategies, for either the kitchen sink or for the more parsimonious model. For the
thirty year estimation samples we find no evidence that the LASSOfix and the OLS forecasts
are dominated by other forecasts either.
6.3 Robustness analysis
The tests for cointegration rank shown in Figure 3 suggest that there is usually more than one
cointegration relationship between the nine persistent predictors. We point out that although
our modelling framework assumes just a single cointegrating vector, it is easy to accommo-
date multiple cointegrating relationships. To see this, suppose that there are two normalised
cointegrating vectors are d1 and d2 that have normalisation constants c1 and c2. The fact that
we are considering one-equation predictive regression implies that b = c1d1 + c2d2 and the
LASSO is simply estimating a linear combination of the cointegrating vectors. In other words,
instead of estimating the two cointegrating relationships separately, the LASSO identifies a
linear combination of them in b. The same intuition applies when we are working with sets
of predictors that have higher cointegrating rank.
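A small numerical illustration of why a single coefficient vector suffices (the vectors and constants below are hypothetical numbers of our choosing, not estimates from the paper):

```python
import numpy as np

# Two hypothetical normalised cointegrating vectors and constants.
d1 = np.array([1.0, -0.5, 0.0, 0.0])
d2 = np.array([0.0, 0.0, 1.0, -1.0])
c1, c2 = 0.3, 0.2

# The predictive coefficient is a single vector in the span of (d1, d2),
# so the LASSO only needs to recover this one combination, not d1 and
# d2 separately.
b = c1 * d1 + c2 * d2
D = np.column_stack([d1, d2])
coefs, *_ = np.linalg.lstsq(D, b, rcond=None)  # recovers (c1, c2) exactly
```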
This is exactly the case for our data set relating to stock return prediction. Simple algebra
shows that the LASSO estimates of the coefficients of the linear combination of independent
cointegrating vectors are still consistent. Furthermore, all the asymptotic properties of the
LASSOmin estimates presented in Section 4 still hold. We plot the twenty year rolling window
LASSO estimates of some of the coefficients in a kitchen sink model in Figure 4, and provide
the full set of estimated coefficients in Figures S2 and S3 in our online supplement.
An estimated coefficient of 0 indicates that the LASSOmin model drops the associated
variable from the predictive model. Interestingly, the first two panels of Figure 4 show that
although the coefficients of d/p and d/y often move together (in opposite directions), they
do not always reflect a (1, �1) cointegrating relationship. This is due to the presence of other
persistent predictors whose predictive ability is sporadic, and as these variables enter and
leave the estimated cointegrating relationship, the coefficients of d/p and d/y change. There
is an apparent structural break in early 1996, after which more than half of the persistent
variables are discarded by the LASSO. Only the earnings price ratio (e/p), stock variance (svar) and inflation (infl) appear relevant in most windows after early 1996.
Figure 4, together with Figures S2 and S3 also shows that the signs of some of the esti-
mated coefficients change in different parts of the sample. For instance, the earnings price
ratio (e/p) is positively related to the equity premium prior to 1975 or after 1995, but it is
negatively correlated with the equity premium from 1975 to 1995. This reflects the changing
nature of stock return predictability through time. However, the signs of some coefficients
stay the same throughout the sample. For instance, a higher Treasury bill rate (tbl) or inflation rate (infl) always leads to a lower equity premium, and a higher dividend price ratio (d/p), default yield spread (dfy) or long-term return (ltr) always leads to a higher premium.
We use colour to illustrate the NBER recession dates in Figures 4, S2 and S3 to investigate
whether expansions or recessions have any systematic impact on the predictive ability of
the regressors. There does not appear to be a systematic relationship between the relevance
of any predictor and phases of the business cycle. However, several predictor coefficients
(e.g. d/p, d/y and e/p) change very sharply immediately after the 1981-1982 recession, and
then again just before the 1990-1991 recession, and several (e/p, ntis, svar and infl) take
on different values during the 2007-2009 recession. The results demonstrate the sporadic
nature of the relationship between equity premium and the predictors commonly used in the
literature, which highlights the usefulness of the LASSO to “auto-select” the most relevant
and important predictors over successive estimation windows to achieve forecasting gains.
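The rolling-window "auto-selection" just described can be mimicked with a minimal coordinate-descent LASSO. The code below is a stylised sketch under assumptions of our own (stationary simulated predictors, a single relevant regressor, hypothetical window length and penalty), not the paper's data or tuning:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/(2T))*||y - X b||^2 + lam*||b||_1."""
    T, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)             # per-column sum of squares
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual excluding j
            rho = X[:, j] @ r_j
            b[j] = np.sign(rho) * max(abs(rho) - lam * T, 0.0) / col_ss[j]
    return b

# Hypothetical design: 600 monthly observations, 10 predictors, only the
# first one relevant; 240-month (20-year) windows rolled forward annually.
rng = np.random.default_rng(1)
T, p, window = 600, 10, 240
X = rng.standard_normal((T, p))
y = 0.5 * X[:, 0] + rng.standard_normal(T)

paths = []
for start in range(0, T - window + 1, 12):
    Xw, yw = X[start:start + window], y[start:start + window]
    paths.append(lasso_cd(Xw - Xw.mean(0), yw - yw.mean(), lam=0.1))
paths = np.array(paths)                       # one coefficient vector per window
# Column 0 stays active in every window; most other columns are set to 0,
# mirroring how the LASSO drops irrelevant predictors window by window.
```

In each window the relevant coefficient survives (shrunk towards zero by the penalty) while the noise predictors are mostly zeroed out, which is the mechanism behind the sporadic coefficient paths in Figure 4.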
Finally, we examine the robustness of the forecasting results using the Welch and Goyal
(2008) data, but starting from January 1927. The FMSEs of the competing forecast methods
under consideration are tabulated in Table S2 in our online supplement. Compared with
Table 4, most of the results are qualitatively similar.

Figure 4: LASSOmin estimated coefficients over twenty year windows for eight key predictors (with recession periods shaded in colour). [Figure omitted: eight panels, one each for d/p, d/y, e/p, tbl, dfy, dfr, svar and infl, plotted over 1965-2012.]

Overall, the LASSOmin forecasts perform the best, and pair-wise Giacomini and White (2006) tests in the kitchen sink set-up
always find that forecasts based on LASSOmin estimation are statistically better than those
based on the OLS regression. The Hansen (2005) test for superior predictive ability always ranks the LASSOmin model among the three best specifications for forecasting the equity
premium. In general, we can conclude that LASSO has been successful in dealing with the
sporadic relationship between the predictors and the prediction objective. This, together with the LASSO's ability to capture cointegration among persistent predictors, leads to superior forecasting performance.
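For readers who want the flavour of the pairwise comparisons above: the unconditional version of the Giacomini and White (2006) test with a constant instrument reduces to a Diebold-Mariano-type t-statistic on the loss differential. Below is a minimal sketch of our own simplified implementation (the full conditional test uses additional instruments); the function name and the simulated losses are ours, and a Bartlett (Newey-West) long-run variance is assumed:

```python
import numpy as np

def gw_test_stat(loss_a, loss_b, lags=0):
    """Unconditional Giacomini-White / Diebold-Mariano style t-statistic on the
    loss differential d_t = loss_a_t - loss_b_t, with a Bartlett (Newey-West)
    long-run variance. Positive values favour method b (smaller losses)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    T = d.size
    d_c = d - d.mean()
    lrv = d_c @ d_c / T
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1)              # Bartlett kernel weight
        lrv += 2.0 * w * (d_c[:-k] @ d_c[k:]) / T
    return d.mean() / np.sqrt(lrv / T)

# Simulated example: method b has clearly smaller forecast errors than method a.
rng = np.random.default_rng(4)
e_a = 2.0 * rng.standard_normal(500)          # method a: noisier forecasts
e_b = rng.standard_normal(500)                # method b: more accurate forecasts
stat = gw_test_stat(e_a ** 2, e_b ** 2, lags=4)
# stat is large and positive here, rejecting equal predictive ability in favour of b.
```

Under the null of equal predictive ability the statistic is asymptotically standard normal, so values beyond roughly ±2 indicate a significant accuracy difference at conventional levels.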
7 Conclusion
This paper considers the use of the LASSO in a predictive regression in which the depen-
dent variable is stationary, there are many predictors, and some of these predictors are nonstationary and potentially cointegrated. We show that the LASSO yields a consistent estimator of the coefficients, including those of the cointegrating vector, even when the numbers of both stationary and nonstationary predictors are allowed to diverge, although the dimension of the nonstationary predictors cannot grow as fast as that of the stationary ones. We also
provide the asymptotic distribution of the cointegrating coefficients, which can be simulated
and used for statistical inference. The set of assumptions required for the consistency of the
LASSO estimator allows the dimensions of both the stationary and nonstationary predictors
to be arbitrarily large, making it feasible to implement the LASSO even when the number
of predictors is higher than the sample size. Our simulation studies show that the proposed
LASSO approach leads to superior forecasting performance relative to the commonly used
OLS kitchen sink model, an historical average, and several other widely used forecast procedures.
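As an illustration of the claim that the asymptotic distribution of the cointegrating coefficients "can be simulated and used for statistical inference", the sketch below discretises the scalar special case of such a limit, the ratio of the stochastic integral of W against U to the quadratic variation of W, with W and U independent (i.e. setting the one-sided long-run covariance term to zero). This is our own simplified illustration; the Theorem 2 limit in the paper is the multivariate analogue with a drift correction:

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, n_rep = 1000, 2000
draws = np.empty(n_rep)
for r in range(n_rep):
    dW = rng.standard_normal(n_steps) / np.sqrt(n_steps)  # increments of W on [0, 1]
    dU = rng.standard_normal(n_steps) / np.sqrt(n_steps)  # increments of U, independent
    W = np.concatenate(([0.0], np.cumsum(dW)))[:-1]       # left-endpoint values of W
    # Discretised version of  int_0^1 W dU  /  int_0^1 W^2 dr :
    draws[r] = (W @ dU) / (W @ W / n_steps)
quantiles = np.quantile(draws, [0.025, 0.975])            # simulated critical values
```

The empirical quantiles of `draws` then serve as simulated critical values for the (here mixed-normal, so symmetric around zero) limiting distribution.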
The application of the LASSO to the problem of predicting the equity premium provides
an interesting illustrative example. Given that there are numerous predictors for the equity
premium, the variable selection function of the LASSO tackles the challenge of selecting the
most relevant variables very nicely. The LASSO also helps to deal with the non-stationarity
of some of the predictors by retaining those in a cointegrating relationship and estimating
the associated cointegrating coefficients. When we compare the performance of forecasts
of the equity premium based on models estimated via the LASSO with forecasts based on
other forecasting methods, we find that the former are usually superior, and we attribute
this finding, at least in part, to the LASSO’s ability to capture cointegration among persistent
predictors.
To the best of our knowledge, our paper is the first to study the statistical properties of
LASSO estimates of predictive regression coefficients in the presence of cointegration. We
should emphasize that our methodology does not exclude other ways of improving the predictability of stock returns, for example, bringing in other macroeconomic variables (Buncic and Tischhauser, 2016) and technical indicators (Neely et al., 2014), or imposing constraints on the return forecasts and/or estimated coefficients (Campbell and Thompson, 2008). As
documented by Buncic and Tischhauser (2016) and Li and Tsiakas (2016), combining addi-
tional information and parameter constraints with LASSO might further improve the pre-
dictability of the equity premium.
Note: We refer readers to our online supplement for empirical results not contained in this
paper.
References
van Binsbergen, J. H. and Koijen, R. S. (2010), ‘Predictive regressions: A present-value approach’, The Journal of Finance 65(4), 1439–1471.

Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data, New York: Springer.

Buncic, D. and Tischhauser, M. (2016), Macroeconomic factors and equity premium predictability, Working paper.

Campbell, J. Y. (1987), ‘Stock returns and the term structure’, Journal of Financial Economics 18(2), 373–399.

Campbell, J. Y. and Shiller, R. J. (1988), ‘The dividend-price ratio and expectations of future dividends and discount factors’, Review of Financial Studies 1(3), 195–228.

Campbell, J. Y. and Thompson, S. B. (2008), ‘Predicting excess stock returns out of sample: Can anything beat the historical average?’, Review of Financial Studies 21(4), 1509–1531.

Campbell, J. Y. and Yogo, M. (2006), ‘Efficient tests of stock return predictability’, Journal of Financial Economics 81(1), 27–60.

Caner, M. and Knight, K. (2013), ‘An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag’, Journal of Statistical Planning and Inference 143(4), 691–715.
Christoffersen, P. and Diebold, F. (1998), ‘Cointegration and long-horizon forecasting’, Journal of Business and Economic Statistics 16, 450–458.

Cochrane, J. H. (1991), ‘Production-based asset pricing and the link between stock returns and economic fluctuations’, The Journal of Finance 46(1), 209–237.

Cremers, K. M. (2002), ‘Stock return predictability: A Bayesian model selection perspective’, Review of Financial Studies 15(4), 1223–1249.

Dow, C. H. (1920), Scientific Stock Speculation, Magazine of Wall Street.

Elliott, G., Gargano, A. and Timmermann, A. (2013), ‘Complete subset regressions’, Journal of Econometrics 177(2), 357–373.

Fama, E. F. and French, K. R. (1993), ‘Common risk factors in the returns on stocks and bonds’, Journal of Financial Economics 33(1), 3–56.

Fama, E. F. and Schwert, G. W. (1977), ‘Asset returns and inflation’, Journal of Financial Economics 5(2), 115–146.

Fan, J. and Lv, J. (2010), ‘A selective overview of variable selection in high dimensional feature space’, Statistica Sinica 20(1), 101.

Giacomini, R. and White, H. (2006), ‘Tests of conditional predictive ability’, Econometrica 74(6), 1545–1578.

Gonzalo, J. and Pitarakis, J.-Y. (1999), Dimensionality effect in cointegration analysis, in R. Engle and H. White, eds, ‘Festschrift in Honour of Clive Granger’, Oxford University Press, pp. 212–229.

Hansen, B. E. (1992), ‘Convergence to stochastic integrals for dependent heterogeneous processes’, Econometric Theory 8(4), 489–500.

Hansen, P. R. (2005), ‘A test for superior predictive ability’, Journal of Business & Economic Statistics 23, 365–380.

Johansen, S. (1991), ‘Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models’, Econometrica 59, 1551–1580.

Johansen, S. (1995), Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.

Kock, A. B. (2016), ‘Consistent and conservative model selection with the adaptive lasso in stationary and nonstationary autoregressions’, Econometric Theory 32(1), 243–259.

Koijen, R. S. and Nieuwerburgh, S. V. (2011), ‘Predictability of returns and cash flows’, Annual Review of Financial Economics 3(1), 467–491.

Koo, B. and Seo, M. H. (2015), ‘Structural-break models under mis-specification: implications for forecasting’, Journal of Econometrics 188(1), 166–181.

Lamont, O. (1998), ‘Earnings and expected returns’, The Journal of Finance 53(5), 1563–1587.

Lettau, M. and Ludvigson, S. (2001), ‘Consumption, aggregate wealth, and expected stock returns’, The Journal of Finance 56(3), 815–849.
Li, J. and Tsiakas, I. (2016), A new approach to equity premium prediction: Economic fundamentals matter!, Working paper.

Liao, Z. and Phillips, P. C. (2015), ‘Automated estimation of vector error correction models’, Econometric Theory 31(3), 581–646.

Ludvigson, S. C. and Ng, S. (2007), ‘The empirical risk–return relation: A factor analysis approach’, Journal of Financial Economics 83(1), 171–222.

Martin, V., Hurn, S. and Harris, D. (2012), Econometric Modelling with Time Series, Cambridge University Press.

Neely, C., Rapach, D., Tu, J. and Zhou, G. (2014), ‘Forecasting the equity risk premium: The role of technical indicators’, Management Science 60(7), 1772–1791.

Pastor, L. and Stambaugh, R. F. (2009), ‘Predictive systems: Living with imperfect predictors’, The Journal of Finance 64(4), 1583–1628.

Phillips, P. C. (2014), ‘On confidence intervals for autoregressive roots and predictive regression’, Econometrica 82(3), 1177–1195.

Phillips, P. C. (2015), ‘Halbert White Jr. memorial JFEC lecture: Pitfalls and possibilities in predictive regression’, Journal of Financial Econometrics 13(3), 521–555.

Pontiff, J. and Schall, L. D. (1998), ‘Book-to-market ratios as predictors of market returns’, Journal of Financial Economics 49(2), 141–160.

Poskitt, D. S. (2000), ‘Strongly consistent determination of cointegrating rank via canonical correlations’, Journal of Business and Economic Statistics 18, 77–90.

Qian, J., Hastie, T., Friedman, J., Tibshirani, R. and Simon, N. (2013), Glmnet for Matlab, http://www.stanford.edu/~hastie/glmnetmatlab/.

Rapach, D. E., Ringgenberg, M. C. and Zhou, G. (2016), ‘Short interest and aggregate stock returns’, Journal of Financial Economics 121(1), 46–65.

Rapach, D. E., Strauss, J. K. and Zhou, G. (2010), ‘Out-of-sample equity premium prediction: Combination forecasts and links to the real economy’, Review of Financial Studies 23(2), 821–862.

Rapach, D. and Zhou, G. (2013), Forecasting stock returns, in ‘Handbook of Economic Forecasting’, Vol. 2, Elsevier.

Rossi, B. (2013), ‘Advances in forecasting under instability’, Handbook of Economic Forecasting 2, 1203–1324.

Stambaugh, R. F. (1999), ‘Predictive regressions’, Journal of Financial Economics 54(3), 375–421.

Stock, J. H. and Watson, M. W. (2002), ‘Forecasting using principal components from a large number of predictors’, Journal of the American Statistical Association 97(460), 1167–1179.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.

Tibshirani, R. (2011), ‘Regression shrinkage and selection via the lasso: a retrospective’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(3), 273–282.
Timmermann, A. (2006), ‘Forecast combinations’, Handbook of Economic Forecasting 1, 135–196.

Welch, I. and Goyal, A. (2008), ‘A comprehensive look at the empirical performance of equity premium prediction’, Review of Financial Studies 21(4), 1455–1508.

Wilms, I. and Croux, C. (2016), ‘Forecasting using sparse cointegration’, International Journal of Forecasting 32, 1256–1267.

Zou, H. and Hastie, T. (2005), ‘Regularization and variable selection via the elastic net’, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67(2), 301–320.
Appendix A
A.1 Proofs of Theorems
Proof of Theorem 1. Since by definition
\[
\|u - DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma\|_1 \le \|u\|_2^2/T + \lambda\|\gamma^0\|_1,
\]
rearranging the above inequality yields
\[
\frac{(\theta-\theta^0)'B_cD'DB_c(\theta-\theta^0)}{T} + \lambda\|\gamma\|_1 \le \frac{2u'DB_c(\theta-\theta^0)}{T} + \lambda\|\gamma^0\|_1. \tag{15}
\]

We start with the first term on the right hand side of (15). Given A1 (i.e. $c$ belongs to a compact parameter space), $c = O(1)$ and thus for any $h > 0$,
\[
\Pr\left\{ 2c\|u'X\|_\infty/T \le (T^h \wedge k\ln k) \right\} \to 1
\]
by Lemma 2. Further, Lemma 1 implies that there exists a $\bar\lambda$ such that
\[
\Pr\left\{ 2\|u'G\|_\infty/T \le \bar\lambda \right\} \to 1
\]
and $\bar\lambda^{-1} = o(\sqrt{T/\ln p})$. Hence, with probability approaching one,
\[
2|u'DB_c(\theta-\theta^0)|/T \le \bar\lambda\|\gamma-\gamma^0\|_1 + cO_p(T^h \wedge k\ln k)\|\delta-\delta^0\|_1
\]
due to Lemma 1 and the Hölder inequality. Therefore, regarding (15), for $\lambda \ge \left(2\bar\lambda \vee \left((T^h \wedge k\ln k)/3\sqrt{T}\right)\right)$,
\[
2\|DB_c(\theta-\theta^0)\|_2^2/T + 2\lambda\|\gamma\|_1 \le \lambda\|\gamma-\gamma^0\|_1 + 2\lambda\|\gamma^0\|_1 + cO_p(T^h \wedge k\ln k)\|\delta-\delta^0\|_1.
\]
Note that due to the triangle inequality,
\[
\|\gamma\|_1 = \|\gamma_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1 \ge \|\gamma^0_{S_0}\|_1 - \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1,
\]
where $S_0$ is the support of the parameter $\gamma^0$, i.e. $S_0 := \{j : \gamma^0_j \ne 0\}$, and by construction
\[
\|\gamma-\gamma^0\|_1 = \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + \|\gamma_{S_0^c}\|_1.
\]
Then, as $k/\sqrt{T} \le 3\lambda$,
\begin{align*}
2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma_{S_0^c}\|_1
&\le 3\lambda\|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + O_p\left((T^h \wedge k\ln k)/\sqrt{T}\right) c\|\sqrt{T}(\delta-\delta^0)\|_1 \\
&\le 3\lambda\left( \|\gamma_{S_0} - \gamma^0_{S_0}\|_1 + c\|\sqrt{T}(\delta-\delta^0)\|_1 \right) \\
&= 3\lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1, \tag{16}
\end{align*}
where $S_*$ is the index set combining $S_0$ and the indices for $\delta$, and $B_*$ is a diagonal matrix of size $|S_*|$ whose first $k$ elements are $c$ and the others are 1. Also, the first $k$ elements of $\theta-\theta^0$ do not lie within the span of $\delta^0$ due to the norm normalization. That is, if $\delta - \delta^0 = a\delta^0$ for some $a$, then $\delta = (1+a)\delta^0$, which in turn implies that $a = 0$ due to the norm normalization. This enables us to apply the compatibility condition A2, implying that there exists a sequence of positive constants $\phi_T$ such that
\[
\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1^2 \le \frac{\|DB_c(\theta-\theta^0)\|_2^2\, s_0(\lambda,\theta)}{T\phi_T^2} = \frac{(\theta-\theta^0)'B_cD'DB_c(\theta-\theta^0)}{T}\cdot\frac{s_0(\lambda,\theta)}{\phi_T^2}, \tag{17}
\]
where $s_0(\lambda,\theta)$ denotes the cardinality of the index set $S_*$ for $\theta$, i.e. $s_0 = |S_*|$. Then
\begin{align*}
2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\left\|B_c(\theta-\theta^0)\right\|_1
&= 2\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\|\gamma_{S_0^c}\|_1 + \lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1 \\
&\le 4\lambda\left\| B_*\left(\theta_{S_*} - \theta^0_{S_*}\right)\right\|_1 \\
&\le 4\lambda\sqrt{s_0(\lambda,\theta)}\,\|DB_c(\theta-\theta^0)\|_2/\left(\sqrt{T}\phi_T\right) \\
&\le \|DB_c(\theta-\theta^0)\|_2^2/T + 4\lambda^2 s_0(\lambda,\theta)/\phi_T^2.
\end{align*}
The first inequality comes from (16), the second inequality is due to (17) and the last inequality uses $4uv \le u^2 + 4v^2$. Rearranging the last inequality,
\[
\|DB_c(\theta-\theta^0)\|_2^2/T + \lambda\left\|B_c(\theta-\theta^0)\right\|_1 \le 4\lambda^2 s_0(\lambda,\delta)/\phi_T^2.
\]
When $\beta^0 = 0$, $\hat c = \|\hat\beta\|_1 = O_p\left(\lambda/\sqrt{T}\right)$, since
\[
\left\|B_c(\theta-\theta^0)\right\|_1 = c\sqrt{T}\|\delta-\delta^0\|_1 + \|\gamma-\gamma^0\|_1
\]
and $\hat\delta$ is not consistent due to the lack of identification. $\blacksquare$
Proof of Theorem 2. First, we refine the convergence rate of $\hat\delta$ under the additional assumption that $k$ is fixed. Recall that
\[
\frac{D'D}{T} = \begin{pmatrix} X'X/T^2 & X'G/(T\sqrt{T}) \\ G'X/(T\sqrt{T}) & G'G/T \end{pmatrix}
\]
and note that
\[
\left\| X'G/(T\sqrt{T}) \right\| = O_p\left(T^{-1/2+q}\right),
\]
where $q$ is any positive number, due to Lemma 2. Removing the positive term involving $G'G$ and substituting this and the convergence rate of $\theta$ that we just obtained into the first inequality in (16), we can deduce that
\[
c_1 T\|\delta-\delta^0\|_1^2 \le O_p(1)\|\delta-\delta^0\|_1 + O_p\left(s_0\lambda^2 T^{-1+q}\right).
\]
This yields the desired rate for $\hat\delta$. Let
\[
S_T(\gamma,\delta) = \frac{1}{T}\sum_{t=1}^T \left(y_t - g_t'\gamma - x_t'\delta\right)^2.
\]
It is clear from Theorem 1 that
\[
\sup_\delta \left| S_T(\hat\gamma,\delta) - S_T(\gamma^0,\delta) \right| = o_p\left(T^{-1}\right),
\]
where the supremum is taken over any $T^{-1}$ neighborhood of $\delta^0$. Therefore, let $S_T(\delta) = S_T(\gamma^0,\delta)$ and consider weak convergence of the following stochastic process:
\begin{align*}
\mathbb{D}_T(b) &= T\left( S_T\left(\delta^0 + T^{-1}b\right) - S_T\left(\delta^0\right) \right) \\
&= -\frac{2c^0}{T}\sum_{t=1}^T u_t x_t'b + c^0 b'\left(\frac{1}{T^2}\sum_{t=1}^T x_t x_t'\right) b\, c^0 \\
&\Rightarrow -2c^0 b'\left( \int_0^1 W(r)\,dU(r) + \Lambda \right) + c^0 b' \int_0^1 W(r)W(r)'\,dr\; b\, c^0 := \mathbb{D}(b)
\end{align*}
over any given compact space for $b$.

Furthermore, the estimator $\hat\delta$ is the minimizer of $S_T(\hat\gamma,\delta)$ under the constraint that $\|\delta\|_1 = 1$. This makes the parameter space for $b$ orthogonal to $\delta^0$, which prevents the quadratic term in $\mathbb{D}_T$ from degenerating. The precise formulation takes some elaboration. Consistency of $\hat\delta$ implies the sign-consistency of its nonzero elements as well. Thus,
\begin{align*}
0 &= T\left(\|\hat\delta\|_1 - \|\delta^0\|_1\right) \\
&= \sum_j \mathrm{sgn}\left(\delta^0_j\right) T\left(\hat\delta_j - \delta^0_j\right) + \sum_j T|\hat\delta_j|\, 1\{\delta^0_j = 0\} \\
&= \mathrm{sgn}\left(\delta^0\right)'b + \sum_j |b_j|\, 1\{\delta^0_j = 0\}
\end{align*}
with probability approaching 1. Therefore, we can conclude that
\[
T\left(\hat\delta - \delta^0\right) \xrightarrow{d} \arg\min_{b:\ \mathrm{sgn}(\delta^0)'b = -\sum_j |b_j|1\{\delta^0_j=0\}} \mathbb{D}(b)
\]
by the argmax continuous mapping theorem, since the limit process is quadratic over the parameter space. $\blacksquare$
A.2 Lemmata
Lemma 1 Let $\{z_{tj}\}$, $j = 1,\dots,p$, be centered $\alpha$-mixing sequences satisfying the following: for some positive $\gamma_1$, $\gamma_2$, $a$, and $b$,
\begin{align*}
\alpha(m) &\le \exp\left(-a m^{\gamma_1}\right), \\
P\left(|z_{tj}| > z\right) &\le H(z) := \exp\left(1 - (z/b)^{\gamma_2}\right) \quad \text{for all } j, \tag{18} \\
\gamma &= \left(\gamma_1^{-1} + \gamma_2^{-1}\right)^{-1} < 1, \\
\ln p &= O\left( T^{\gamma_1 \wedge \left(1 - 2\gamma_1(1-\gamma)/\gamma\right)} \right). \tag{19}
\end{align*}
Then
\[
\max_{j\le p} \left| \sum_{t=1}^T z_{tj} \right| = O_p\left( \sqrt{T}\ln p \right).
\]

Proof of Lemma 1. Consider the sums of truncated variables first. Let
\[
z^M_{tj} = z_{tj}\, 1\left\{|z_{tj}| \le M\right\} + M\, 1\left\{|z_{tj}| > M\right\}, \qquad M = \left(1 + T^{\gamma_1}\right)^{1/\gamma_2}.
\]
Due to Proposition 2 in Merlevède et al. (2011), we have
\[
E\exp\left( T^{-1/2}\sum_{t=1}^T \left(z^M_{tj} - Ez^M_{tj}\right) \right) = O(1)
\]
uniformly in $j$, and thus
\[
E\max_{j\le p} \left| T^{-1/2}\sum_{t=1}^T \left(z^M_{tj} - Ez^M_{tj}\right) \right| = O(\ln p). \tag{20}
\]
For the remainder term, note that
\begin{align*}
E\max_{j\le p} \sum_{t=1}^T \left| z_{tj} - z^M_{tj} + Ez^M_{tj} \right|
&\le Tp\, E\left| z_{tj} - z^M_{tj} + Ez^M_{tj} \right| \\
&\le 2Tp \int_M^\infty H(x)\,dx \\
&= O\left( Tp\exp\left(-T^{\gamma_1}\right) \right),
\end{align*}
where the last equality is obtained by bounding the integral by $MH(M) = \exp\left(-T^{\gamma_1}\right)$ in general, since $H(x)$ decays at an exponential rate.

Thus,
\begin{align*}
E\max_{j\le p} \left| \sum_{t=1}^T z_{tj} \right|
&\le E\max_{j\le p} \left| \sum_{t=1}^T \left( z^M_{tj} - Ez^M_{tj} \right) \right| + E\max_{j\le p} \left| \sum_{t=1}^T \left[ z_{tj} - \left( z^M_{tj} - Ez^M_{tj} \right) \right] \right| \\
&= O\left( \sqrt{T}\ln p \right) + O\left( Tp\exp\left(-T^{\gamma_1}\right) \right).
\end{align*}
Finally, recall that $\ln p = o\left(T^{\gamma_1}\right)$ by (19). $\blacksquare$

Next, we derive another maximal inequality involving integrated processes by means of martingale approximation (e.g. Hansen 1992).
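Before moving on, a quick numerical sanity check of the $\sqrt{T}\ln p$ rate in Lemma 1 (our own illustration, using i.i.d. Gaussian variables as a trivial special case of the mixing and tail assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 200
ratios = []
for T in (400, 1600, 6400):
    z = rng.standard_normal((T, p))          # iid Gaussian: a special mixing case
    max_sum = np.abs(z.sum(axis=0)).max()    # max_j |sum_t z_tj|
    ratios.append(max_sum / (np.sqrt(T) * np.log(p)))
# The ratios stay bounded as T grows, consistent with the O_p(sqrt(T) ln p) bound.
```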
Lemma 2 Let $\{x_{ti}\}$ be an integrated process such that $\Delta x_{ti}$ is centered and $x_{0i} = O_p(1)$. Let $\{\Delta x_{ti}\}$ and $\{z_{tj}\}$, $i = 1,\dots,k$, $j = 1,\dots,p$, be $\alpha$-mixing sequences satisfying the conditions in Lemma 1. Also let $k = O\left(\sqrt{T}\right)$ and $p = O\left(T^r\right)$ for some $r < \infty$. Then
\[
\max_{1\le i\le k}\max_{1\le j\le p} \left| \sum_{t=1}^T z_{tj}x_{ti} \right| = O_p\left( T^{1+d} \wedge Tkp \right)
\]
for any $d > 0$.

Proof of Lemma 2. When $k$ and $p$ are of order $O\left(\ln^c T\right)$ for some finite $c$, bounding the maximum by the summation yields the bound $Tkp$, since $\sum_{t=1}^T z_{tj}x_{ti} = O_p(T)$, which is standard; see e.g. Hansen (1992). Thus, we turn to the case where $k$ and $p$ grow at a polynomial rate of $T$ and tighten this bound.

We apply the martingale difference approximation to $z_{tj}$ following Hansen (1992). Let $\mathcal{F}_t$ denote the sigma field generated by all $z_{t-s,j}$ and $x_{t+1-s,i}$, $s, i, j = 1, 2, \dots$, and let $E_t(\cdot)$ be the conditional expectation based on $\mathcal{F}_t$. Define
\[
e_{tj} = \sum_{s=0}^\infty \left( E_t z_{t+s,j} - E_{t-1} z_{t+s,j} \right) \quad \text{and} \quad \zeta_{tj} = \sum_{s=1}^\infty E_t z_{t+s,j}.
\]
Then it is clear that
\[
z_{tj} = e_{tj} + \zeta_{t-1,j} - \zeta_{t,j}
\]
and $e_{tj}$ is a martingale difference sequence.

We begin with deriving a tail probability bound for the sum of a martingale difference array $\left\{\varepsilon_t = e_{tj}x_{ti}/\sqrt{T}\right\}$ for each $i$ and $j$, where we suppress the dependence of $\varepsilon_t$ on $i$ and $j$ to ease notation. Consider a truncated sequence
\[
\varepsilon^C_t = \varepsilon_t\, 1\left\{|\varepsilon_t| < C\right\} + C\, 1\left\{\varepsilon_t \ge C\right\} - C\, 1\left\{\varepsilon_t \le -C\right\}
\]
and its centered version
\[
\bar\varepsilon^C_t = \varepsilon^C_t - E\left( \varepsilon^C_t \mid \mathcal{F}_{t-1} \right).
\]
Then, by construction, $\bar\varepsilon^C_t$ is a martingale difference sequence. Furthermore,
\[
\left|\bar\varepsilon^C_t\right| \le \left|\varepsilon^C_t\right| + E\left( \left|\varepsilon^C_t\right| \mid \mathcal{F}_{t-1} \right) \le 2C
\]
by the triangle and Jensen's inequalities, and by the fact that $\left|\varepsilon^C_t\right| \le C$. Furthermore, let
\[
r^C_t = \varepsilon_t - \bar\varepsilon^C_t = \left( \varepsilon_t - \varepsilon^C_t \right) - E\left( \left( \varepsilon_t - \varepsilon^C_t \right) \mid \mathcal{F}_{t-1} \right),
\]
where the last equality follows from the fact that $\{\varepsilon_t\}$ is an MDS, and thus
\[
\varepsilon_t = \bar\varepsilon^C_t + r^C_t.
\]
By Azuma's (1967) inequality,
\[
\Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T \bar\varepsilon^C_t \right| \ge c \right\} \le 2\exp\left( -\frac{c^2}{4C^2} \right).
\]
On the other hand, since $r^C_t = 0$ if $|\varepsilon_t| \le C$, we have for any $c > 0$,
\[
\Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T r^C_t \right| > c \right\} \le \Pr\left\{ \max_t |\varepsilon_t| > C \right\}. \tag{21}
\]
However,
\begin{align*}
\Pr\left\{ \max_t |\varepsilon_t| > C \right\}
&\le \Pr\left\{ \max_t |e_{tj}| \max_t \left| \frac{x_{ti}}{\sqrt{T}} \right| > C \right\} \\
&\le \sum_t \Pr\left\{ |e_{tj}| > \sqrt{C} \right\} + \Pr\left\{ \max_t \left| \frac{x_{ti}}{\sqrt{T}} \right| > \sqrt{C} \right\} \\
&\le C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right),
\end{align*}
where the first term comes from the Markov inequality and the second and third terms are given by Merlevède et al.'s (2011) equation (1.11).

Next, we derive some moment bounds for the sums of $\zeta_{tj}$ and $e_{tj}$. By (A.1) of Hansen (1992), if $z_{tj}$ is $\alpha$-mixing of size $-qb/(q-b)$ with $E|z_{tj}|^q < \infty$,
\[
\left( E|\zeta_{tj}|^b \right)^{1/b} \le 6\sum_{k=1}^\infty \alpha_k^{1/b - 1/q} \left( E|z_{tj}|^q \right)^{1/q}.
\]
Further, by the triangle inequality,
\[
\left( E|e_{tj}|^q \right)^{1/q} = \left( E\left|z_{tj} + \zeta_{tj} - \zeta_{t-1,j}\right|^q \right)^{1/q} \le \left( E|z_{tj}|^q \right)^{1/q} + 2\left( E|\zeta_{tj}|^q \right)^{1/q}.
\]
Under our moment conditions and the mixing decay rate in Lemma 1, these quantities are bounded for any $b$ and $q$ such that $q > b$. Then, combining the preceding arguments, we conclude that for any $a_T$,
\begin{align*}
\Pr\left\{ \left| \frac{1}{a_T\sqrt{T}}\sum_{t=1}^T \varepsilon_t \right| \ge c \right\}
&\le \Pr\left\{ \left| \frac{1}{\sqrt{T}}\sum_{t=1}^T \bar\varepsilon^C_t \right| \ge \frac{a_T c}{2} \right\} + \Pr\left\{ \max_t |\varepsilon_t| > C \right\} \\
&\le 2\exp\left( -\frac{a_T^2 c^2}{4C^2} \right) + C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right).
\end{align*}
Given this bound, we can deduce from the union bound that
\begin{align*}
\Pr\left\{ \max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{a_T T}\sum_{t=1}^T e_{tj}x_{ti} \right| > c_1 \right\}
&\le kp\,\Pr\left\{ \left| \frac{1}{a_T T}\sum_{t=1}^T e_{tj}x_{ti} \right| > c_1 \right\} \\
&\le kp\left( 2\exp\left( -\frac{a_T^2 c_1^2}{4C^2} \right) + C^{-q}\sum_{t=1}^T E|e_{tj}|^{2q} + 2T\exp\left( -\frac{(TC)^{\gamma/2}}{C_1} \right) + \exp\left( -\frac{C}{C_2} \right) \right).
\end{align*}
Since $E|e_{tj}|^{2q} < \infty$, we begin with choosing $C$ such that $kpTC^{-q} \to 0$. This choice of $C$ makes the last two terms degenerate, and for any $c_1 > 0$ the first term also degenerates as long as $a_T \ge b_T\left(kpT\right)^{\frac{1}{2q}}\ln 2kp$ for some $b_T \to \infty$. Since $q$ can be chosen arbitrarily big, we can set $a_T$ at any polynomial rate of $T$ to conclude that
\[
\max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=1}^T e_{tj}x_{ti} \right| = O_p\left( T^d \right) \tag{22}
\]
for any $d > 0$.

Next, we note that by the union bound,
\begin{align*}
\max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=1}^T \left( \zeta_{t-1,j} - \zeta_{t,j} \right)x_{ti} \right| \tag{23}
&\le \max_{1\le i\le k}\max_{1\le j\le p} \frac{\left|\zeta_{0,j}\right| |x_{1i}|}{T} + \max_{1\le i\le k}\max_{1\le j\le p} \frac{\left|\zeta_{T,j}\right| |x_{Ti}|}{T} \\
&\quad + \max_{1\le i\le k}\max_{1\le j\le p} \left| \frac{1}{T}\sum_{t=2}^T \left( \zeta_{t-1,j}\Delta x_{ti} - E\zeta_{t-1,j}\Delta x_{ti} \right) \right| + \max_{1\le i\le k}\max_{1\le j\le p} \left| E\zeta_{t-1,j}\Delta x_{ti} \right| \\
&= O_p\left( \frac{k\log p}{\sqrt{T}} \right) + O_p\left( \frac{\log kp}{\sqrt{T}} \right) + O(1),
\end{align*}
since $\sup_t |x_t|/\sqrt{T} = O_p(1)$, $\max_j \left|\zeta_{0,j}\right| = O_p\left(\sqrt{\log p}\right)$, and we can use Lemma 1 with $\{|xy| > c\} \subset \left\{|x| > \sqrt{c}\right\} \cup \left\{|y| > \sqrt{c}\right\}$. Note that the bound in (22) dominates the one in (23) for $k = O\left(\sqrt{T}\right)$ and any $d > 0$. $\blacksquare$