
International Journal of Forecasting 19 (2003) 217–227
www.elsevier.com/locate/ijforecast

Forecast evaluation with shared data sets

Ryan Sullivan^a, Allan Timmermann^b,*, Halbert White^a

^a Bates White & Ballentine, 2001 K Street NW, Washington, DC 20006, USA
^b Department of Economics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508, USA

*Corresponding author. Tel.: +1-858-534-4860; fax: +1-858-534-7040. E-mail address: [email protected] (A. Timmermann), http://www.econ.ucsd.edu/atimmerm (A. Timmermann).

Abstract

Data sharing is common practice in forecasting experiments in situations where fresh data samples are difficult or expensive to generate. This means that forecasters often analyze the same data set using a host of different models and sets of explanatory variables. This practice introduces statistical dependencies across forecasting studies that can severely distort statistical inference. Here we examine a new and inexpensive recursive bootstrap procedure that allows forecasters to account explicitly for these dependencies. The procedure allows forecasters to merge empirical evidence and draw inference in the light of previously accumulated results. In an empirical example, we merge results from predictions of daily stock prices based on (1) technical trading rules and (2) calendar rules, demonstrating both the significance of problems arising from data sharing and the simplicity of accounting for data sharing using these new methods.
© 2001 International Institute of Forecasters. Published by Elsevier Science B.V. All rights reserved.

Keywords: Forecast evaluation; Bootstrap; Data sharing; Calendar effects; Technical trading; Data mining

1. Introduction

Data sharing is common practice in forecasting experiments. This practice is often inevitable. There is only one time series for the US gross domestic product and only one history of stock market prices for US firms. Investigators intending to predict a given economic time series will typically not be the first or only persons to have looked at the data. Studies may generate conflicting results although they use the same data set. Such an outcome could be a result of the use of different functional forms, explanatory variables or methodology, and one research design may not entirely dominate others. In this situation it is important to have available a procedure that allows outsiders to merge the reported forecasting results and draw conclusions across studies.

Closely related to the problems arising from data sharing are biases in statistical inference arising from data mining. Data mining is the practice of re-using a given set of data for purposes of inference or model selection. It is widely regarded as an important procedure for extracting information from data sets in the engineering, physical and biological sciences. The reason for this is obvious: important regularities in available data that may not have been predicted by theory or appear as a result of a simple preliminary statistical analysis may nevertheless emerge from systematic analysis of the data.


However, Chatfield (1995) demonstrates forcefully that if the same data set is used to formulate, estimate and test a model, serious biases in inference are likely to result. As increasingly advanced statistical fitting procedures are brought to bear on particular data sets, the dangers of over-fitting induced by data mining become ever greater, and a procedure that controls for the resulting biases is urgently needed.

If forecasters subsequently use fresh data samples to test propositions based on close inspection of an initial data sample, then the standard assumptions underlying statistical inference need not be violated. This ideal situation exists whenever new data can easily be generated or the initial data set is sufficiently large that a hold-out sample can be reserved for subsequent prediction analysis. However, in the common situation where new data are unavailable and the same data are used to form an hypothesis and test it, the standard assumptions underlying classical statistical inference will be violated. Moreover, if sufficiently many models are fitted to a finite data sample, then by pure chance some of these models are bound to detect patterns, even if these are truly spurious. This phenomenon has been well documented in the literature on subset selection, cf. Miller (1990).

Distortions in the research community's inference as a result of data mining are widely recognized. Economics Nobel Laureate Robert Merton puts it this way: "Is it reasonable to use the standard t-statistic as a valid measure of significance when the test is conducted on the same data used by many earlier studies whose results influenced the choice of theory to be tested?" (Merton, 1987, p. 107).

The plan for the paper is as follows. Section 2 describes the existing procedures for handling data mining. Section 3 outlines the distributional theory underlying our proposal and compares alternative Monte Carlo and bootstrap procedures. Section 4 provides an empirical forecasting application and Section 5 concludes.

2. Existing procedures for dealing with data-snooping biases

Because it constitutes such a widespread problem, a variety of procedures have been developed to account for the biases in statistical inference resulting from data sharing and data mining in forecast evaluation. We describe some of the more common of these practices in turn and explain their potential shortcomings.

2.1. Report the full set of models investigated in a particular study

This would seem to be the best research strategy available to researchers who want to account for possible contamination of their analysis due to the inspection of a large number of competing models. Nevertheless, even if a researcher only entertains a small number of models and honestly reports the full set of test results, he is still likely to be strongly influenced by his peers, who may previously have inspected the same or closely related data sets. This is a real problem as science often progresses by cumulating evidence from earlier studies conducted by different groups of researchers, each group building on previously reported work. Even if a particular researcher controls for his own data mining, it may be impossible to control for other researchers' efforts. This contaminates the data for purposes of hypothesis testing. How important this source of contamination is depends on how many models previous researchers have studied (cf. Lo & MacKinlay, 1990), the correlation between the combined set of (new and old) models, the sample size and the stochastic structure of the shared data set. Unless all these effects are accounted for and quantified, correct inference becomes extremely difficult.

2.2. Bonferroni bounds

Bonferroni bounds provide a statistical procedure for bounding the probability that the best available model does not outperform the benchmark against which it is compared. Let $p_i$ be the probability ($p$-value) associated with the null hypothesis that the $i$th model does not outperform the benchmark, and suppose that $l$ forecasting models have been considered by the research community. Further suppose that interest lies in testing the joint null hypothesis that none of the models is superior to some given benchmark.


The Bonferroni bound simply states that, for arbitrary correlations between the performance statistics used to evaluate the models, the $p$-value for the joint test is given by

$$p \le \min\bigl(l \cdot \min(p_1, \ldots, p_l),\, 1\bigr). \qquad (1)$$

Hence the procedure scales the smallest $p$-value, representing the strongest evidence against the null hypothesis, by the number of models under consideration to obtain an upper bound on the probability that the best model does not reject the null. Unfortunately, this practice can be extremely conservative, particularly when there are strong positive correlations between performance measures, as is often the case. In applications where many studies have been conducted and $l$ is consequently very large, the bound simply states that $p$ is less than one.
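As a concrete illustration (ours, not the paper's), the bound in Eq. (1) takes one line of code; the hypothetical $p$-values below show how the bound degenerates once $l$ is large:

```python
import numpy as np

def bonferroni_joint_pvalue(p_values):
    """Upper bound on the joint p-value across l models, Eq. (1):
    p <= min(l * min(p_1, ..., p_l), 1)."""
    p = np.asarray(p_values, dtype=float)
    return min(p.size * p.min(), 1.0)

# Three models: the bound is still informative.
print(bonferroni_joint_pvalue([0.012, 0.40, 0.25]))      # 0.036
# 9452 models (the calendar-rule count of Section 4): the bound is vacuous.
print(bonferroni_joint_pvalue([0.012] + [0.50] * 9451))  # 1.0
```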

2.3. Wait for new data to become available

While this is sometimes feasible, it also can be a very costly and slow procedure. For example, economic theory may predict dynamic structures in economic variables at the business cycle frequency. With only eight post-war recessions, researchers would have to wait for an extraordinarily long period of time before getting sufficient new data on which to test the validity of such hypotheses. Furthermore, even if time is not of the essence, structural breaks or regime switches of the type modeled by Hamilton (1989) could render new data useless for purposes of testing the original effect.

2.4. Use similar data from other sources

This is a common practice in the social sciences. For example, suppose that a researcher has discovered that a certain variable predicts US stock prices. In the absence of new data from the US, the researcher may use international data from other stock markets to see if the finding holds only for US stock prices or holds more generally in other markets. Presence of the pattern in other markets is then considered strong corroborative evidence for the hypothesis, while absence of the pattern is interpreted as evidence against the hypothesis. There are two problems with this approach. First, the time series of the dependent and explanatory variables are often strongly correlated and are far from representing independent samples. Second, because institutional structures differ across markets, the hypothesized effect could well be present in the US data but not in other markets. In this case failure to find evidence of the effect from other markets does not necessarily lead to a revision in the $p$-value computed from the US study.

3. Accounting for dependencies across forecasting studies

We argued in the Introduction that dependencies across studies potentially contaminate statistical inference about forecasting performance, and in the previous section we suggested that common current practices to deal with this contamination are not satisfactory. In this section we set up a classical statistical framework for quantifying such dependencies and discuss procedures that can quantify the biases. Our interest lies in conducting inference using many models on the same data set and thus is distinct from meta analysis, which is concerned with the combination of data from multiple sources (cf. Mosteller & Chalmers, 1992). The principle is to account for dependencies across models by evaluating the probability distribution of the performance measure of interest in the context of the full universe of models leading to the best-performing model. This is the logical way to handle data-mining distortions induced by conducting statistical inference after (and hence conditional on) model selection. By considering the full set of models entertained by the community of researchers we effectively undo the conditioning on the model selection. A comparison of the distribution of the performance statistic of the best model when standing alone (and hence conditional on the model selection process) to the distribution of the performance statistic in the context of the full universe of models quantifies the size of the model selection bias.

Here we evaluate model performance by predictive accuracy, but we emphasize that the approach is readily applicable in other contexts. Suppose then that each model produces a sequence of forecasts that depend on a set of predictor variables and recursively updated parameter estimates.


Suppose further that $l$ models of a time series process have been (jointly) considered by the research community. To account for dependencies across models, the test procedure considers the distribution of the $l \times 1$ performance statistic

$$\bar{f} = n^{-1} \sum_{t=R}^{T} \hat{f}_{t+1}, \qquad (2)$$

where $n$ is the number of prediction periods indexed from $R$ through $T$ so that $T = R + n - 1$, $\hat{f}_{t+1} = f(Z_{t+1}, \hat{\beta}_t)$ is the observed performance measure for period $t+1$, and $\hat{\beta}_t$ is a vector of recursively updated parameter estimates. The first $R$ observations are used to obtain the initial estimate, $\hat{\beta}_R$. The estimation window is then expanded such that $\hat{\beta}_{R+1}$ is estimated from the first $R+1$ observations, and so on. $Z_{t+1}$ is a vector of variables containing both target and predictor variables. Target variables are observed for the first time in period $t+1$, whereas predictor variables are available at time $t$. The setup is assumed to satisfy the conditions of Diebold and Mariano (1995) or West (1996). The elements $f_{k,t+1}$ of $f_{t+1}$ measure the performance of the individual models $k = 1, \ldots, l$ relative to some given benchmark. The relevant null hypothesis is that the best model is not capable of outperforming the benchmark:

$$H_0: \max_{k=1,\ldots,l} \{E[f_k^*]\} \le 0, \qquad (3)$$

where $E[f_k^*] \equiv E[f_k(Z_t, \beta^*)]$ and $\beta^* = \operatorname{plim} \hat{\beta}_t$.

The seminal papers by Diebold and Mariano (1995) and West (1996) develop methods for assessing the out-of-sample predictions generated by economic models. Corradi, Swanson and Olivetti (2001) compare the assumptions underlying these two papers, the key being that West (1996) accounts for parameter estimation errors. Building on this work, White (2000) shows that, under the element of $H_0$ least favorable to the alternative,

$$\max_{k=1,\ldots,l} n^{1/2} \bar{f}_k \overset{d}{\to} \max_{k=1,\ldots,l} \{C_k\}, \qquad (4)$$

where $C \sim N(0, V)$ is an $l \times 1$ multivariate normally distributed random vector with elements $C_k$ having a specific covariance matrix $V$, and $\overset{d}{\to}$ denotes convergence in distribution. This result is intuitively satisfying. If only a single model is considered, then by standard results its performance measure would follow an asymptotic normal distribution. Suppose now that the best model has been selected from a universe of $l$ candidate models, all of which may be correlated. Then its $p$-value should be evaluated based on the maximum value drawn from an $l$-dimensional normal distribution with mean zero and a covariance matrix reflecting the correlations across the included models' performances.

3.1. Monte Carlo methods

The distribution of $\max_{k=1,\ldots,l} \{C_k\}$ for general $V$ is analytically intractable except when $l = 1$ or $l = 2$, as White (2000) discusses. Hence one has to resort to numerical evaluation of this statistic. One possibility is to use Monte Carlo simulation methods. These can proceed as follows. First consider the case where $l = 1$. Here the problem is simply to find the distribution of

$$C_1 = \max C_1 \sim N(0, \sigma_1^2). \qquad (5)$$

In this case Monte Carlo methods are unnecessary (although easily carried out) because standardizing yields a statistic $n^{1/2} \bar{f}_1 / \sigma_1$ that is standard normal. $\sigma_1^2$ is unknown but can be replaced by a consistent estimator that converges in probability to $\sigma_1^2$: $\hat{\sigma}_1^2 \overset{p}{\to} \sigma_1^2$. Estimation of $\sigma_1^2$ is complicated by its dependency on the full set of autocovariances of the underlying performance measure and requires estimation of a spectral density at frequency zero. Nevertheless, there are a variety of procedures appropriate for doing this. For example, we can use the stationary bootstrap estimates of Politis and Romano (1994) or the HAC estimates of Newey and West (1987).

Next consider the case in which $l = 2$. This case requires computation of the distribution of $\max_{k=1,2} \{C_k\}$. The $2 \times 1$ vector $C \sim N(0, V)$ has covariance matrix

$$V = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix},$$

and estimates $\hat{\sigma}_1^2$, $\hat{\sigma}_{12}$, and $\hat{\sigma}_2^2$ must be obtained. To estimate these consistently, we need to have available $\hat{f}_{k,t}$, $k = 1, 2$, $t = 1, \ldots, T$. Once the estimates have been obtained, the Monte Carlo proceeds as follows. Draw

$$C_{1i} \sim N(0, \hat{\sigma}_1^2), \qquad C_{2i} = \hat{\beta}_{21} C_{1i} + \hat{\gamma}_2 \varepsilon_{2i}, \qquad \varepsilon_{2i} \sim \text{iid } N(0, 1), \qquad (6)$$

where $\hat{\beta}_{21}$ and $\hat{\gamma}_2$ are chosen such that $E[C_{2i}^2] = \hat{\sigma}_2^2$ and $E[C_{1i} C_{2i}] = \hat{\sigma}_{12}$. This implies that $\hat{\beta}_{21} = \hat{\sigma}_{12}/\hat{\sigma}_1^2$ and $\hat{\gamma}_2^2 = \hat{\sigma}_2^2 - \hat{\sigma}_{12}(\hat{\sigma}_1^2)^{-1}\hat{\sigma}_{12}$. From the simulated series we can evaluate the performance measure $\max\{C_{1i}, C_{2i}\}$. Repeating this procedure $B$ times yields a histogram for the maximum of the mean performance measure against which the mean of the best-performing model from the actual sample can be compared.
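To make the two-model Monte Carlo concrete, the sketch below (our illustration) simulates the draws of Eq. (6) and compares the observed statistic $\max_k n^{1/2}\bar{f}_k$ against the simulated maxima; the covariance inputs are hypothetical placeholders for HAC or stationary-bootstrap estimates:

```python
import numpy as np

def mc_pvalue_two_models(v_obs, s11, s12, s22, B=10_000, seed=0):
    """Monte Carlo p-value for max{C_1, C_2} with C ~ N(0, V), via Eq. (6):
    C_1 ~ N(0, s11); C_2 = b21*C_1 + g2*eps matches Var(C_2) = s22, Cov = s12."""
    rng = np.random.default_rng(seed)
    b21 = s12 / s11                          # hat(beta)_21 = s12 / s11
    g2 = np.sqrt(s22 - s12 ** 2 / s11)       # hat(gamma)_2
    c1 = rng.normal(0.0, np.sqrt(s11), B)
    c2 = b21 * c1 + g2 * rng.normal(0.0, 1.0, B)
    return np.mean(np.maximum(c1, c2) >= v_obs)  # share of simulated maxima >= observed

# Hypothetical observed statistic and covariance estimates:
print(mc_pvalue_two_models(v_obs=2.1, s11=1.0, s12=0.6, s22=1.2))
```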


In the general case with an arbitrarily large value of $l$, we have to evaluate the distribution of $\max_{k=1,\ldots,l} \{C_k\}$. For the $l$th term this amounts to estimating $\sigma_{l1}, \sigma_{l2}, \ldots, \sigma_{ll}$ and then simulating $C_{li}$ from the equation

$$C_{li} = \hat{\beta}_{l1} C_{1i} + \hat{\beta}_{l2} C_{2i} + \cdots + \hat{\beta}_{l,l-1} C_{l-1,i} + \hat{\gamma}_l \varepsilon_{li} \qquad (7)$$

for suitable choices of $\hat{\beta}_{lj}$ and $\hat{\gamma}_l$. Again a histogram for $\max\{C_{1i}, C_{2i}, \ldots, C_{li}\}$ can be formed using $B$ separate Monte Carlo simulations, and this can be compared to the actual performance of the best model to form a $p$-value.

We conclude from this description of the Monte Carlo procedure that, in order to evaluate the data-snooping bias, first, one would need to keep all $l \cdot n$ values of $\hat{f}_{k,t}$ in memory. Secondly, one would need to estimate $l(l+1)/2$ covariance terms $\hat{\sigma}_{jk}$ explicitly. Third, one would effectively have to perform regressions based on $\hat{\sigma}_{jk}$ in order to get the estimates $\hat{\beta}_{jk}$ used in the simulations. Finally, every time an extra model is added to the set, one would have to either keep all previous Monte Carlo draws in memory or redraw. See White (2000) for further discussion.

3.2. The bootstrap methods

As White (2000) shows, an alternative to the Monte Carlo approach is to simulate the needed distribution by means of the stationary bootstrap of Politis and Romano (1994) applied to the observed values of $\hat{f}_{k,t}$. Resampling the performance measures from the forecasting rules yields $B$ bootstrapped values of $\bar{f}_k$, which we denote by $\bar{f}^*_{k,i}$, where $i$ indexes the $B$ bootstrap samples. Then we can construct the following statistics:

$$\bar{V}_l = \max_{k=1,\ldots,l} \{n^{1/2} \bar{f}_k\}, \qquad (8)$$

$$\bar{V}^*_{l,i} = \max_{k=1,\ldots,l} \{n^{1/2} (\bar{f}^*_{k,i} - \bar{f}_k)\}, \qquad i = 1, \ldots, B. \qquad (9)$$

Comparing the performance measure from the actual data ($\bar{V}_l$) to the quantiles from the bootstrap experiment ($\bar{V}^*_{l,i}$), one obtains White's bootstrap reality check $p$-value for the null hypothesis. Using the maximum value of the performance measure across all $l$ models ensures that the effects of data mining are accounted for.

The sample statistic can also be based on functions $g$ of sample moments $\bar{h}_k$, $k = 0, \ldots, l$:

$$\bar{f}_k = g(\bar{h}_k) - g(\bar{h}_0), \qquad (10)$$

where $\bar{h}_k$ and $\bar{h}_0$ are averages computed over the prediction sample for the $k$th model and the benchmark, respectively:

$$\bar{h}_k = n^{-1} \sum_{t=R}^{T} \hat{h}_{k,t+1}. \qquad (11)$$

In this case the bootstrap procedure is applied to yield $B$ bootstrapped values of $\bar{f}_k$, denoted as $\bar{f}^*_{k,i}$, where

$$\bar{f}^*_{k,i} = g(\bar{h}^*_{k,i}) - g(\bar{h}^*_{0,i}), \qquad i = 1, \ldots, B, \qquad (12)$$

$$\bar{h}^*_{k,i} = n^{-1} \sum_{t=R}^{T} \hat{h}^*_{k,t+1,i}, \qquad i = 1, \ldots, B. \qquad (13)$$

The distribution needed to evaluate the $p$-value is that of $n^{1/2}(\bar{f} - E[f^*])$. The bootstrap computes $n^{1/2}(\bar{f}^* - \bar{f})$, where $\bar{f}^* = n^{-1} \sum_{t=R}^{T} \hat{f}_{u_t}$ and $t = R, \ldots, T$. The random indexes $u_t$ are chosen from $R, \ldots, T$ according to the stationary bootstrap of Politis and Romano (1994):

for $t = R$, $u_t$ is drawn uniformly on $\{R, \ldots, T\}$;
for $t > R$, $U$ is drawn uniformly on $[0, 1]$:
    if $U > q$, pick $u_t$ uniformly on $\{R, \ldots, T\}$;
    if $U \le q$, pick $u_t = u_{t-1} + 1$, and reset $u_t$ to $R$ if $u_t = T + 1$,

where $q$ is a smoothing parameter chosen to accommodate dependence in the data.
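A minimal sketch of this index-generation scheme, under our reading of the rules just stated (function and variable names are ours):

```python
import numpy as np

def stationary_bootstrap_indexes(R, T, q, B, seed=12345):
    """B rows of resampling indexes u_R, ..., u_T, each drawn on {R, ..., T}."""
    rng = np.random.default_rng(seed)
    n = T - R + 1
    idx = np.empty((B, n), dtype=int)
    for i in range(B):
        u = rng.integers(R, T + 1)       # t = R: draw uniformly on {R, ..., T}
        idx[i, 0] = u
        for t in range(1, n):
            if rng.uniform() > q:        # U > q: start a new block
                u = rng.integers(R, T + 1)
            else:                        # U <= q: continue the current block
                u += 1
                if u == T + 1:           # wrap around to R
                    u = R
            idx[i, t] = u
    return idx

# Example: five resamples of a ten-observation prediction period.
print(stationary_bootstrap_indexes(R=100, T=109, q=0.9, B=5))
```

Since the indexes depend only on $R$, $T$, $q$, $B$ and the seed, any researcher holding these values can regenerate exactly the same resamples, which is what the recursive updating scheme below exploits.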


White's Theorem 2.3, which we reproduce below, states that the distribution of the performance measure can be evaluated by means of the stationary bootstrap.

Theorem 2.3. (White, 2000) Suppose West's (1996) conditions and Politis and Romano's (1994) conditions hold for the elements of $f(Z_t, \beta^*)$. Also suppose that, for all $k$-vectors $\lambda$, $\lambda'\lambda = 1$,

$$\Pr\left( \limsup_{t} \frac{t^{1/2}\,|\lambda'(\hat{\beta}_t - \beta^*)|}{\bigl(\lambda'\Gamma\lambda \log(\log(\lambda'\Gamma\lambda\, t))\bigr)^{1/2}} = 1 \right) = 1,$$

where $\hat{\beta}_t$ is such that $t^{1/2}(\hat{\beta}_t - \beta^*) \overset{d}{\to} N(0, \Gamma)$. Let $F = 0$ or $(n/R)\log(\log(R)) \to 0$. Then

$$\rho\Bigl( \mathcal{L}\bigl(n^{1/2}(\bar{f}^* - \bar{f}) \mid Z_1, \ldots, Z_{T+1}\bigr),\ \mathcal{L}\bigl(n^{1/2}(\bar{f} - E(f^*)) \mid Z_1, \ldots, Z_{T+1}\bigr) \Bigr) \overset{p}{\to} 0,$$

where $\mathcal{L}$ denotes the probability law of its argument and $\rho$ is any metric metrizing convergence in distribution.

An immediate consequence of this result, which justifies using the bootstrap procedure, is that

$$\rho\Bigl( \mathcal{L}\bigl(\max_{k=1,\ldots,l} n^{1/2}(\bar{f}^*_k - \bar{f}_k) \mid Z_1, \ldots, Z_{T+1}\bigr),\ \mathcal{L}\bigl(\max_{k=1,\ldots,l} n^{1/2}(\bar{f}_k - E(f^*_k)) \mid Z_1, \ldots, Z_{T+1}\bigr) \Bigr) \overset{p}{\to} 0. \qquad (14)$$

Implementation of the bootstrap is simple and involves comparing $\bar{V}_l = \max_{k=1,\ldots,l} \{n^{1/2} \bar{f}_k\}$ to the distribution of $\bar{V}^*_l = \max_{k=1,\ldots,l} \{n^{1/2}(\bar{f}^*_k - \bar{f}_k)\}$. Here $\bar{f}^*_k$ is the $k$th component of $\bar{f}^*$ obtained by resampling $\hat{f}_{k,t+1}$ according to the stationary bootstrap. A very attractive property of the bootstrap is its recursive structure, which greatly economizes on the information required to update the $p$-value for the null hypothesis that no model outperforms the benchmark. This can be seen from the expression

$$\max_{k=1,\ldots,l} n^{1/2}(\bar{f}^*_{k,i} - \bar{f}_k) = \max\Bigl( n^{1/2}(\bar{f}^*_{l,i} - \bar{f}_l),\ \max_{k=1,\ldots,l-1} n^{1/2}(\bar{f}^*_{k,i} - \bar{f}_k) \Bigr), \qquad i = 1, \ldots, B. \qquad (15)$$

For a given bootstrap iteration ($i$) the $p$-value for the null hypothesis can be updated recursively using a simple spreadsheet:

value of $k$ | performance statistic | compare to the distribution of
$k = 1$ | $\bar{V}_1 = n^{1/2}\bar{f}_1$ | $\bar{V}^*_{1,i} = n^{1/2}(\bar{f}^*_{1,i} - \bar{f}_1) \to N(0, V_{11})$
$k = 2$ | $\bar{V}_2 = \max(n^{1/2}\bar{f}_2,\ \bar{V}_1)$ | $\bar{V}^*_{2,i} = \max(n^{1/2}(\bar{f}^*_{2,i} - \bar{f}_2),\ \bar{V}^*_{1,i})$
... | ... | ...
$k = l$ | $\bar{V}_l = \max(n^{1/2}\bar{f}_l,\ \bar{V}_{l-1})$ | $\bar{V}^*_{l,i} = \max(n^{1/2}(\bar{f}^*_{l,i} - \bar{f}_l),\ \bar{V}^*_{l-1,i})$

An important consequence of this economy of information is that it facilitates the communication of research results across forecasting experiments. When additional models are fitted to the same data set, updating the bootstrap simply requires a knowledge of the histogram of $\max_{k=1,\ldots,l-1} n^{1/2}(\bar{f}^*_{k,i} - \bar{f}_k)$, $i = 1, \ldots, B$. The recursions (15) make it clear that to continue a specification search using the bootstrap, it suffices to know $\bar{V}_{l-1}$, $\bar{V}^*_{l-1,i}$ and the indexes $u_t$, $i = 1, \ldots, B$. For the latter, knowledge of $R$, $T$, $q$, $B$, the random number generator (RNG) and the RNG seed suffice. Knowing or storing $\hat{f}_{k,t+1}$ is unnecessary, nor do we need to compute or store $\hat{V}$ or the previous Monte Carlo draws. This demonstrates not only a significant computational advantage for the bootstrap method over the Monte Carlo version but also the possibility for researchers at different locations or at different times to further understanding of the phenomenon modeled without needing to know the specifications tested by their collaborators or competitors. Some cooperation is nevertheless required, as $R$, $T$, $q$, $B$, the RNG, the RNG seed and $\bar{V}_{l-1}$, $\bar{V}^*_{l-1,i}$, $i = 1, \ldots, B$, must still be shared, along with the data and the specification and estimation method for the benchmark model.
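The recursion in Eq. (15) and the table above translate directly into a few lines of code. The sketch below (our notation; for brevity, iid resampling stands in for the stationary bootstrap) updates the running statistics one model at a time and reports the current reality check $p$-value:

```python
import numpy as np

def update_reality_check(V, V_star, fbar_new, fbar_star_new, n):
    """One step of the recursion in Eq. (15).
    V             : running max of sqrt(n)*fbar_k over the models seen so far
    V_star        : length-B array of running maxima of sqrt(n)*(fbar*_{k,i} - fbar_k)
    fbar_new      : sample mean performance of the newly added model
    fbar_star_new : length-B array of bootstrapped means for the new model
    Returns the updated (V, V_star) and the reality check p-value."""
    rn = np.sqrt(n)
    V = max(V, rn * fbar_new)
    V_star = np.maximum(V_star, rn * (fbar_star_new - fbar_new))
    return V, V_star, np.mean(V_star >= V)

# Toy usage: three models with no true predictive ability.
B, n = 500, 1000
rng = np.random.default_rng(0)
V, V_star = -np.inf, np.full(B, -np.inf)
for k in range(3):
    f = rng.normal(0.0, 1.0, n)    # relative performance series of model k+1
    fbar_star = np.array([rng.choice(f, n).mean() for _ in range(B)])
    V, V_star, p = update_reality_check(V, V_star, f.mean(), fbar_star, n)
    print(f"after model {k + 1}: reality check p-value = {p:.2f}")
```

Note that the loop body consults only the new model's performance series plus the carried-over pair ($\bar{V}$, $\bar{V}^*_i$); nothing about the earlier models is needed.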


4. Empirical application: evaluating the efficient market hypothesis using daily stock price data

To demonstrate how easily data can be shared for purposes of conducting inference that accounts for dependencies across studies, we demonstrate in this section how to merge two separate forecasting experiments applied to daily stock prices on the Dow Jones Industrial Average and reported in more detail in Sullivan, Timmermann and White (1999, 2001). The sample spans the period January 1, 1897 to June 30, 1998 and has 27,447 observations. To the best of our knowledge, this is the first time a statistical experiment of this nature has been conducted.

The first experiment considers forecasting daily stock prices by means of technical trading rules. These rules attempt to discover repeated patterns in stock prices, for example by comparing short-term movements to long-term trends. We consider 497 filter rules, 2049 moving average rules, 1220 support and resistance rules, 2040 channel break-out rules and 2040 on-balance volume average rules. Each rule uses a different parameterization and the total number of technical trading rules, which we refer to as $l_1$, is 7846.

We also consider separately the extent to which there are calendar effects in stock returns. An extensive literature in finance reports evidence of day-of-the-week, week-of-the-month, month-of-the-year and holiday effects in mean returns on stocks. See Lakonishok and Smidt (1988) for a summary of the literature. The best known calendar effect is probably the Monday effect: stocks have been found to have lower mean returns on Mondays than on other days of the week. We consider a total of $l_2 = 9452$ calendar rules, all of which reasonably could have been explored by an investor in search of a calendar pattern. These include 60 day of the week rules, 60 week of the month rules, 8188 month of the year rules, 100 semi-month rules, eight holiday rules, 12 end of December rules, and 1024 turn of the month rules. The combined experiment is comprised of $7846 + 9452 = 17{,}298$ ($l_1 + l_2 = l$) trading rules. Since the prediction methods used in both studies do not require estimation of any parameters, the parameterizations of the resulting decision rules ($\beta_k$, $k = 1, \ldots, l$) directly generate returns that are then used to measure performance.

Performance is measured using the continuously compounded returns on a forecasting rule relative to returns from following a passive strategy of always being in the market:

$$f_{k,t+1} = \ln(1 + y_{t+1} S_k(x_t, \beta_k)) - \ln(1 + y_{t+1} S_0(x_t, \beta_0)), \qquad k = 1, \ldots, l, \qquad (16)$$

where $x_t$ is the information that forms the basis for deciding the position to hold from the end of period $t$ to the end of period $t+1$, $y_{t+1} = (X_{t+1} - X_t)/X_t$ is the holding period return, and $X_t$ is the stock price series under investigation. $S_k$ and $S_0$ are signal functions that convert $x_t$ into trading positions for given system parameters $\beta_k$. We consider the following range of signal functions: 1 represents a long position, 0 a neutral position (i.e., being out of the market), and $-1$ a short position (selling the asset for future delivery).

The efficient market hypothesis is usually taken to imply that there does not exist a trading system that can outperform the market index. Hence the relevant null hypothesis is that there does not exist a forecasting rule that generates performance superior to always being in the market:

$$H_0: \max_{k=1,\ldots,l} \{E(f_k)\} \le 0. \qquad (17)$$

Rejection of this null hypothesis implies that the best model is capable of outperforming the benchmark according to the performance measures embodied in $f$. Although we follow the finance literature in measuring performance through mean returns, we notice that performance measures that account for risk, such as the Sharpe ratio, can easily be computed.
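In code, the relative performance series of Eq. (16) for a single rule is immediate. The sketch below is our own illustration: `X` is a hypothetical daily price series, `signal` takes values in {-1, 0, 1}, and the benchmark $S_0$ is always long:

```python
import numpy as np

def relative_performance(X, signal):
    """f_{k,t+1} = ln(1 + y_{t+1} S_k(x_t)) - ln(1 + y_{t+1} S_0), with S_0 = 1."""
    X = np.asarray(X, dtype=float)
    y = (X[1:] - X[:-1]) / X[:-1]             # holding period returns y_{t+1}
    s = np.asarray(signal, dtype=float)[:-1]  # position decided at the end of period t
    return np.log(1.0 + y * s) - np.log(1.0 + y)

# Toy calendar-style rule: neutral every fifth day, long otherwise.
X = 100 * np.cumprod(1 + np.random.default_rng(1).normal(0.0, 0.01, 250))
signal = np.where(np.arange(250) % 5 == 0, 0, 1)
print(relative_performance(X, signal).mean())  # sample mean relative performance
```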


Table 1
Dow Jones Industrial Average: full sample (January 1897–June 1998) calendar frequency and technical trading rules

Mean return criterion

Calendar frequency trading rules (9452 rules)
  Benchmark: 4.80%
  Best rule: Monday effect (neutral on Mondays, long otherwise)
  Performance: 8.55%
  Nominal p-value: 0.00
  White's p-value: 0.29

Technical trading rules (7846 rules)
  Benchmark: 4.80%
  Best rule: 2-day-on-balance volume
  Performance: 17.07%
  Nominal p-value: 0.00
  White's p-value: 0.00

The performance on the full sample of the Dow Jones Industrial Average, according to the mean return criterion, is provided for two sets of rules: (1) the calendar frequency trading rules, and (2) the technical trading rules. The performance measure is displayed for the benchmark (buy-and-hold) trading strategy and the best performing rule, along with the nominal and data-mining adjusted p-values for the best performing rule. The type of rule that exhibits the best performance is also described.

Table 1 shows that the best technical trading rule (mean return of 17.07%) appears to be capable of producing genuine outperformance relative to the benchmark of always being invested in the market portfolio (mean return of 4.80%). The $p$-value corrected for data mining is zero for practical purposes even when the search for the best technical trading rule across 7846 models is considered. In stark contrast, there is no evidence that the best calendar model (mean return of 8.55%) outperforms the benchmark, once data mining is taken into account within the set of calendar rules. Notice that the data-mining correction makes a big difference in this second case: when the best calendar rule (the Monday rule) is considered in isolation it appears to be highly significant with a $p$-value close to zero. However, when the data-mining effects are taken into consideration, the $p$-value rises to 0.29 and the first conclusion is completely overturned. Further details of these separate experiments are reported in Sullivan et al. (1999, 2001).

Table 2
Dow Jones Industrial Average: full sample (January 1897–June 1998) combined universe of trading rules

Mean return criterion

Combined universe of trading rules (17,298 rules)
  Benchmark: 4.80%
  Best rule: 2-day-on-balance volume
  Performance: 17.07%
  Nominal p-value: 0.00
  White's p-value: 0.00

The performance on the full sample of the Dow Jones Industrial Average, according to the mean return criterion, is provided for the combined universe of trading rules, which includes both calendar frequency trading rules and technical trading rules. The performance measure is displayed for the benchmark (buy-and-hold) trading strategy and the best performing rule, along with the nominal and data-mining adjusted p-values for the best performing rule. The type of rule that exhibits the best performance is also described.

Fig. 1. Mean return criterion. Dow Jones Industrial Average (1897–1998).


Separate consideration of these studies thus leaves the issue of testing the efficient markets hypothesis unsettled. One study leads to a rejection of the null while the other study does not reject the null. Suppose that the calendar effect study was first conducted and showed no evidence of genuine predictability. Subsequently the technical trading experiment is conducted and shows evidence of predictability. Ultimately the purpose of the two studies is the same, namely to test the efficient market hypothesis. How should a researcher revise his beliefs about the validity of the (shared) null hypothesis that stock prices cannot be predicted? Inspection of the separate analyses alone cannot decide the issue. Clearly there is a need for a procedure that allows us to merge the results from the two studies.

In Table 2 we present the results from the bootstrap experiment that uses the same random number seed to combine the two studies. The total number of models under consideration is now $l = l_1 + l_2 = 17{,}298$. No study in economics has come even close to considering simultaneously such a large number of models. The results show that the mean return performance of the best technical trading rule in fact is so good over the full sample period that it continues to be significant at conventional critical levels even when considered in the full universe comprising both calendar rules and technical trading rules.

Fig. 1 shows each model's mean return performance. Two separate lines track the performance of the best model updated across all models as the universe of models expands, as well as the $p$-value adjusted for data mining in this expanding universe. The first 9452 models comprise the calendar rules while the subsequent 7846 models comprise the technical trading rules. Since the ordering of the models is arbitrary, only the terminal $p$-value ultimately matters. Nevertheless, the picture provides fascinating insights into data-mining effects. Among the set of calendar rules, the best-performing forecasting model is identified early on and, since no improvement occurs within this subset of rules, the $p$-value slowly drifts upward, reflecting the effect of having drawn the best model from a distribution with a wider support. The picture changes dramatically once the technical trading rules are introduced. After approximately 9700 models have been considered, a technical trading rule with a very significant outperformance reduces the bootstrap $p$-value to a number close to zero. Since the performance of this model is very strong, adding further technical trading rules does not lead to any visible increase in the bootstrap $p$-value.


The above results are valid for the lengthy sample January 1897 to June 1998. However, the quality of the data at the beginning of the sample is unlikely to be sufficiently high to represent prices at which investors could in fact have carried out transactions. The practical implications for market efficiency of these full-sample tests may thus be limited. To deal with this, we consider separately data on the very liquid S&P 500 futures price index. Our sample starts 1 year after the contract started public trading, namely in 1984, and finishes in 1996. This recent sample is likely to provide a more realistic test of market efficiency. Results analogous to those just described are presented in Table 3. We see that there is no longer any evidence of predictable patterns in stock returns. Even the best model does not outperform the benchmark on a mean return basis. As a further confirmation of the lack of evidence against the null hypothesis, Fig. 2 shows that, as the universe of models expands, the $p$-value never goes below 0.60.

Fig. 2. Mean return criterion. S&P 500 futures (1984–1996).

Table 3
DJIA and S&P futures: out-of-sample combined universe of trading rules

Mean return criterion

DJIA, 1987–1996: combined universe of trading rules (17,298 rules)
  Benchmark: 13.55%
  Best rule: Week of the month (1, 2, 3, 4, 5 = 1, 1, 1, 0, 1)
  Performance: 17.27%
  Nominal p-value: 0.10
  White's p-value: 0.98

S&P 500, 1984–1996: combined universe of trading rules (17,298 rules)
  Benchmark: 8.01%
  Best rule: Week of the month (1, 2, 3, 4, 5 = 1, 1, 1, 0, 1)
  Performance: 10.69%
  Nominal p-value: 0.22
  White's p-value: 0.99

The performance on the out-of-sample Dow Jones Industrial Average and the S&P 500 futures, according to the mean return criterion, is provided for the combined universe of trading rules, which includes both calendar frequency trading rules and technical trading rules. The performance measure is displayed for the benchmark (buy-and-hold) trading strategy and the best-performing rule, along with the nominal and data-mining adjusted p-values for the best performing rule. The type of rule that exhibits the best performance is also described.

These results suggest the following. First, a scientist who ignores the effects of data mining would be led to believe that both calendar rules and technical trading rules that outperform the market benchmark do indeed exist. However, our investigation suggests that this conclusion is premature. While there is evidence in the full historical sample from 1897 to 1998 that the bootstrap $p$-value is significant at standard critical levels, the more recent data sample is likely to produce the most reliable results since they are based on futures contracts that actually traded in public markets.

5. Conclusion

This paper has proposed a simple recursive bootstrap methodology that allows researchers to update their $p$-values of a shared null hypothesis once new forecasting models are fitted to an existing data set. Given that the data are shared along with the specification and the estimation method for the benchmark model, only (i) the start and end observation indexes for the out-of-sample window ($R$ and $T$), (ii) the number of bootstrap resamples ($B$) and the bootstrap smoothing parameter ($q$), (iii) the random number generator and its seed, and (iv) the best sample and resample performance values $\bar{V}_{l-1}$ and $\bar{V}^*_{l-1,i}$, $i = 1, \ldots, B$, need to be communicated between researchers.
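To make the inventory concrete, items (i)–(iv) fit in a small container; the sketch below is our illustration, not a published interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RealityCheckState:
    """The entire state one research group hands to the next."""
    R: int               # (i) first out-of-sample observation index
    T: int               # (i) last out-of-sample observation index
    B: int               # (ii) number of bootstrap resamples
    q: float             # (ii) stationary bootstrap smoothing parameter
    rng_name: str        # (iii) random number generator ...
    seed: int            # (iii) ... and its seed
    V: float             # (iv) best sample performance statistic so far
    V_star: np.ndarray   # (iv) the B best resampled performance statistics

state = RealityCheckState(R=100, T=1099, B=500, q=0.9,
                          rng_name="PCG64", seed=12345,
                          V=1.8, V_star=np.zeros(500))
```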


Remarkably, details of the specific models investigated do not need to be shared, so that scientific progress is possible using shared data without full disclosure. Thus, progress can be made even when details of earlier forecasting experiments have been lost or when commercial interest dictates that specific forecasting models remain proprietary or confidential. The procedure is thus ideally suited to accumulating empirical evidence on issues of common interest to an entire community of researchers separated in space and time.

Acknowledgements

The authors acknowledge research support provided by QuantMetrics R&D Associates, LLC. The methods described in this paper are covered by U.S. Patent 5,893,069.

References

Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society (A), 158, 419–466.
Corradi, V., Swanson, N., & Olivetti, C. (2001). Predictive ability with cointegrated variables. Journal of Econometrics, 104, 315–358.
Diebold, F., & Mariano, R. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13, 253–265.
Hamilton, J. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.
Lakonishok, J., & Smidt, S. (1988). Are seasonal anomalies real? A ninety-year perspective. Review of Financial Studies, 1, 403–425.
Lo, A. W., & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. The Review of Financial Studies, 3, 431–467.
Merton, R. (1987). On the state of the efficient market hypothesis in financial economics. In Dornbusch, R., Fisher, S., & Bossons, J. (Eds.), Macroeconomics and finance: essays in honor of Franco Modigliani. Cambridge, MA: MIT Press, pp. 93–124.
Miller, A. (1990). Subset selection in regression. London: Chapman and Hall.
Mosteller, F., & Chalmers, T. (1992). Some progress and problems in meta-analysis of clinical trials. Statistical Science, 7, 227–236.
Newey, W., & West, K. (1987). A simple positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55, 703–708.
Politis, D., & Romano, J. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89, 1303–1313.
Sullivan, R., Timmermann, A., & White, H. (1999). Data-snooping, technical trading rule performance and the bootstrap. Journal of Finance, 54, 1647–1692.
Sullivan, R., Timmermann, A., & White, H. (2001). Dangers of data-mining: the case of calendar effects in stock returns. Journal of Econometrics, 105, 249–286.
West, K. (1996). Asymptotic inference about predictive ability. Econometrica, 64, 1067–1084.
White, H. (2000). A reality check for data snooping. Econometrica, 68, 1097–1126.

Biographies:

Ryan SULLIVAN is a Manager with Bates White & Ballentine, a national economics consulting firm with headquarters in Washington, DC. Dr. Sullivan holds a Ph.D. from the University of California, San Diego. He performs quantitative analyses and economic studies in litigation and industry consulting environments.

Allan TIMMERMANN is Professor of Economics at the University of California, San Diego and a fellow of the Center for Economic Policy Research in London. Professor Timmermann holds a PhD from the University of Cambridge. He has published in the fields of econometrics, finance and forecasting. He is an associate editor of Journal of Business and Economic Statistics and a department editor of Journal of Forecasting.

Halbert WHITE is Professor of Economics at the University of California, San Diego. He is a Fellow of the Econometric Society, a Guggenheim Fellow, a Fellow of the American Academy of Arts and Sciences, and a member of the International Institute of Forecasters. His research interests include econometrics, forecasting, and financial markets.