Behaviorial Finance.pdf

8/17/2019 Behaviorial Finance.pdf

1/30

Can internet search queries help to predict

stock market volatility?∗

Thomas Dimpfl and Stephan Jank∗∗

May 14, 2012

Abstract

This paper studies the dynamics of stock market volatility and retail investors’attention to the stock market, where attention to the stock market is measured byinternet search queries related to the leading stock market index. We find a strongco-movement of the Dow Jones’ realized volatility and the volume of search queriesfor its name. Furthermore, search queries Granger cause volatility: a heightenednumber of searches today is followed by an increase in volatility tomorrow. We utilizethis finding to improve several models of realized volatility. Including search queriesin autoregressive models of realized volatility helps to improve volatility forecastsin-sample and out-of-sample as well as for different forecasting horizons. Searchqueries are particularly useful in high-volatility phases when a precise prediction isvital.

Key words: realized volatility, forecasting, investor behavior,limited attention, noise trader, search engine data

JEL: G10, G14, G17

∗We thank Google for making their search volume data publicly available through Google Trends .

We would like to thank Jeremy Ginsberg, Joachim Grammig, Isaac Hacamo, Terrance Odean, FranziskaPeter, Tomas Reyes, Maik Schmeling and Hal Varian for helpful comments and suggestions. We retainresponsibility for all remaining errors. Financial support of the German Research Foundation (DeutscheForschungsgemeinschaft, DFG) is gratefully acknowledged.

∗∗Thomas Dimpfl: University of Tübingen, Stephan Jank: University of Tübingen and Centre forFinancial Research (CFR), Cologne. Contact: University of Tübingen, Department of Economics andSocial Sciences, Mohlstr. 36, D-72074 Tübingen, Germany. E-mail: [email protected],[email protected] .


2/30

1 Introduction

Large stock market movements capture investors’ attention. This can be seen in Figure 1

which depicts a strong co-movement between the volatility of the Dow Jones stock market

index and internet search queries for its name. If volatility is low, searches related to theDow Jones are generally at their overall average level, but in turbulent times search

queries surge to levels which exceed their average by multiple times. For example, when

the volatility of the Dow Jones spiked at an almost record high of over 150% annualized

on October 10, 2008, the number of submitted searches for the index rose to more than

eleven times the average.

[Insert Figure 1 about here]

Internet search queries related to stock market index names can be interpreted as ameasure for retail investors’ attention to the stock market (cf. Da, Engelberg and Gao

2011). While professional investors monitor the leading index all the time, retail investors

presumably do not. Once the latter perceive an increased demand for information about

the stock index, they are likely to use the internet via a search engine like Google as a

source of information. Professional investors most probably, if not certainly, are not using

a search engine to obtain information about the leading stock market index. Thus, search

queries qualify as a good proxy for retail investors’ attention to the stock market. After

having searched for the stock market index, some individuals might be inclined to act and

trade at the same or the following day.

In this paper we study in detail the dynamics of searches for the Dow Jones and its

volatility. The key finding is that search queries today Granger cause volatility tomorrow.

We exploit this finding to improve volatility forecasts and augment various autoregressive

models of realized volatility with search query data. The forecasting precision can be

significantly enhanced when data on search queries enter the prediction equation. The

improvement is evident both for in-sample as well as for out-of-sample forecasts. The

longer the forecast horizon, the more efficiency gains are apparent. Furthermore, the data

on internet search queries help to predict volatility more accurately in periods of highvolatility, when a precise prediction is of utmost importance.

These findings contribute to our knowledge of stock market volatility and its long

memory characteristics, documented for example by Andersen and Bollerslev (1997). In

particular, the results are consistent with agent-based models of stock market volatility

(e.g. Lux and Marchesi 1999, Alfarano and Lux 2007). In the model by Lux and Marchesi

1


3/30

(1999), noise traders are seen as a source of additional volatility in the stock market. A

fundamental shock in volatility triggers noise trading which in turn induces an increase

of volatility. Taking internet searches as a measure of retail investors’ attention, we

observe patterns that are consistent with the agent and noise trader theory of stock market

volatility: high volatility entails high retail investor attention, which is in turn followed

by high volatility. Our results are also in line with recent empirical evidence by Foucault,

Sraer and Thesmar (2011), who – drawing on a natural experiment in France – find that

retail investors’ trading activity leads to a higher level of volatility in individual stocks.

Note that this paper does not focus on an exogenously evoked change in attention, which

for example is studied by Engelberg and Parsons (2011). In our analysis a fundamental

volatility shock induces an elevated interest of retail investors in the stock market and we

study how retail investors’ attention, along with possible trading influences and amplifies

volatility in the following.Our key findings are supported by an investigation of trading volume. In the agent-

based model of Lux and Marchesi (1999) trading activity by noise traders induces volatil-

ity. While we are unable to observe trading by this group directly, we document that

the overall trading volume of the stocks comprised in the Dow Jones rises subsequent to

increases in search queries for the index, which is consistent with the noise trader model.

In a joint model for volatility, trading volume and search queries, the latter turn out to

be a predictor for volatility while trading volume does not, which is in line with findings

in the literature (see, for example, Brooks 1998).

In a forecasting context, other studies have sucessfully used Google search volume data.

For example Ginsberg et al. (2009) use search query data to predict influenza epidemics

and Choi and Varian (2009a,b) employ Google search data to forecast unemployment

rates and retail sales. In the field of finance, search query data are used to measure retail

investor attention (Bank, Larch and Peter 2011, Da et al. 2011, Jacobs and Weber 2011,

Vlastakis and Markellos 2012) and to predict earnings (Da, Engelberg and Gao 2010a,

Drake, Roulstone and Thornock 2011). Da, Engelberg and Gao (2010b) use search queries

related to household concerns to measure investor sentiment.

2


4/30

2 Data and descriptive statistics

Our analysis focuses on the Dow Jones Industrial Average from July 1, 2006 to December

31, 2011. Intraday index prices are obtained from RC Research Price-Data.1 We construct

a time series of daily realized volatility RV t as introduced by Andersen, Bollerslev, Dieboldand Labys (2003) in the following way:

RV t =

n j=1

r2t,j, (1)

where r2t,j are squared intraday log-price changes of the index on day t during interval j

and n is the number of such intraday return intervals. We compute these price changes

over 10 minute intervals in order to circumvent the well-documented microstructure effects

(see e.g. Andersen et al. 2003, Andersen, Bollerslev and Meddahi 2011, Ghysels and Sinko

2011).

Descriptive statistics of Dow Jones realized volatility are presented in Panel A of

Table 1. As is evident from the skewness and kurtosis measures, the volatility time

series is heavily skewed and far from being normally distributed. We therefore resort

to the log of realized volatility as, amongst others, suggested by Andersen, Bollerslev,

Diebold and Ebens (2001) and Andersen et al. (2003). Even though normality of the data

is still rejected, the data are by far better behaved than before the transformation; in

particular, excess kurtosis is reduced significantly. The left graph in Figure 2 presents theautocorrelation function for realized volatility. The plot reveals the well-known pattern

of only slowly decaying autocorrelations (compare e.g. Andersen et al. 2001).

[Insert Table 1 and Figure 2 about here]

The data on Google search queries are obtained through Google Trends .2 We use daily

data on search volume from July 2006 to December 2011 for the keyword “dow”. The

volume measure is based on the number of searches which were submitted within the US.

We start our sample in the second half of 2006, since before July 2006 search volume dataat daily frequency exhibit numerous missing values. To match searches to the respective

1An earlier version of the paper also considered European stock market indices (the FTSE 100, theCAC 40 and the DAX). We find similar patterns of searches and realized volatilities for these indices,however, the present timing convention of Google Trends, which uses Pacific Standard Time for allcountries, makes the Google search query data less useful in a daily forecasting setting.

2Source: Google Trends (http://www.google.com/trends ).

3

http://www.google.com/trendshttp://www.google.com/trends


5/30

time series of realized volatility, we only consider trading days. Google Trends measures

searches for a particular day from 12 pm to 12 pm Pacific Standard Time.

The data which are provided by Google are relative in nature. This means that Google

does not provide the effective total number of searches, but a search volume index only.

We standardize the search queries such that the average search frequency over the sample

period equals one, allowing for an easy interpretation.

An important issue when measuring investors’ attention to the Dow Jones Industrial

Average index, is that it is known under a variety of names and acronyms. In this paper

we use the search term “dow” to measure retail investors’ attention to this index, since

we find that this short name is the most widely used search term when investors are

interested in the Dow Jones index.

[Insert Table 2 about here]

Table 2 provides an overview of search terms that have the highest correlation with the

term “dow” according to Google Correlate .3 The table also compares the level of search

volume relative to the search term “dow”. Almost perfectly correlated with the search

term “dow” is “dow jones”, but its volume is considerably less, amounting to 47.7% of

the search volume of “dow”. In general, search engine users seem to prefer shorter search

terms. The search term “dow jones industrial” is also highly correlated with “dow”, but

amounts only to around 9.9% of the search volume of “dow”. The ticker “djia” is also a

popular search term and its searches are around 18.4% compared to “dow”. Other searchterms such as “dow close”, “current dow”, “stock market now” are also highly correlated

with “dow”but their search volume is even less, i.e. below one percent compared to “dow”.

Search terms with low search volume present two problems: First, the search volume

index by Google Trends is less accurate due to the statistical methods used in constructing

the search volume index. Second, for some terms search traffic is too low, resulting in

missing values of the search volume index. Because of the high correlation of these search

terms, the other search terms do not add further information to the searches we use. For

this reason and to keep matters simple, we confine our analysis to the search term “dow”.

An alternative to using the Dow Jones would be the S&P 500, which is commonly

modeled in the realized volatility literature. However, the S&P 500 is less suited for

our purposes, because we find it to be less followed by retail investors compared to the

Dow Jones. In our sample period, the search term “dow” has been submitted to Google

3Source: Google Correlate (http://www.google.com/trends/correlate/ ).

4

http://www.google.com/trends/correlate/http://www.google.com/trends/correlate/


6/30

approximately ten times as often as the term “s&p 500”. Moreover, the acronym “S&P”

is generally used in the context of rating downgrades in our sample.

Panel A of Table 1 also holds summary statistics for the data on search queries. Just

as the realized volatility time series, the data on searches exhibit distinctive levels of

skewness and kurtosis. We therefore follow Da et al. (2011) by also taking logarithms of

the search data. This procedure reduces both skewness and excess kurtosis, however, it is

not as successful as in the case of realized volatility. The right graph in Figure 2 plots the

autocorrelations of Dow search queries. The autocorrelation function is decaying fairly

geometrically and much faster compared to autocorrelations of realized volatility depicted

in the left graph of Figure 2.

The data on trading volume of the Dow Jones are obtained from Datastream. It is

measured as the aggregate number of shares traded for the underlying stocks of the index

per day (in millions). Descriptive statistics are presented in Panel A of Table 1. We alsoresort to taking the logarithm of volume to reduce skewness and excess kurtosis.

As already apparent from Figure 1, search queries and realized volatility exhibit a

strong co-movement over time. The contemporaneous correlation of search queries for

“dow” and realized volatility of the Dow Jones in our sample is quite high as can be seen

from Panel B of Table 1. We find a correlation coefficient of 0.82 for the level time series

and 0.74 for the data in logarithms. Volume is correlated to a slightly lower extent with

search queries (0.51 in levels and 0.44 in logarithms) and realized volatility (0.57, both

for levels and logarithms).

3 The dynamics of volatility and searches

In the following we study the dynamics between realized volatility, search queries and

trading volume by estimating three vector autoregressive (VAR) models. Let xt be a

vector which contains the variables of interest, then the VAR(3) reads as follows:

xt = c +3

j=1

β jxt− j + εt , (2)

where c is a vector of constants and εt is a vector of white noise innovations. Model 1,

which is central to our analysis, investigates the dynamics between realized volatility and

search queries, i.e. xt = (log-RV t log-SQt). In Model 2, we will shed light on the

interaction between search queries and trading volume, i.e. xt = (log-SQt log-V Ot).

5


7/30

Finally, Model 3 is a joint model of realized volatility, search queries and trading volume

such that xt = (log-RV t log-SQt log-V Ot).

Table 3 holds the results of the three VAR models. Coefficient estimates are presented

in Panel A, the results of the Granger causality Lagrange multiplier (LM) tests are shown

in Panel B.


In Model 1 we find significant estimates of the autoregressive parameters for the real-

ized volatility for all included lags. Search queries show significant autoregressive terms

for lags 1 and 3. The estimation results also reveal that past volatility to some extent

positively influences present search queries. This effect is concentrated to the first lag

which is significant on the 5% level only while lags of order 2 and 3 are not found to

be significant. However, the Granger causality test reveals that realized volatility doesnot Granger cause search queries. Note that there is a high contemporaneous correlation

between volatility and searches indicating that searches immediately react to a volatility

shock at the same day.

The focus of our interest is how past search activity influences present volatility. We

find that search queries enter the RV-equation significantly at lag 1 while higher order

lags are insignificant. Furthermore, the Granger causality test indicates that past searches

provide significant information about future volatility: we reject the null hypothesis of no

influence of past log-SQ on current log-RV on all conventional significance levels. These

results support the noise trader model of volatility by Lux and Marchesi (1999). After an

initial shock in volatility, search intensity rises, which in turn is followed by heightened

levels of volatility.

As a further test of the model of Lux and Marchesi (1999), we analyze the interaction

of trading volume and search queries. Because trading volume of noise traders is not

directly observable, we turn to an indirect test using total trading volume. If the interest

of retail investors in the stock market manifests in trading activity, an increase in the

number of searches should be followed by rising levels of overall trading volume. We test

this hypothesis in Model 2. We find that volume has only marginal predictive power forsearch queries, in particular only the third lag is statistically significant. In contrast, search

queries of the previous day enter significantly in the volume equation which supports our

prediction that once information has been sought via Google, trading volume rises. The

Granger causality test also shows that trading volume and search queries are Granger

causal to each other.

6


8/30

The estimated coefficients of the joint model of realized volatility, search queries and

trading volume (Model 3) are similar to the estimates of the respective equations in

Models 1 and 2. Again, search queries enter the realized volatility equation significantly

on lag 1 (higher order lags are not significant) while volume does not contribute to the

explanation of the variation in realized volatility which is in line with Brooks (1998).

Moreover, realized volatility is only a weak predictor of search queries: only the first lag

of log-SQ is significant on a 10% significance level. In this model we find that search queries

do Granger cause realized volatility while volume does not. Both realized volatility and

volume Granger cause search queries and realized volatility and search queries Granger

cause volume.

Figure 3 provides the impulse response functions corresponding to the model of in-

terest, Model 1. For the calculation of impulse response functions, we use a Cholesky

decomposition with the economically meaningful restriction of volatility being contem-poraneously exogenous, i.e. volatility can affect search queries immediately, but search

queries do not contemporaneously affect volatility. The intuition behind this ordering is

that there is first a fundamental volatility shock that in turn triggers retail investor atten-

tion and, thus, search queries. Search queries, on the other hand, would not rise without

a preceding event on the market (see also the argumentation in Lux and Marchesi 1999).

[Insert Figure 3 about here]

The top left and the bottom right graphs in Figure 3 present the response of log-RV and log-SQ, respectively, to their own shocks. As becomes evident from the slowly

decaying function, these shocks are highly persistent, which corresponds to the slowly

decaying autocorrelation functions presented in Figure 2. The top right figure holds the

impulse response of search queries to a shock in realized volatility. After a volatility

shock, attention is elevated for a considerable amount of time which is reflected in the

slowly decaying impulse response function. The bottom left figure presents the response

of volatility to a shock in search queries. Most noteworthy, the impact of retail investors’

attention on volatility is long-lasting and only dies out after 40-60 trading days.

4 Can search queries improve volatility forecasts?

The key result of the VAR estimation is that search queries help to predict future volatility

in addition to the information contained in lagged volatility and that a shock in search

7


9/30

volume has an impact on volatility that persists for a considerable span of time. Now we

focus on the prediction of volatility, and thus investigate if and how searches can help to

improve volatility forecasts. In the following, we use different modeling approaches which

are commonly used to capture the time series properties of realized volatility and include

lagged search queries in each model. We evaluate the forecasting ability of these models

in- and out-of-sample, for multiple horizons as well as across different states of volatility.

For one-step ahead predictions an autoregressive model for realized volatility which

includes search queries as an explanatory variable would be sufficient. For multi-step

forecasts, however, we need to forecast log-SQ as well. For this reason we model the

time series properties of search queries along with realized volatility. We estimate vector

autoregressive models with different lag lengths which are specified as follows:

log-RV t = c1 +

p

j=1

β 1,jlog-RV t− j + γ 1,1log-SQt−1 + ε1,t (3a)

log-SQt = c2 + β 2,1log-RV t−1 +

q j=1

γ 2,jlog-SQt− j + ε2,t. (3b)

Building on the univariate autoregressive models of Andersen, Bollerslev, Christoffersen

and Diebold (2006) and Bollen and Inder (2002), we set the lag length for the realized

volatility equation (3a) to either p = 1 or p = 3. As the results of estimating the VAR

model in Equation (2) show no significant effect of searches on volatility for higher order

lags, we only include one lag of searches in Equation (3a). For the same reason weonly allow a one lag impact of realized volatility on search queries in Equation (3b). In

accordance to the autoregressive structure of the realized volatility equation, we set the

lag order q of the autoregressive terms to 1 if p = 1 and 3 if p = 3.

In addition to these ordinary autoregressive models we also consider Corsi’s (2009)

heterogeneous autoregressive (HAR) model. The HAR model has been found to capture

the long-memory properties of realized volatility very well and has recently been used

for example by Andersen, Bollerslev and Diebold (2007), Chen and Ghysels (2011) and

Chiriac and Voev (2011). Since we need a forecast of search queries for the multi-step

8


10/30

volatility prediction, we extend the HAR model to a Vector-HAR model which reads as

follows:

log-RV t = c1 + β dlog-RV t−1 + β wlog-RV wt−1 + β mlog-RV

mt−1 + γ 1,1log-SQt−1 + ε1,t (4a)

log-SQt = c2 + β 2,1log-RV t−1 +3

j=1

γ 2,jlog-SQt− j + ε2,t. (4b)

where log-RV wt = 1

5

4

j=0 log-RV t− j and log-RV mt =

1

22

21

j=0 log-RV t− j . The search

queries Equation (4b) is the same as Equation (3b) since we find that the time series

properties of searches are well described by three autoregressive terms and one lag of

realized volatility.

The models described above are contrasted to pure realized volatility models, i.e. an

AR(1) or AR(3) model in case of Equation (3) and a HAR(3) model in case of Equation (4).The univariate models of realized volatility are simply the Equations (3a) and (4a) with

γ 1,1 restricted to zero.

In order to assess the forecasting performance, we consider two loss functions which

are robust to possible noise in the volatility measure (see Patton 2011). These are the

mean squared error (MSE) and the quasi-likelihood (QL) for which the corresponding loss

functions are defined as follows:

MSE: L

RV t+1,

RV t+1|t

=

RV t+1 −

RV t+1|t

2, (5)

QL: L

RV t+1, RV t+1|t = RV t+1 RV t+1|t − log RV t+1 RV t+1|t − 1, (6)

where RV t+1|t is the respective forecast of realized volatility based upon informationavailable up to and including time t. We also use the R2 of a Mincer and Zarnowitz (1969;

hereafter MZ) regression of the actual realized volatilities on their predicted values:

RV t+1 = b0 + b1 RV t+1|t + et+1. (7)

9


11/30

Following the literature (e.g. Aı̈t-Sahalia and Mancini 2008, Andersen et al. 2003, Ghysels,

Santa-Clara and Valkanov 2006), we model log realized volatility, but evaluate the forecast

by comparing realized volatility and its prediction.4

4.1 In-sample forecast evaluation

Table 4 holds the results of the in-sample forecast evaluation of one-step ahead forecasts of

realized volatility. The models we consider are the univariate AR(1), AR(3) and HAR(3)

models and the respective augmented models including lagged search queries.


Looking only at the univariate models, we see that the AR(3) produces more precise

forecasts than the AR(1). The higher order lag structure leads to a reduction in MSE of more than 11%. The HAR(3) is the best amongst the univariate models. The MZ R2 is

0.96 percentage points higher than for the AR(3) and even 4.38 percentage points higher

than for the AR(1) model. These findings are in line with the literature on volatility

prediction, which shows that the heterogeneous autoregressive model, including lagged

weekly and monthly volatility, captures the long memory properties of realized volatility

best (cf. Corsi 2009).

Comparing the univariate models (AR(1), AR(3), HAR(3)) to the SQ-augmented mod-

els (AR(1)+SQ, AR(3)+SQ, HAR(3)+SQ), we observe for all models an improvement in

performance. The addition of search queries leads to a reduction of MSE between almost

5% in the AR(1) case and 3.4% in case of the HAR(3) model. Again, we find that overall

the HAR model augmented with search queries shows the best fit, i.e. we find the lowest

MSE, the lowest value of the QL loss function and the highest R2 of the MZ regression

(marked in bold in Table 4).

4.2 Out-of-sample forecast evaluation

We now turn to the out-of-sample forecasts and provide 1 day, 1 week and 2 weeks volatility

forecasts. Weekly and bi-weekly volatility is measured as the aggregated volatility over 5and 10 business days, respectively. For our initial out-of-sample forecast we estimate the

models using the first two years (500 trading days) of our sample, i.e. from July 2006 to

4When reversing the log transformation, the forecasts are formally not optimal (Granger and Newbold1976). However, Lütkepohl and Xu (2010) show by means of an extensive simulation study that this näıveforecast generally performs just as well as the formally optimal forecast.

10


12/30

June 2008. We then re-estimate the model for every subsequent day in the sample using

all past observations available, i.e. we increase the estimation window over time. The

estimation period of the very first run ends in June 2008. Thus, we are able to compare

the forecasting performance of volatility models during the almost record-high volatility of

October 2008. The initial two year estimation period is still long enough and has enough

variation in both volatility and search activity (cf. Figure 1) for a reliable estimation of

model parameters.


Results of the out-of-sample prediction are summarized in Table 5. For the univariate

models our results are consistent with the findings of Corsi (2009). The HAR(3) model is

better at predicting realized volatility compared to the AR(3) or AR(1) model in terms of

MSE, MZ R2 and QL statistic. The advantage of the HAR modeling emerges particularly

when predicting volatility at longer horizons of one or two weeks. For the latter the

reduction of the MSE amounts to a considerable 25% as compared to the AR(3) model.

Turning to the multivariate models, we find that those which include searches as an

explanatory variable always outperform the univariate, pure realized volatility models.

This means that these models have a lower MSE, a lower value of the QL loss function

and a higher R2 in the Mincer-Zarnowitz regression. Adding searches is most beneficial for

longer-horizon forecasts: the MZ R2 is by 3.1 percentage points higher in the multivariate

VHAR(3) than in the univariate HAR(3). In case of the AR-models, this difference caneven be larger.

Overall, the best performing univariate model for realized volatility is the HAR model.

Augmenting the HAR model with search query data further improves the forecasting

performance, in particular at longer horizons. The intuition behind is the following:

the VHAR model benefits from modeling the dynamics of retail investors’ searches and

volatility. The VHAR gains from the fact that a shock in searches has a significant

impact on volatility that is persistent as can be seen from the impulse-response function

of Figure 3. Thus, searches can improve long-run predictions. Furthermore, search queries

are well described by the autoregressive time series model allowing for good predictions

of searches when the system is iterated forward.

11


13/30

4.3 Forecast evaluation across different times of volatility

Precise volatility predictions are of particular importance during times of financial turmoil.

Thus, it is of great interest how the models perform in phases of high volatility compared to

calmer periods. In order to shed light on this question, we sort days according to realizedvolatility and calculate MSE and the QL loss function for several quantiles separately.

Our analysis focuses on the two best performing models, the univariate HAR model and

the VHAR model including search queries.


Table 6 presents the results of this sorting. It shows the MSE and QL loss function for

the days with the lowest volatility in the sample (Bottom 1%, 5% to 50%) as well as the

days with the highest volatility in the sample (Top 50%, 25% to 1%). Panel A displaysthe in-sample results, Panel B the out-of-sample results.

Not surprisingly, volatility prediction in calm times is very accurate. The in-sample

MSE for the 50% least volatile days is merely 0.027 for the univariate HAR model. The

MSE for the HAR model augmented with search queries is virtually the same with 0.026.

In general, the prediction accuracy of the two models does not differ much throughout the

bottom quintiles. Thus, the autoregressive model provides a good fit of volatility in calm

times and search queries do not add information about future volatility in these times.

Turning to the most volatile days, we see that prediction accuracy weakens for both

models but at the same time we observe that the model including searches increasingly

outperforms the univariate realized volatility model. The HAR model in-sample MSE for

the 50% most volatile days amounts to 0.267 and increases to 6.875 for the 1% most volatile

days. In high volatility times it becomes increasingly more difficult for the autoregressive

model to predict volatility. Also the MSE for the bivariate HAR model of realized volatility

and search queries increases in more volatile times, but to a lower extent. The MSE for

the model including searches is 6.396 compared to 6.875 for the univariate model, which

is a reduction of 7.0%. A similar picture can be seen for the QL loss function. For

the 1% most volatile days the QL loss function of the SQ-augmented model yields 33 .16compared to 36.36 for the univariate model, which is a reduction of 8.8%. The difference

in the bottom quintiles between the QL and MSE originates from the different weights

the loss functions puts on the prediction errors.

A similar pattern is found for the out-of-sample prediction. Results are presented in

Panel B of Table 6. Again, for the bottom quantiles the difference in MSE is negligible

12


14/30

while the QL loss function slightly favors the pure volatility model over the model which

includes search queries. In phases of high volatility, however, the model with searches

clearly outperforms the HAR(3). The performance increase is even more pronounced

than in the in-sample case: in the 1% highest volatility days, the prediction MSE can be

reduced by 13% if search queries are included. Also, the value of the QL is significantly

reduced by over 12%.

In summary, the cascading structure of the heterogeneous autoregressive model seems

to capture the long-memory properties of realized volatility very well. However, in a crisis

period (when volatility is high) retail investors’ attention is an important component and

predictor of volatility. If we interpret the HAR model as a model of agents with different

time horizons (namely daily, weekly and monthly) as suggested by Corsi (2009), we can

understand retail investors as a fourth investor group that adds to volatility in very tur-

bulent times.

4.4 Robustness of out-of-sample forecast

One possible concern about the practical usefulness of search data in predicting volatility

is the time of publication of the data. Up to now, Google Trends publishes search volume

daily with a lag of only one day. Even though lagged publication may reduce the fore-

casting power of search queries for volatility, the analysis of the forecasting relationship

presented so far helps us to gain a better understanding of the dynamics of retail investors’attention to the stock market and their role in stock market volatility. Furthermore, note

that technically it would be possible to publish search volume even faster. At present,

Google publishes the search volume for the fastest rising searches in the US every hour

through Google Hot Trends with only a few hours delay.5

Nevertheless, to account for the delay in the publication of search query data we also

look at an out-of-sample forecast based on search query data of two lags, i.e. volatility

tomorrow is predicted using volatility data until today and search data until yesterday.

Table A.1 in the Appendix presents the results of this robustness check. The general

results of Table 5 are confirmed: models with search query data generally outperform the

univariate realized volatility models, but (as expected) the forecasting power of search

queries is reduced. One exception is the VAR(3) model, where the MZ R2 is slightly lower

than that of the AR(3) model. However, the MSE and QL loss function again favor the

5Source: Google Hot Trends (http://www.google.com/trends/hottrends )

13

http://www.google.com/trends/hottrendshttp://www.google.com/trends/hottrends


15/30

model with search queries. Overall, the HAR model including search queries remains the

best performing model.

5 Concluding remarksInternet search data can describe the interest of individuals as recently advocated by Choi

and Varian (2009a) or Da et al. (2011). In this paper we use daily search query data to

measure the individuals’ interest in the aggregate stock market. We find that investors’

attention to the stock market rises in times of strong market movements. Moreover, a rise

in investors’ attention is followed by higher volatility. These findings are consistent with

agent-based models of volatility (Lux and Marchesi 1999, Alfarano and Lux 2007).

Exploiting the fact that search queries Granger cause volatility, we incorporate searches

in several prediction models for realized volatility. Augmenting these models with searchqueries leads to more precise in- and out-of-sample forecasts, in particular in the long

run and in high volatility phases. Thus, search queries constitute a valuable source of

information for predicting future volatility which can essentially be used in real time.

14


16/30

References

Äıt-Sahalia, Y. and Mancini, L.: 2008, Out of sample forecasts of quadratic variation,

Journal of Econometrics 147(1), 17–33.

Alfarano, S. and Lux, T.: 2007, A noise trader model as a generator of apparent financial

power laws and long memory, Macroeconomic Dynamics 11(Supplement S1), 80–101.

Andersen, T. G. and Bollerslev, T.: 1997, Heterogeneous Information Arrivals and Re-

turn Volatility Dynamics: Uncovering the Long-Run in High Frequency Returns, The

Journal of Finance 52(3), 975–1005.

Andersen, T. G., Bollerslev, T., Christoffersen, P. F. and Diebold, F. X.: 2006, Prac-

tical Volatility and Correlation Modeling for Financial Market Risk Management, in

M. Carey and R. M. Stulz (eds), The Risks of Financial Institutions , University of

Chicago Press, Chicago, Illinois, chapter 17, pp. 513–548.

Andersen, T. G., Bollerslev, T. and Diebold, F. X.: 2007, Roughing It Up: Including Jump

Components in the Measurement, Modeling, and Forecasting of Return Volatility, The

Review of Economics and Statistics 89(4), 701–720.

Andersen, T. G., Bollerslev, T., Diebold, F. X. and Ebens, H.: 2001, The distribution of

realized stock return volatility, Journal of Financial Economics 61(1), 43–76.

Andersen, T. G., Bollerslev, T., Diebold, F. X. and Labys, P.: 2003, Modeling and Fore-

casting Realized Volatility, Econometrica 71(2), 529–626.

Andersen, T. G., Bollerslev, T. and Meddahi, N.: 2011, Realized Volatility Forecasting

and Market Microstructure Noise, Journal of Econometrics 160(1), 220–234.

Bank, M., Larch, M. and Peter, G.: 2011, Google search volume and its influence on

liquidity and returns of German stocks, Financial Markets and Portfolio Management

25(3), 239–264.

Bollen, B. and Inder, B.: 2002, Estimating daily volatility in financial markets utilizing

intraday data, Journal of Empirical Finance 9(5), 551–562.

Brooks, C.: 1998, Predicting stock index volatility: can market volume help?, Journal of

Forecasting 17(1), 59–80.

15


17/30

Chen, X. and Ghysels, E.: 2011, News - Good or Bad - and its Impact on Volatility

Predictions over Multiple Horizons, Review of Financial Studies 24(1), 46–81.

Chiriac, R. and Voev, V.: 2011, Modelling and forecasting multivariate realized volatility,

Journal of Applied Econometrics 26(6), 922–947.

Choi, H. and Varian, H.: 2009a, Predicting initial claims for unemployment benefits,

Working Paper .

Choi, H. and Varian, H.: 2009b, Predicting the present with Google trends, Working

Paper pp. 1–23.

Corsi, F.: 2009, A Simple Approximate Long-Memory Model of Realized Volatility, Jour-

nal of Financial Econometrics 7(2), 174–196.

Da, Z., Engelberg, J. and Gao, P.: 2010a, In search of earnings predictability, Working

Paper .

Da, Z., Engelberg, J. and Gao, P.: 2010b, The Sum of All FEARS: Investor Sentiment

and Asset Prices, Working Paper .

Da, Z., Engelberg, J. and Gao, P.: 2011, In Search of Attention, The Journal of Finance

66(5), 1461–1499.

Drake, M., Roulstone, D. and Thornock, J.: 2011, Investor Information Demand: Evi-dence from Google Searches around Earnings Announcements, Working Paper .

Engelberg, J. and Parsons, C.: 2011, The Causal Impact of Media in Financial Markets,

The Journal of Finance 66(1), 67–97.

Foucault, T., Sraer, D. and Thesmar, D. J.: 2011, Individual Investors and Volatility, The

Journal of Finance 66(4), 1369–1406.

Ghysels, E., Santa-Clara, P. and Valkanov, R.: 2006, Predicting Volatility: Getting the

Most out of Return Data Sampled at Different Frequencies, Journal of Econometrics 131(1-2), 59–95.

Ghysels, E. and Sinko, A.: 2011, Volatility Forecasting and Microstructure Noise, Journal

of Econometrics 160(1), 257–271.

16


18/30

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S. and Bril-

liant, L.: 2009, Detecting influenza epidemics using search engine query data, Nature

457, 1012–1014.

Granger, C. W. J.: 1969, Investigating Causal Relations by Econometric Models andCross-spectral Methods, Econometrica 37(3), 424–438.

Granger, C. W. J. and Newbold, P.: 1976, Forecasting Transformed Series, Journal of the

Royal Statistical Society. Series B (Methodological) 38(2), 189–203.

Jacobs, H. and Weber, M.: 2011, The Trading Volume Impact of Local Bias: Evidence

from a Natural Experiment, Review of Finance (forthcoming).

Lütkepohl, H. and Xu, F.: 2010, The role of the log transformation in forecasting economic

variables, Empirical Economics pp. 1–20.

Lux, T. and Marchesi, M.: 1999, Scaling and criticality in a stochastic multi-agent model

of a financial market, Nature 397, 498–500.

Mincer, J. A. and Zarnowitz, V.: 1969, The Evaluation of Economic Forecasts, in J. A.

Mincer (ed.), Economic Forecasts and Expectations: Analysis of Forecasting Behavior

and Performance , Studies in Business Cycles, NBER.

Patton, A. J.: 2011, Volatility forecast comparison using imperfect volatility proxies,

Journal of Econometrics 160(1), 246–256.

Vlastakis, N. and Markellos, R. N.: 2012, Information demand and stock market volatility,

Journal of Banking and Finance 36(6), 1808 – 1821.

17


19/30

Tables and Figures

0

. 0 2

. 0 4

. 0 6

. 0 8

. 1

R e a l i z e d v o l a t i l i t y

2007 2008 2009 2010 2011 2012

Realized volatility of the Dow Jones

0

5

1 0

S e a r c h V o l u m e I n d e x

2007 2008 2009 2010 2011 2012

Search queries for the index name

Figure 1:

Realized volatility and search activityThis figure displays daily realized volatility of the Dow Jones (upper graphic) and the volumeof search queries (lower graphic) for its name from July 1, 2006 to December 31, 2011. Searchqueries are standardized such that the sample average equals one.

18


20/30

− 0 . 2

0

0 . 0

0

0 . 2

0

0 . 4

0

0 . 6

0

0 . 8

0

1 . 0

0

A u t o c o r r e l a t i o

n s

0 20 40 60 80Lag

Realized volatility

− 0 . 2

0

0 . 0

0

0 . 2

0

0 . 4

0

0 . 6

0

0 . 8

0

1 . 0

0

A u t o c o r r e l a t i o

n s

0 20 40 60 80Lag

Search queries

Figure 2:Autocorrelations of realized volatility and search queriesThis figure displays the autocorrelations of the Dow Jones’ realized volatility (left graph)

and search queries for its name (right graph) in the sample period. Shaded areas indicate95% confidence bounds.

19


21/30

0

. 1

. 2

. 3

0 20 40 60 80 100Days

Response of volatility to a shock in volatility

0

. 0 2

. 0 4

. 0 6

. 0 8

0 20 40 60 80 100Days

Response of searches to a shock in volatility

0

. 0 2

. 0 4

. 0 6

0 20 40 60 80 100Days

Response of volatility to a shock in searches

0

. 0 5

. 1

. 1 5

0 20 40 60 80 100Days

Response of searches to a shock in searches

Figure 3:Impulse response functionsThe table displays the impulse response functions of the VAR(3) of realized volatility andsearch queries (Table 3, Model 1). Shaded areas indicate 95% confidence bounds.

20


22/30

Table 1:Summary statistics and correlationsThis table provides descriptive statistics of realized volatility (RV), search queries (SQ) andtrading volume (VO) of the Dow Jones between July 1, 2006 and December 31, 2011. Theupper panel holds summary statistics while the lower panel presents sample correlations.

Panel A: Summary statistics

Mean S.D. Skewness Kurtosis Min. Max.

RV 0.009 0.007 3.90 30.75 0.002 0.096

log-RV -4.87 0.56 0.57 3.57 -6.38 -2.34

SQ 1.00 0.72 5.21 48.21 0.29 11.27

log-SQ -0.13 0.46 1.25 5.70 -1.23 2.42

VO 234.0 86.8 1.48 6.10 52.6 672.9

log-VO 5.39 0.34 0.16 3.82 3.96 6.51

Panel B: Correlations

(1) (2) (3) (4) (5) (6)

RV (1) 1

log-RV (2) 0.89 1

SQ (3) 0.82 0.67 1

log-SQ (4) 0.76 0.74 0.89 1

VO (5) 0.57 0.58 0.51 0.49 1

log-VO (6) 0.53 0.57 0.44 0.44 0.96 1

21


23/30

Table 2:Correlation of “dow” and other search termsThis table shows the top ten search terms that are correlated with the search term “dow”during the sample period July 2006 to December 2011 at weekly frequency. The second

column shows the correlation coefficient, which was obtained from Google Correlate . Thethird column shows the search volume scaled relative to the average search volume of “dow”,which is standardized to 100. For example, 46.67 in row 1 means that the search volume of “dow jones” relative to the search volume of “dow” is 46.67%. The search volume scale wasobtained from Google Trends and re-scaled for the sample size.

Rank Correlation Rel. Volume (%) Search term1 0.9981 46.67 dow jones2 0.9707 18.40 djia3 0.9672 6.13 dow stock4 0.9612 0.33 dow close5 0.9612 0.31 current dow

6 0.9591 9.91 dow jones industrial7 0.9565 0.16 current dow jones8 0.9545 0.39 google dow9 0.9481 0.04 stock market now

10 0.9476 9.61 industrial average

22


24/30

T a b l e 3 :

V A R

m o d e l e s t i m a t i o n r e s u l t s

T h i s t a b l e d i s p l a y s t h e e s t i m a t i o n r e s u l t s o f a V e c t o r A u t o r e g r e s s i v e m o d e l , V A R ( 3 ) ,

f o r l o g r e a l i z e d v o l a t i l i t y ( l o g - R V ) , l o g s e a r c h

q u e r i e s ( l o g - S Q ) a n d l o g t r a d i n g v o l u m e ( l o g - V O ) f o r t h e D o w J o n e s i n d e x . P a n e l A p r o v i d e s c o e ffi c i e n t e s t i m a t e s , P a n e l B

t h e r e s u l t s o f

a G r a n g e r c a u s a l i t y L a g r a n g e M u l t i p l i e r t e s t . p - v a l u e s a r e g i v e n i n

p a r e n t h e s e s . S i g n i fi c a n c e a t t h e 5 % l e v e l i s

h i g h l i g h t e d i n

b o l d f a c e .

P a n e l A : V A R

e s t i m a t i o n

M o d e l 1

M o d e l 2

M o d e l 3

l o g - R V

t

l o g - S Q

t

l o g - S Q

t

l o g - V O

t

l o g

- R V

t

l o g - S Q

t

l o g - V O

t

l o g - R V

t − 1

0 . 4

5

0 . 0

3

0

. 4 5

0 . 0 3

0 . 0

7

( 0 . 0 0 0 )

( 0 . 0 4 3 )

( 0 . 0 0 0 )

( 0 . 0 9 0 )

( 0 . 0 0 6 )

l o g - R V

t − 2

0 . 2

1

- 0 . 0 0

0

. 2 3

0 . 0 0

- 0 . 0 2

( 0 . 0 0 0 )

( 0 . 8 5 9 )

( 0 . 0 0 0 )

( 0 . 8 8 2 )

( 0 . 4 5 7 )

l o g - R V

t − 3

0 . 1

6

- 0 . 0 1

0

. 1 5

0 . 0 1

- 0 . 0 1

( 0 . 0 0 0 )

( 0 . 5 4 4 )

( 0 . 0 0 0 )

( 0 . 6 0 9 )

( 0 . 7 5 2 )

l o g - S Q

t − 1

0 . 2

3

0 . 8

1

0 . 8

2

0 . 1

7

0

. 2 2

0 . 8

0

0 . 1

4

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 0 0 2 )

l o g - S Q

t − 2

- 0 . 1 0

- 0 . 0 5

- 0 . 0 4

- 0 . 0 9

- 0 . 0 8

- 0 . 0 5

- 0 . 0 7

( 0 . 1 5 0 )

( 0 . 1 6 8 )

( 0 . 2 2 7 )

( 0 . 1 1 1 )

( 0 . 2 4 3 )

( 0 . 2 3 0 )

( 0 . 1 8 2 )

l o g - S Q

t − 3

- 0 . 0 1

0 . 1

7

0 . 1

9

- 0 . 0 6

- 0 . 0 2

0 . 1

7

- 0 . 0 6

( 0 . 9 0 9 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0 . 1 8 4 )

( 0 . 7 1 1 )

( 0 . 0 0 0 )

( 0 . 1 5 3 )

l o g - V O

t − 1

0 . 0 2

0 . 4

1

- 0 . 0 0

0 . 0 1

0 . 3

7

( 0 . 2 1 6 )

( 0 . 0 0 0 )

( 0 . 9 8 4 )

( 0 . 6 6 7 )

( 0 . 0 0 0 )

l o g - V O

t − 2

- 0 . 0 1

0 . 2

2

- 0 . 0 5

- 0 . 0 2

0 . 2

3

( 0 . 4 8 4 )

( 0 . 0 0 0 )

( 0 . 2 0 7 )

( 0 . 4 2 7 )

( 0 . 0 0 0 )

l o g - V O

t − 3

- 0 . 0

5

0 . 1

5

0

. 0 2

- 0 . 0

5

0 . 1

6

( 0 . 0 1 0 )

( 0 . 0 0 0 )

( 0 . 6 4 9 )

( 0 . 0 1 1 )

( 0 . 0 0 0 )

C o n s t a n t

- 0 . 8

6

0 . 0 9

0 . 2

1

1 . 2

0

- 0 . 6

2

0 . 5

3

1 . 5

3

( 0 . 0 0 0 )

( 0 . 1 6 4 )

( 0 . 0 1 8 )

( 0 . 0 0 0 )

( 0 . 0 1 7 )

( 0 . 0 0 0 )

( 0 . 0 0 0 )

23


25/30

T a b l e 3 - C o n t i n u e d

P a n e

l B : G r a n g e r c a u s a l i t y t e s

t

M o d e l 1

M o d e l 2

M o d e l 3

E x c l u d e d :

l o g - R V

t

l o g - S Q

t

l o g - S Q

t

l o g - V O

t

l o g - R V

t

l o g - S Q

t

l o g - V O

t

l o g - R V

5 . 2 0

8 . 7

8

9 . 1

2

( 0 . 1 5 8 )

( 0 . 0 3 2 )

( 0 . 0 2 8 )

l o g - S Q

3 1 . 6

1

1 9 . 7

8

3 0 . 6

5

1 0 . 8

0

( 0 . 0 0 0 )

( 0 . 0 0 0 )

( 0

. 0 0 0 )

( 0 . 0 1 3 )

l o g - V O

1 1 . 4

1

2 . 2 1

1 5 . 0

1

( 0 . 0 1 0 )

( 0

. 5 3 1 )

( 0 . 0 0 2 )

24


26/30

Table 4:In-sample forecast evaluationThe table compares the in-sample forecasts of the models described in the first column.AR(1), AR(3) and HAR(3) are univariate models of realized volatility only, AR(1)+SQ,AR(3)+SQ and HAR(3)+SQ are the models augmented with lagged search queries. Perfor-

mance measures are the mean squared error (MSE,×

10

4

), the quasi-likelihood loss function(QL, ×102) and the R2 (in percent) of the Mincer-Zarnowitz regression. The preferredmodel (minimum of QL loss function and MSE, maximum of R2) is indicated through boldnumbers.

Model: MSE QL R2

AR(1) 0.172 22.819 65.96

AR(1) + SQ 0.164 22.408 66.64

AR(3) 0.153 21.273 69.38

AR(3) + SQ 0.148 21.109 69.97HAR(3) 0.148 21.155 70.34

HAR(3) + SQ 0.143 21.071 71.11

25


27/30

T a b l e 5 :

O u t - o f - s a m

p l e f o r e c a s t e v a l u a t i o n

T h e t a b l e c o m

p a r e s t h e 1 d a y , 1 w e e k a n d 2 w e e

k s o u t - o f - s a m p l e f o r e c a s t s o f t h e m

o d e l s d e s c r i b e d i n t h e fi r s t c o l u m n

. A R ( 1 ) ,

A R ( 3 ) a n d H A

R ( 3 ) a r e u n i v a r i a t e m o d e l s o f r e a l i z e d v o l a t i l i t y o n l y , V A R ( 1 ) , V A R ( 3 ) a n d V H A R ( 3 ) a r e b i v a r i a t e m o d e l s o f r e a l i z e d

v o l a t i l i t y ( R V ) a n d s e a r c h q u e r i e s ( S Q ) . P e r f o r m a

n c e m e a s u r e s a r e t h e m e a n s q u a r e d e r r o r ( M S E , × 1 0 4 ) , t h e q u a s i - l i k e l i h o o d l o s s

f u n c t i o n ( Q L ,

× 1 0 2 ) a n d t h e R 2

( i n p e r c e n t ) o f

t h e M i n c e r - Z a r n o w i t z r e g r e s s i o n .

T h e p r e f e r r e d m o d e l ( m i n i m u m o

f Q L l o s s

f u n c t i o n a n d M

S E , m a x i m u m o f R 2 ) i s i n d i c a t e d

t h r o u g h b o l d n u m b e r s .

1 d a y

1 w e e k

2 w e e k s

M o d e l :

M S E

Q L

R 2

M S E

Q L

R 2

M S E

Q L

R

2

A R ( 1 )

R V

0 . 2 4 0

5 . 4 1 3

6 4 . 1 4

6 . 7 3 1

6 . 1 4 0

6

1 . 5 0

3 4 . 3 0 6

9 . 1 3 0

5 0

. 0 8

V A R ( 1 )

R V , S Q

0 . 2 2 4

4 . 8 0 7

6 4 . 7 1

4 . 8 9 6

4 . 6 9 8

6

3 . 7 2

2 4 . 0 6 3

6 . 4 9 8

5 5

. 0 4

A R ( 3 )

R V

0 . 2 1 0

4 . 5 2 7

6 7 . 8 0

4 . 3 6 7

3 . 9 4 9

6

9 . 5 6

2 1 . 1 2 9

5 . 3 3 9

6 2

. 7 5

V A R ( 3 )

R V , S Q

0 . 2 0 2

4 . 2 8 0

6 8 . 1 2

3 . 9 1 6

3 . 4 7 2

6

9 . 7 7

1 7 . 4 3 6

4 . 4 9 5

6 3

. 9 6

H A R ( 3 )

R V

0 . 1 9 8

4 . 3 2 4

6 9 . 0 7

3 . 6 5 5

3 . 4 5 0

7

2 . 0 7

1 5 . 6 9 9

4 . 1 9 8

6 7

. 7 4

V H A R (

3 )

R V , S Q

0 . 1

9 4

4 . 1

4 2

6 9 . 7

4

3 . 5

4 8

3 . 1

7 9 7

3 . 6

0

1 4 . 8

7 6

3 . 7

6 3

7 0

. 8 1

26


28/30

T a b l e 6 :

F o r e c a s t e v a l u a t i o n a c r o s s d i ff e r e n t t

i m e s o f v o l a t i l i t y

T h e t a b l e c o m

p a r e s t h e f o r e c a s t i n g p e r f o r m a n c e

o f t h e u n i v a r i a t e r e a l i z e d v o l a t i l i t y H A R ( 3 ) m o d e l t o t h e H A R ( 3 ) o

f r e a l i z e d

v o l a t i l i t y ( R V ) a n d s e a r c h q u e r i e s ( S Q ) a c r o s s s t a t e s o f d i ff e r e n t v o l a t i l i t y . W e s o r t a l l d a y s i n t o t h e b o t t o m a n d t o

p 1 % 5 %

t o 5 0 % o f v o l a t i l i t y a n d c a l c u l a t e m e a n s q u a r e d

e r r o r ( M S E , × 1 0 4 ) a n d t h e q u a s i - l i k e l i h o o d l o s s f u n c t i o n ( Q L , × 1 0 2 ) f o r e a c h

p e r c e n t i l e .

P a n e l A : I n - s a m p l e p r e d i c t i o n

R e a l i z e d V o l a t

i l i t y

B o t t o m

T o p

1 %

5 %

1 0 %

2 5 %

5 0 %

5 0 %

2 5 %

1 0 %

5 %

1 %

M S E :

H A R ( 3 )

0 . 0 4 0

0 . 0 2 4

0 . 0 2 1

0 . 0 2 0

0 . 0 2 7

0 . 2 6 7

0 . 4 8 7

1 . 0 4 7

1 . 8 7 0

6 . 8 7 5

M S E :

H A R ( 3 ) +

S Q

0 . 0 3 8

0 . 0 2 5

0 . 0 2 1

0 . 0 1 9

0 . 0 2 6

0 . 2 5 8

0 . 4 6 9

0 . 9 9 5

1 . 7 4 8

6 . 3 9 6

D i ff e r e n c e i n M S E

0 . 0 0 2

- 0 . 0 0 1

0 . 0 0 0

0 . 0 0 1

0 . 0 0 1

0 . 0 1 0

0 . 0 1 8

0 . 0 5 2

0 . 1 2 2

0 . 4 7 9

Q L

H A R ( 3 )

1 5 . 2 3

7 . 8 4

5 . 7 0

4 . 0 7

3 . 5 4

5

. 5 1

7 . 5 3

1 1 . 0 6

1 3 . 7 3

3 6

. 3 6

Q L

H A R ( 3 ) +

S Q

1 5 . 0 7

8 . 0 1

5 . 7 9

4 . 0 3

3 . 4 4

5

. 4 4

7 . 4 2

1 0 . 5 2

1 2 . 4 2

3 3

. 1 6

D i ff e r e n c e i n Q L

0 . 1 6

- 0 . 1 6

- 0 . 0 9

0 . 0 4

0 . 0 9

0

. 0 7

0 . 1 1

0 . 5 4

1 . 3 1

3

. 2 0

27


29/30

T a b l e 6 - C o n t i n u e d

P a n e l

B : O u t - o f - s a m p l e p r e d i c t i o n

R e a l i z e d V o

l a t i l i t y

B o t t o m

T o p

1 %

5

%

1 0 %

2 5 %

5 0 %

5 0 %

2 5 %

1 0 %

5 %

1 %

M S E :

H A R

( 3 )

0 . 0 1 9

0 . 0

2 4

0 . 0 2 2

0 . 0 2 3

0 . 0 2 7

0 . 1 5 1

0 . 2 5 5

0 . 5 1 3

0 . 8 6 7

2 . 7 9 9

M S E :

V H A

R ( 3 ) ( R V , S Q )

0 . 0 2 2

0 . 0

2 8

0 . 0 2 5

0 . 0 2 5

0 . 0 2 8

0 . 1 4 2

0 . 2 3 6

0 . 4 6 8

0 . 7 8 6

2 . 4 4 1

D i ff e r e n c e i n

M S E

- 0 . 0 0 3

- 0 . 0

0 3

- 0 . 0 0 3

- 0 . 0 0 2

- 0 . 0 0 1

0 . 0 0 9

0 . 0 1 9

0 . 0 4 5

0 . 0 8 2

0 . 3 5 8

Q L

H A R

( 3 )

7 . 9 9

6 .

8 5

5 . 3 6

4 . 3 0

3 . 5 8

5 . 9 3

8 . 2 0

1 2 . 3 0

1 8 . 1 3

5 2 . 5 6

Q L

V H A

R ( 3 ) ( R V , S Q )

9 . 2 1

7 .

7 1

6 . 0 0

4 . 6 3

3 . 6 6

5 . 4 9

7 . 5 9

1 1 . 3 0

1 6 . 5 0

4 6 . 0 9

D i ff e r e n c e i n

Q L

- 1 . 2 2

- 0 .

8 5

- 0 . 6 4

- 0 . 3 3

- 0 . 0 7

0 . 4 4

0 . 6 1

1 . 0 1

1 . 6 3

6 . 4 7

28


30/30

A Appendix

Table A.1:Robustness of the out-of-sample forecasts: Accounting for the de-

layed publication of search query dataThe table repeats the 1 day out-of-sample volatility forecast presented in Table 5 usingsearch query data of two lags, i.e. volatility tomorrow is predicted with volatility data untiltoday and search data until yesterday to account for delays in the publication of search querydata. Performance measures are the mean squared error (MSE,×104), the quasi-likelihoodloss function (QL, ×102) and the R2 (in percent) of the Mincer-Zarnowitz regression. Thepreferred model (minimum of QL loss function and MSE, maximum of R2) is indicatedthrough bold numbers.

Model: MSE QL R2

AR(1) RV 0.240 5.413 64.14VAR(1) RV, SQ 0.214 4.745 66.23

AR(3) RV 0.210 4.527 67.80

VAR(3) RV, SQ 0.206 4.397 67.73

HAR(3) RV 0.198 4.324 69.07

VHAR(3) RV, SQ 0.194 4.223 69.43

29

Documents

Behaviorial Finance.pdf