Nowcasting Indian GDP with Google Search data

Eric Qian

March 19, 2018

Abstract

Gross domestic product is an important indicator for measuring overall economic

health, and accurate forecasts are essential in crafting monetary and economic policy.

Developing economies are constrained by the availability and quality of economic time

series data. With increasing global Internet penetration, search data is an appealing

source for supplementing traditional economic data. Using India as a case study, this

paper uses a dynamic factor model with a Kalman filter to generate predictions

for GDP. This paper has two contributions to the Indian context. First, a multifactor

model yielded better predictions than a single factor. Second, incorporating Google

Search data did not improve prediction accuracy.

1 Introduction

Monetary and economic policy rely heavily on accurate and timely information. However, since data releases are often delayed, current values are often not known until later months. This produces a need for short-run predictions. Academics and central banks often use an approach called “nowcasting,” which gives a framework for forecasting the past, present, and near future. This approach is especially important in predicting low-frequency macroeconomic variables with long publication lags, and is often used for producing forecasts for GDP. A key advantage of this approach is the ability to incorporate and interpret data releases in real time. Much of the work is based on the model described in Banbura, Giannone and Reichlin (2011).

The growth in search engines and data collection has made Internet data an attractive source for supplementing traditional economic time series. As highlighted in the seminal work by Choi and Varian (2012), aggregated Google search data provides a timely and unfiltered measure for people’s preferences. Data are available in near real-time, and people are unlikely to filter their search queries. These features are especially attractive for macroeconomic prediction. As described in Vosen and Schmidt (2011), Google Search can be used to measure economic sentiment ahead of traditional economic surveys.

Its direct connection to consumer sentiment makes Search data particularly appealing for forecasting prices (Pan, Wu and Song, 2012; Goldfarb, Greenstein and Tucker, 2015). More generally, Google data has also been used to study the Great Recession (Suhoy, 2009; Schmidt and Vosen, 2012). Despite its advantages, Search has more limited power when data quality is high and publication delays are short. For example, Li (2016) found that in forecasting US initial jobless claims, incorporating Search data did not improve forecasts. The comparative advantage of using search data is lower in contexts where conventional data sources sufficiently capture consumer sentiment.

With the constraints in data timeliness and quality, there is promise in using Internet

data for emerging markets. Existing data sources are often not published at the same

standards as in developed economies. Swallow and Labbe (2010) and Seabold and Coppola

(2015) showcase the potential for Internet data in improving predictions by supplementing

traditional sources. In places where survey data are unavailable, search data can act as

a substitute for capturing consumer sentiment. This advantage has become more salient

over time, as mobile phone usage and Internet penetration have increased (World Bank,

2017).

The goal of this paper is to apply the nowcasting framework to the development context and to study whether Internet data can be used as a solution for the imperfect data problem. In particular, I aim to forecast Real Gross Domestic Product for India. India gives a case study for creating predictions under imperfect conditions. Compared to developed economies, fewer economic time series are available, and existing data are often lower quality and more volatile. Moreover, high Internet penetration gives conditions where Google Search data can quantify consumer sentiment.

2 Data

This paper incorporates two main sources of data. I first describe the traditional data

sources, which are series typically used in economic forecasting applications. I then discuss

the aggregated Google Search data.

2.1 Traditional data sources

The dataset containing traditional data sources was collected from Bloomberg and contains monthly time series data from January 2000 to August 2017. The series were selected to give a full representation of the Indian economy and to mimic how market participants respond to information. The choice of series is partially taken from Bragoli and Fosten (2017) and Giannone, Agrippino and Modugno (2013a), which examine Indian and Chinese GDP respectively. Because of the relative scarcity of qualitative sentiment data, both papers primarily use hard data. While these series are less noisy, they are subject to greater publication lag.

Unlike in developed economies, data quality is a substantial concern in the Indian context. Even though the service sector represents approximately 58% of the economy, activity is undercounted in the economic census and through the unorganized sector (Bardhan, 2013). Moreover, due to issues with survey coverage, price-indexing is often applied inconsistently, with underrepresentation of low income groups. As a result, forecasting is a greater challenge in this context. Economic time series information is less consistent and noisier, so there is greater variation in the predictions themselves.

GDP in particular is subject to some uncertainty. Similar to other series, inputs to

publishing official GDP figures by the Central Statistics Organization (CSO) have quality

problems. As a result, published figures are subject to substantial data revisions. In some

cases, revisions between initial and final yearly growth rates were up to 2.5% (Sapre and

Sengupta, 2017). As a result, prediction values are also noisier.

The series fall into two primary categories: hard data and financial data. The hard data are quantitative economic indicators. These series are meant to capture economic activity in India broadly. Several series capture domestic production (India Industrial Production, Crude Oil Production, Steel Production). Considering the connectedness of the broader Indian economy, several of the series focus on trade (Indian imports and Indian exports). Similarly, several international series are included (US Industrial Production, European Industrial Production). Financial variables are available at a higher frequency (some are available at the hourly level). For this paper, these are aggregated to the monthly level. Similar to the hard indicators, these are intended to give a broad economic view. BSE 30 and Asia Sentix are indicators of the Indian and Asian markets respectively. The exchange rate gives a hard indicator for global trade.

Figure 1: GDP plot: Raw and transformed series. Top panel: Indian GDP in chained (2000) rupees (trillions) by date, 2000 to 2015. Bottom panel: year-over-year percent change by date, 2000 to 2015.

A few considerations were made for the data to enter the model. First, the data were transformed to ensure stationarity just as in Bragoli and Fosten (2017). The GDP series found in Figure 1 illustrates a sample transformation. The top graph shows the untransformed series with a clear positive trend. The bottom shows the result after applying a year-over-year transformation. We see that the positive trend disappears.
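The year-over-year transformation can be illustrated with a minimal sketch. The function name `yoy_change` and the toy quarterly series below are hypothetical and are not taken from the paper's dataset; the sketch only shows why comparing each observation with the same period one year earlier removes both a smooth trend in levels and a fixed seasonal pattern.

```python
def yoy_change(values, periods_per_year):
    """Year-over-year fractional change; None where no comparison exists."""
    out = []
    for t, current in enumerate(values):
        if t < periods_per_year or values[t - periods_per_year] == 0:
            out.append(None)  # first year, or division by zero
        else:
            out.append(current / values[t - periods_per_year] - 1.0)
    return out


# Hypothetical quarterly levels: 2% growth per quarter times a fixed
# seasonal pattern (illustrative numbers only).
seasonal = [1.00, 0.90, 1.10, 1.05]
levels = [100.0 * (1.02 ** t) * seasonal[t % 4] for t in range(12)]

growth = yoy_change(levels, periods_per_year=4)
for t, g in enumerate(growth):
    print(t, "n/a" if g is None else round(g, 4))
```

Because the seasonal multiplier is identical in the numerator and denominator, it cancels, and the transformed toy series is constant even though the levels trend upward.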

Using year-over-year changes differs from some of the nowcasting literature (Li, 2016; Banbura, Giannone and Reichlin, 2011; Giannone, Reichlin and Small, 2008). While many papers focus on developed economies with high quality timely data, the developing context differs. Most time series lack seasonal adjustment, so including raw series in the model would primarily capture seasonal variation rather than underlying trends. Taking a yearly view addresses these concerns, and has previously been used in the development context (Swallow and Labbe, 2010; Giannone, Agrippino and Modugno, 2013a). While some papers have adjusted for seasonality upon input (such as implementing the US Census Bureau’s X-13ARIMA-SEATS seasonal adjustment approach), this approach is less salient for the GDP prediction, as two-sided filters are less context appropriate (Swallow and Labbe, 2010; Vosen and Schmidt, 2011; Li, 2016).

Table 1: Series, transformations, and loadings

Name                                           Mnemonic       Transformation  Frequency  Common  Hard  Financial
India Industrial Production                    INPIINDU       pcy             M          1       1     0
India Exports                                  INMTEXU        pcy             M          1       1     0
India Imports                                  INMTIMU        pcy             M          1       1     0
India PMI Services                             MPMIINSA       pcy             M          1       1     0
India PMI Manufacturing                        MPMIINMA       pcy             M          1       1     0
India Crude Oil Production                     INRGGTGT       pcy             M          1       1     0
India Steel Production                         INFRSTEL       pcy             M          1       1     0
India Electricity Generation                   INFRELEC       pcy             M          1       0     1
Indian Money Supply (M1)                       INMSM1         pcy             M          1       0     1
India Exchange Rate (Rupee/US$, EOP)           USDINR         pcy             M          1       0     1
India 91-Day Treasury Bill                     IYTB3M         pcy             M          1       0     1
India Stock Prices Sensex/BSE 30               SENSEX         pcy             M          1       0     1
India Industrial Workers Consumer Price        INCPIIND       pcy             M          1       1     0
India Agricultural Laborers Consumer Price     INCPAGLB       pcy             M          1       1     0
India Rural Laborers Consumer Price            INCPRULB       pcy             M          1       1     0
India Consumer Price                           INFUTOT        pcy             M          1       1     0
India Wholesale Price Index (All)              INFINF         pcy             M          1       1     0
India Wholesale Price Index (Fuel Power)       INFIFUE        pcy             M          1       1     0
US Industrial Production Indx                  IP             pcy             M          1       1     0
US ISM Manufacturing PMI                       NAPMPMI        lvl             M          1       1     0
Euro 19 Industrial Production                  EUITEMUM       lvl             M          1       1     0
Asia (ex Japan) Sentix overall Index           SNTEASGX       pcy             M          1       0     1
Current Price Gross Domestic Product in India  INDGDPNQDSMEI  pcy             Q          1       1     0

2.2 Google Search data

I include Google Search data, which are publicly available through the Google Trends

interface (see footnote 1). Each series gives relative search volume for a particular term, where values

are scaled so that the maximum search volume is 100. As a consequence, for some volatile

terms, the lowest level is 0 (even when raw search volume itself is nonzero). Users can

adjust search scope — globally, by country, or by state — and series frequency — hourly,

daily, weekly, and monthly. For this study, I examine search volume in India by monthly

frequency.
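The scaling described above can be sketched as follows. The function `to_trends_index` and the raw counts are hypothetical, and rounding to whole numbers is an assumption about how the index is reported rather than something stated in this paper. The sketch shows how a nonzero raw volume can appear as 0 once the series is rescaled so that its peak equals 100.

```python
def to_trends_index(raw_volumes):
    """Rescale so the peak equals 100, then round to whole numbers.

    Rounding to integers is an assumption made for illustration.
    """
    peak = max(raw_volumes)
    return [round(100 * v / peak) for v in raw_volumes]


# Hypothetical raw monthly search counts for one volatile term.
raw = [40, 12000, 360, 9000]
index = to_trends_index(raw)
print(index)
```

In this toy series the first month has 40 searches, yet its scaled value falls below 0.5 and is reported as 0.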

Its relatively high usage in India allows Google Search to be used to approximate consumer sentiment. Figure 2 gives Google’s search market share in India over time. By traffic volume, Google commanded 96-97% of the market from 2009 to 2018 (StatCounter, 2018). One explanation for Google’s spread is English’s widespread role, especially among younger people. While other languages are very regional (e.g. Hindi in the north, Bengali in the east, Telugu in the south, Marathi in the west), English plays a large role in schools nationally as the country’s second official national language (Census of India, 2001). This gives Google an advantage over regional-language search variants, and provides a means to measure preferences nationally.

Several studies have used Search data to capture consumer sentiment. In this context,

search data acts as a signal for people’s unfiltered expectations. The advantage to this

approach is in timeliness; search data are available in real-time.

The terms are grouped into three categories: economy, jobs, and household. These are intended to holistically capture people’s feelings about the economy as a whole, similar to the approach found in Seabold and Coppola (2015). Search terms related to the economy help indicate economic anxiety generally, where people search terms like “recession” and “petrol prices” when nervous about the future. Similarly, in the “jobs” category, people are likely to search for terms such as “salary” and “raise” for similar reasons. Finally, the “household” category is intended to capture consumer aspirations, with searches such as “cars” appearing during more positive economic outlooks.

Footnote 1: https://trends.google.com/trends/

Figure 2: Google Search has maintained high market share over time. Panel title: Indian search engine market share. Axes: market share (%) by year, 2010 to 2018. Series: Bing, Google, Other.

Over its history, Google has provided varying degrees of language support for India. Despite the increase in support, traffic volume is largely in English. Figure 3 gives an illustration. Over time, search volume for the English term “price” is several times greater than that of its Hindi counterpart, mulya.

There are several approaches used for incorporating Google Trends data into the forecasting framework. The primary hurdle is extracting signal from many time series. Several papers suggest creating composite indices from the Search terms, effectively reducing multiple time series to a single indicator via least squares (Vosen and Schmidt, 2011; Seabold and Coppola, 2015). An advantage of this approach is the ease with which the index can be incorporated directly into VAR models. Given the framework chosen here, factor models allow these series to be entered directly, just as in Li (2016). This approach is consistent with how the traditional data sources are incorporated. The factor model itself provides a form of dimension reduction.

Figure 3: Search volume for “price” is greater than that of mulya. Panel title: Relative search volume for 'price'. Axes: search volume (max = 100) by year, 2005 to 2015. Series: price (English), price (Hindi).

Table 2: Google Search terms and categories

Category   Terms
Economy    price, prices, cost, inflation, recession, petrol price
Jobs       salary, raise, unemployment, bankruptcy, insolvency, job, jobs, job vacancy, jobs in
Household  rent, rent in, job salary, cars, rent house

One shortcoming of Search data is the lack of seasonal adjustment. Consistent with the other economic time series, I transform the series using a year-over-year percent change transformation. Figure 4 displays a sample transformation. Qualitatively, the untransformed series appears to be increasing over time. After performing the transformation, the resulting series appears to be stationary.

Figure 4: Google Search: Example of year-over-year percent change transformation. Top panel: Google Search: "Salary", index (max = 100) by date, 2005 to 2015. Bottom panel: transformed series (% YoY) by date, 2005 to 2015.

3 Model

A dynamic factor model with Kalman filtering was used to model the series dynamics

(Banbura, Giannone and Reichlin, 2011; Giannone, Reichlin and Small, 2008). This setup is

common in the economic forecasting context, with past works focusing on different countries

(China, India, U.K., Eurozone, United States) and for different time series (Li, 2016; Bragoli

and Fosten, 2017; Swallow and Labbe, 2010; Giannone, Agrippino and Modugno, 2013b).

The model is driven by two relationships. First, the monthly series have the following factor structure:

$$x^M_t = \mu + \Lambda f_t + \varepsilon_t \qquad (1)$$

where $x^M_t$ gives the transformed monthly data series (yearly growth rate or differences), $\mu$ gives a constant, $f_t$ is an $r \times 1$ vector containing $r$ unobserved factors, and $\Lambda$ gives the factor coefficients. $\varepsilon_t$ gives the idiosyncratic error. Call this factor structure the observation equation. This approach is effectively a means of dimension reduction, as each data series can be decomposed into several common components. In practice, as a result of the transformations, the constant is $\mu = 0$.

Second, the factors themselves can be modeled as an AR(1) process where

$$f_t = A f_{t-1} + u_t \qquad (2)$$

for i.i.d. $u_t \sim N(0, Q)$. Matrix $A$ is $r \times r$ and contains autoregressive coefficients. This relationship is called the transition equation, which can be understood as determining the factor relationship across time.

Intuitively, for each period $t$, the value for each time series (e.g. Consumer Price Index) is determined by the value of the unobserved factors in $f_t$. The factor values change over time according to the dynamics of the transition equation.

The idiosyncratic component can also be modeled as an AR(1) process where

$$\varepsilon_{i,t} = \alpha_i \varepsilon_{i,t-1} + e_{i,t} \qquad (3)$$

for i.i.d. $e_{i,t} \sim N(0, \sigma_i^2)$.
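To make the three relationships concrete, the following minimal sketch simulates data from equations (1) to (3). It assumes a single factor ($r = 1$) and $\mu = 0$ for brevity, and every parameter value and name (`simulate_dfm`, the loadings, the persistence terms) is hypothetical rather than an estimate from this paper.

```python
import random


def simulate_dfm(loadings, a, alphas, sigmas, q, n_periods, seed=0):
    """Simulate a one-factor dynamic factor model with AR(1) errors."""
    rng = random.Random(seed)
    f = 0.0                       # factor state f_t
    eps = [0.0] * len(loadings)   # idiosyncratic states eps_{i,t}
    factors, data = [], []
    for _ in range(n_periods):
        # Transition equation (2): f_t = a * f_{t-1} + u_t, u_t ~ N(0, q)
        f = a * f + rng.gauss(0.0, q ** 0.5)
        # Idiosyncratic AR(1), equation (3)
        eps = [al * e + rng.gauss(0.0, s)
               for al, e, s in zip(alphas, eps, sigmas)]
        # Observation equation (1) with mu = 0
        x = [lam * f + e for lam, e in zip(loadings, eps)]
        factors.append(f)
        data.append(x)
    return factors, data


# Hypothetical parameters for three series driven by one common factor.
factors, data = simulate_dfm(
    loadings=[1.0, 0.6, -0.4], a=0.8,
    alphas=[0.3, 0.1, 0.5], sigmas=[0.2, 0.2, 0.2],
    q=1.0, n_periods=100,
)
print(len(factors), len(data[0]))
```

Setting all `sigmas` to zero removes the idiosyncratic noise, in which case each simulated series is an exact multiple of the factor.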

For this model, consistent with existing literature, we further impose structure onto the factors driven by economic theory (Giannone, Reichlin and Small, 2008; Giannone, Agrippino and Modugno, 2013a). In our case, we take the number of factors $r = 3$, where the factor vector can be written as $f_t = (f^G_t, f^H_t, f^F_t)$, representing the “global”, “hard”, and “financial” factors. Furthermore, the loadings matrix $\Lambda$ is structured so that the global factor loads onto each series. Depending on the type of data, each series also takes on either a hard ($f^H_t$) or financial ($f^F_t$) loading:

$$\Lambda = \begin{pmatrix} \Lambda_{H,G} & \Lambda_{H,H} & 0 \\ \Lambda_{F,G} & 0 & \Lambda_{F,F} \end{pmatrix}.$$

For the models incorporating Google Trends, four factors are used: “global”, “hard”, “financial”, and “soft”, where Trends data are assigned to the soft factor.

For the transition equation, factor dynamics are related to each factor’s own previous state, so

$$A_i = \begin{pmatrix} A_{i,G} & 0 & 0 \\ 0 & A_{i,H} & 0 \\ 0 & 0 & A_{i,F} \end{pmatrix}.$$
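The block structure of the loadings matrix amounts to a set of zero restrictions. As a minimal sketch, the hypothetical helper `loading_mask` below builds a 0/1 mask over the (global, hard, financial) factors from each series' type, where 1 marks a loading that is free to be estimated and 0 marks a loading fixed at zero. The helper and its inputs are illustrative and not the paper's code.

```python
FACTORS = ["global", "hard", "financial"]


def loading_mask(series_types):
    """Return one [global, hard, financial] 0/1 row per series."""
    mask = []
    for kind in series_types:
        if kind not in ("hard", "financial"):
            raise ValueError("unknown series type: " + kind)
        # Every series loads on the global factor and on its own block.
        mask.append([1, int(kind == "hard"), int(kind == "financial")])
    return mask


# Hypothetical ordering: two hard series followed by one financial series.
mask = loading_mask(["hard", "hard", "financial"])
for row in mask:
    print(dict(zip(FACTORS, row)))
```

Each row of the mask corresponds to the Common, Hard, and Financial indicator columns reported for a series in Table 1.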

In order to incorporate mixed frequency data, quarterly data can be modeled as the sum of monthly data. By convention, these series are placed at the end of the quarter while preserving monthly indexing $t$. Using the approach found in Mariano and Murasawa (2003), the quarterly series can be modeled as

$$y^Q_t = y_t + 2y_{t-1} + 3y_{t-2} + 2y_{t-3} + y_{t-4}.$$

With the monthly analogue, the quarterly series can enter the transition and observation equations using the same dynamics.
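The aggregation rule above can be checked with a short sketch. The function `quarterly_from_monthly` and the toy monthly values are hypothetical. The sketch also confirms that the weights (1, 2, 3, 2, 1) coincide with the coefficients obtained by multiplying the three-month sum (1, 1, 1) by itself, which is one way to see where the pattern comes from.

```python
WEIGHTS = [1, 2, 3, 2, 1]  # weights on y_t, y_{t-1}, ..., y_{t-4}


def quarterly_from_monthly(y, t):
    """Weighted sum y_t + 2y_{t-1} + 3y_{t-2} + 2y_{t-3} + y_{t-4}."""
    if t < 4:
        raise ValueError("need at least five monthly values up to t")
    return sum(w * y[t - k] for k, w in enumerate(WEIGHTS))


def convolve(a, b):
    """Coefficients of the product of two polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Hypothetical monthly values y_0, ..., y_4 (illustrative only).
monthly = [1.0, 2.0, 3.0, 4.0, 5.0]
print(quarterly_from_monthly(monthly, t=4))
print(convolve([1, 1, 1], [1, 1, 1]))
```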

The model is estimated using an expectation-maximization algorithm (Giannone, Reichlin and Small, 2008). This approach is often used in situations where missing observations make direct maximization of the likelihood function computationally intractable. Broadly, there are several steps. Factor estimates are first initialized using principal components. Then, these initial conditions enter the expectation-maximization loop. In the expectation step, the log-likelihood conditional on the data is computed based on values from the previous iteration. In the maximization step, new parameters are estimated by maximizing the expected log-likelihood. Expectation and maximization are looped until estimates converge, measured by the change in log-likelihood. Details on this procedure can be found in Doz, Giannone and Reichlin (2006) and Banbura and Modugno (2014).
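The full expectation-maximization procedure is beyond a short example, but the way a Kalman filter copes with missing observations can be sketched. The following is a scalar illustration only (one factor, one observed series), with hypothetical parameter values and a hypothetical function name; it is not the estimation routine used in this paper. When an observation is missing, the update step is skipped and the filter carries the prediction forward, so uncertainty grows until new data arrive.

```python
def kalman_filter(observations, a, q, lam, r, f0=0.0, p0=1.0):
    """Scalar Kalman filter for f_t = a f_{t-1} + u_t, x_t = lam f_t + e_t.

    q and r are the variances of u_t and e_t; None marks a missing x_t.
    Returns a list of (filtered mean, filtered variance) pairs.
    """
    f, p = f0, p0
    filtered = []
    for x in observations:
        # Prediction step from the transition equation.
        f = a * f
        p = a * p * a + q
        if x is not None:
            # Update step, applied only when data are available.
            s = lam * p * lam + r      # innovation variance
            k = p * lam / s            # Kalman gain
            f = f + k * (x - lam * f)
            p = (1.0 - k * lam) * p
        filtered.append((f, p))
    return filtered


# Hypothetical series with a missing value in the second period.
filtered = kalman_filter([1.0, None, 0.4], a=0.5, q=1.0, lam=1.0, r=1.0)
for mean, var in filtered:
    print(round(mean, 4), round(var, 4))
```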

Figure 5: Forecasting error by month number. Panel title: RMSFE by month (difference from first month). Axes: month (1 to 3) against RMSFE difference from the first month (scale ×10^-3, from -1 to 0).

4 Results

To evaluate model performance, I use a pseudo-data vintage approach. The goal is to simulate the real-world forecast creation process; as more information becomes available, the forecasts are updated. Each month, a prediction for current quarter GDP is created using information available at that month. Therefore, three predictions are generated for the current quarter GDP (month 1, month 2, and month 3). This simulates the publication delay for GDP figures, but abstracts away irregular release intervals (monthly series are not updated concurrently) and data revisions (estimates are often revised after initial estimates are published). While there is substantial literature taking data revisions and release schedules into account, this will not be the focus of this paper (Giannone, Reichlin and Small, 2008; Giannone, Agrippino and Modugno, 2013c).
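The evaluation metric for this exercise can be sketched as follows. The nowcasts and outcomes below are made-up numbers chosen purely for illustration and are not results from this paper; the sketch computes a root mean squared forecast error (RMSFE) separately for the predictions made in month 1, month 2, and month 3 of each quarter.

```python
import math


def rmsfe(predictions, actuals):
    """Root mean squared forecast error over paired values."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical year-over-year GDP growth for three quarters, and
# hypothetical nowcasts made in each month of those quarters.
actual = [0.070, 0.065, 0.080]
nowcasts = {
    1: [0.060, 0.075, 0.070],
    2: [0.065, 0.070, 0.075],
    3: [0.068, 0.066, 0.079],
}

by_month = {month: rmsfe(preds, actual) for month, preds in nowcasts.items()}
for month, error in sorted(by_month.items()):
    print(month, round(error, 4))
```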

Figure 5 illustrates improvements in model performance as more information becomes

available. On average, as time progresses, the forecasting error decreases. Month 1 has the

highest error, month 2 has lower error, and month 3 has the lowest error.

Figure 6 gives an illustration of the model’s mechanics. Each transformed series is standardized and plotted (the colored series) alongside the common factor (in black). Notice that the common factor captures some comovement for each of the series, which is especially apparent during and after the Great Recession. During the downturn, many economic indicators decreased in value, and afterwards, many rebounded. This figure does not give predictions: it illustrates that series have some degree of commonality.

Figure 6: Standardized data and common factor plot. Panel title: Standardized data and common factor. Axes: standardized values (from -12 to 6) by date, 2000 to 2018.

Figures 7 and 8 display the four prediction models over time. On inspection, the four approaches roughly track GDP throughout the period and perform best during low volatility intervals. All four models perform relatively well from 2006 to 2008 and from 2012 to 2015.

The four models share similar characteristics during high volatility periods. The models overpredict GDP during the beginning of the Great Recession and underpredict its recovery afterwards. Similarly, GDP is slightly overpredicted during the 2012 slowdown. Moreover, all four slightly underpredict GDP from 2016 onwards, despite this being a low volatility period. One potential reason is a change in methodology in price measurement. In 2016, the Central Statistics Office modified estimates for the Wholesale Price Index and the Index of Industrial Production, which are used to compute GDP in constant prices. These changes resulted in an upwards GDP revision of 0.5%. It is possible that the model provides predictions without accounting for changes in methodology.

Figure 7: Within quarter predictions (no search data). Panel title: Current quarter predictions (baseline). Axes: YoY GDP (from -0.02 to 0.16) by date, 2004 to 2014. Series: Data, Three factors, One factor.

Figure 8: Within quarter predictions (with search data). Panel title: Current quarter predictions (Google Trends). Axes: YoY GDP (from -0.02 to 0.16) by date, 2004 to 2014. Series: Data, Four factors, One factor.

Figure 9: Prediction error over time (no search data). Panel title: Residuals for 3 factor and 1 factor models (no Google trends). Axes: residual (from -0.06 to 0.04) by date, 2006 to 2016. Series: 3 factors, 1 factor.

Figure 10: GDP Root Mean Square Error

Series  3 factors  1 factor  4 factors (with trends)  1 factor (with trends)
RMSE    0.0139     0.0157    0.0143                   0.0165

Figure 9 gives a more detailed account of prediction error over time. While the movements for the multifactor and single factor models are similar, the multifactor model is more effective in predicting the higher volatility periods. For a more formal comparison of forecasting performance, Figure 10 shows root mean square error. For each set of data, the single factor model performs worse than the multifactor model. Comparing the Trends and non-Trends models, it appears that adding the Search data does not improve predictions. From Figure 11, notice that these performance differences are constant over time. Displaying the RMSFE, the three factor no Trends model consistently outperforms the other three models from 2006 to 2018. Similarly, the other three models maintain the same relative position, with all four models moving together.
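The comparison in Figure 10 can be restated in relative terms. The short sketch below takes the four RMSE values reported in Figure 10 and divides each by the smallest one; the dictionary keys simply reuse the column labels, and the variable names are illustrative.

```python
# RMSE values as reported in Figure 10.
gdp_rmse = {
    "3 factors": 0.0139,
    "1 factor": 0.0157,
    "4 factors (with trends)": 0.0143,
    "1 factor (with trends)": 0.0165,
}

best = min(gdp_rmse, key=gdp_rmse.get)
relative = {name: value / gdp_rmse[best] for name, value in gdp_rmse.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(name, round(ratio, 3))
```

Relative to the three factor model without Trends, the reported errors are roughly 3% higher for the four factor Trends model, 13% higher for the single factor model, and 19% higher for the single factor Trends model.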

Next, I generated predictions for the Indian Consumer Price Index series. This exercise gives insight into the Google Trends “consumer sentiment” measurement hypothesis. Prices are tied more closely to movements in sentiment than GDP is. If the Trends data is a good indicator for sentiment, we would expect Trends data to increase the performance of the model. Note that CPI is a monthly, not a quarterly series, so predictions are created for the current month.

Figure 11: Models perform similarly over time. Panel title: Root mean squared forecast error over time. Axes: RMSFE (from 0 to 0.025) by year, 2006 to 2016. Series: 3 factor, 1 factor, 4 factor (Trends), 1 factor (Trends).

Figures 12 and 13 display predictions for CPI. The four models predict CPI fairly well, as the prediction error is low. The improved prediction performance indicates that forecasting CPI is an easier prediction problem than GDP. CPI’s relatively low volatility and higher frequency make for better predictions. Monthly values make the series more predictable. Moreover, the higher performance in the baseline models using only economic data gives little room for performance improvements from including Trends data. Figure 14 gives root mean squared error for CPI predictions. Just as before, the Google Trends models perform slightly worse than the traditional economic models.

Figure 12: CPI predictions using economic data. Panel title: Current quarter predictions. Axes: YoY CPI (from 0.02 to 0.18) by date, 2004 to 2014. Series: Data, Baseline (1 factor), Baseline (3 factors).

Figure 13: CPI predictions using economic data plus search data. Panel title: Current quarter predictions. Axes: YoY CPI (from 0.02 to 0.18) by date, 2004 to 2014. Series: Data, Trends (four factors), Trends (one factor).

Figure 14: Root mean squared error for CPI

Series  3 factors  1 factor  4 factors (with trends)  1 factor (with trends)
RMSE    0.0048     0.0059    0.0050                   0.0048

5 Discussion

For Indian GDP, incorporating search data into the prediction procedure provides no improvement over the baseline model. However, incorporating a multifactor setup appears to improve forecasts.

These differences reflect the difficulty of the problem. In some cases, the GDP figure

itself changed by greater than 1% (Sapre and Sengupta, 2017). Advanced estimates can

differ substantially from final numbers. As a result, forecasting can suffer from imprecision,

complementing existing problems with noisy and unreliable data.

Also, the current simulation does not take into account the effect of staggered data releases on changes to prediction. While releasing data concurrently is parsimonious, a more realistic approach would be to add data according to a release calendar. Instead of modifying the dataset each month, data can be added on their release dates. This would allow for a more careful study of how particular series releases move forecasts.

Perhaps more importantly, following a release calendar can help take timeliness into

account. Release sequencing is especially important for studying qualitative data. While

soft indicators generally are noisier, they are subject to shorter publication lag. These

series give an early signal (albeit less precise) to the direction of the target series. Without

release sequencing, these advantages disappear. The current simulation does not give an

environment for seeing the strengths of the Google Search data.

For the future, the study can be improved by more closely examining the mechanism for measuring consumer sentiment. While this study includes a broad selection of terms, more careful consideration of cultural and linguistic norms would likely aid the analysis. India has the second-largest number of English speakers, but more closely examining ground-level search patterns would shed light on the specific transmission of economic attitudes.

References

Banbura, Marta, and Michele Modugno. 2014. “Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data.” Journal of Applied Econometrics, 29(1): 133–160.

Banbura, Marta, Domenico Giannone, and Lucrezia Reichlin. 2011. “Nowcasting.” Oxford Handbook of Economic Forecasting.

Bardhan, Pranab. 2013. “The State of Indian Economic Statistics: Data Quantity and Quality Issues.”

Bragoli, Daniela, and Jack Fosten. 2017. “Nowcasting Indian GDP.” Oxford Bulletin of Economics and Statistics (forthcoming).

Census of India. 2001. “Population by Bilingualism, Trilingualism, Age and Sex.”

Choi, Hyunyoung, and Hal Varian. 2012. “Predicting the Present with Google Trends.” Economic Record, 88(SUPPL.1): 2–9.

Doz, Catherine, Domenico Giannone, and Lucrezia Reichlin. 2006. “A Quasi Maximum Likelihood Approach for Large Approximate Dynamic Factor Models.” European Central Bank Working Paper Series, (674).

Giannone, Domenico, Lucrezia Reichlin, and David Small. 2008. “Nowcasting: The real-time informational content of macroeconomic data.” Journal of Monetary Economics, 55(4): 665–676.

Giannone, Domenico, Silvia Miranda Agrippino, and Michele Modugno. 2013a. “Nowcasting China Real GDP.”

Giannone, Domenico, Silvia Miranda Agrippino, and Michele Modugno. 2013b. “Nowcasting China Real GDP.”

Giannone, Domenico, Silvia Miranda Agrippino, and Michele Modugno. 2013c. “Nowcasting China Real GDP Forecast-Nowcast-Backcast: Q3-2013.” (February): 1–26.

Goldfarb, Avi, Shane M Greenstein, and Catherine E Tucker. 2015. Economic Analysis of the Digital Economy Volume.

Li, Xinyuan. 2016. “Nowcasting with Big Data: is Google useful in Presence of other Information?” 1–41.

Mariano, Roberto S., and Yasutomo Murasawa. 2003. “A new coincident index of business cycles based on monthly and quarterly series.” Journal of Applied Econometrics, 18(4): 427–443.

Pan, Bing, Doris Chenguang Wu, and Haiyan Song. 2012. Forecasting hotel room demand using search engine data. Vol. 3.

Sapre, Amey, and Rajeswari Sengupta. 2017. “An analysis of revisions in Indian GDP data.” (September).

Schmidt, Torsten, and Simeon Vosen. 2012. “Using internet data to account for special events in economic forecasting.” Ruhr Economic Paper, (382).

Seabold, Skipper, and Andrea Coppola. 2015. “Nowcasting Prices Using Google Trends: An Application to Central America.” World Bank Policy Research Working Paper, (7398).

StatCounter. 2018. “Search Engine Market Share India.”

Suhoy, Tanya. 2009. “Query Indices and a 2008 Downturn: Israeli Data.” Bank of Israel Discussion Paper Series, (6).

Swallow, Yan Carriere, and Felipe Labbe. 2010. “Nowcasting with Google Trends in an Emerging Market.” Documentos de Trabajo (Banco Central de Chile), 298(588): 1.

Vosen, Simeon, and Torsten Schmidt. 2011. “Forecasting private consumption: Survey-based indicators vs. Google trends.” Journal of Forecasting, 30(6): 565–578.

World Bank. 2017. “World Development Indicators.”

A Structure of model

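In practice, the monthly, quarterly, and idiosyncratic errors are stacked through block matrices. The stacked version of the observation equation is displayed below. Notice that the monthly series are ordered before the quarterly series by convention, and that the monthly to quarterly interpolation scheme appears in the second row.

$$
\begin{pmatrix} x_t \\ y^Q_t \end{pmatrix}
=
\begin{pmatrix} \mu \\ \mu_Q \end{pmatrix}
+
\begin{pmatrix}
\Lambda & 0 & 0 & 0 & 0 & I_n & 0 & 0 & 0 & 0 & 0 \\
\Lambda_Q & 2\Lambda_Q & 3\Lambda_Q & 2\Lambda_Q & \Lambda_Q & 0 & 1 & 2 & 3 & 2 & 1
\end{pmatrix}
\begin{pmatrix}
f_t \\ f_{t-1} \\ f_{t-2} \\ f_{t-3} \\ f_{t-4} \\ \varepsilon_t \\ \varepsilon^Q_t \\ \varepsilon^Q_{t-1} \\ \varepsilon^Q_{t-2} \\ \varepsilon^Q_{t-3} \\ \varepsilon^Q_{t-4}
\end{pmatrix}
\qquad (4)
$$

Similarly, the idiosyncratic and factor relationships are stacked for the transition equation. To be consistent with the quarterly to monthly interpolation scheme, five periods of the factors $f$ are used.

$$
\begin{pmatrix}
f_t \\ f_{t-1} \\ f_{t-2} \\ f_{t-3} \\ f_{t-4} \\ \varepsilon_t \\ \varepsilon^Q_t \\ \varepsilon^Q_{t-1} \\ \varepsilon^Q_{t-2} \\ \varepsilon^Q_{t-3} \\ \varepsilon^Q_{t-4}
\end{pmatrix}
=
\begin{pmatrix}
A_1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
I_r & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & I_r & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & I_r & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & I_r & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \mathrm{diag}(\alpha_1, \ldots, \alpha_n) & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \alpha_Q & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
f_{t-1} \\ f_{t-2} \\ f_{t-3} \\ f_{t-4} \\ f_{t-5} \\ \varepsilon_{t-1} \\ \varepsilon^Q_{t-1} \\ \varepsilon^Q_{t-2} \\ \varepsilon^Q_{t-3} \\ \varepsilon^Q_{t-4} \\ \varepsilon^Q_{t-5}
\end{pmatrix}
+
\begin{pmatrix}
u_t \\ 0 \\ 0 \\ 0 \\ 0 \\ e_t \\ e^Q_t \\ 0 \\ 0 \\ 0 \\ 0
\end{pmatrix}
\qquad (5)
$$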