
Big Data and Macroeconomic Nowcasting

George Kapetanios, Massimiliano Marcellino & Fotis Papailias

King’s College London, UK
Bocconi University, IT
Queen’s Management School, UK

April 7th 2016

Outline

Introduction
- Types of Big Data
- Pros and Cons of Big Data for Macroeconomic Nowcasting
- Data & Methodological Issues

Part 1: Literature Review
- Existing literature on the use of Big Data
- Existing literature on Macroeconomic Nowcasting
- What about their combination?

Part 2: Big Data Modelling
- Machine Learning
- Heuristic Optimisation
- Dimensionality Reduction
- Forecast Combination & Model Averaging
- Mixed Frequency Methods for Big Data
- Big Data Access, Cleaning and Preparation


Part 3: Empirical Analysis
- Data Description
- Nowcasting & Short-Term Forecasting Evaluation

Conclusions

Introduction

The recent crisis has emphasized the importance for policy-makers and economic agents of a real-time assessment of the current state of the economy and its expected developments, when a large but incomplete information set is available.

The main obstacle is the delay with which key macroeconomic indicators, such as GDP and its components, but also fiscal variables, regional/sectoral indicators and disaggregate data, are released.

Example: GDP data are only available on a quarterly basis and the advance/flash estimate is published only with a 4-6 week delay or longer, depending on the country. Moreover, preliminary data are often revised afterwards, in particular around turning points of the business cycle.


On the other hand, a large and growing number of more timely leading and coincident indicators is available, at a monthly, daily or even higher frequency, based in particular on financial, survey and internet data, though sometimes subject to short samples, missing observations and other data irregularities.

This has stimulated a vast amount of statistical and econometric research on how to exploit the large, timely and higher frequency but irregular information to provide estimates for key low frequency economic indicators.

A parallel, more empirical, literature has instead focused specifically on the use of big data for nowcasting economic indicators, often using rather simple econometric techniques and specific big data sources, mainly Google Trends.


Finally, a more theoretical literature has developed new, or adapted old, statistical and econometric methods to handle very large sets of explanatory variables, such as those associated with big data.

Broadly speaking, they are based either on summarizing the many variables, or on selecting them, or on combining many small models, after a proper data pre-treatment.

Some techniques are borrowed from machine learning, where prediction is of key interest but data are typically assumed to be i.i.d. rather than serially correlated and possibly with changing variances over time, so that these techniques have to be properly extended prior to use on economic data.


Introduction: Big Data Types

The literature provides various definitions of big data. Part of the problem is that the meaning of big data can differ significantly across disciplines.

One possibility to obtain a general classification is to adopt the “4 Vs” classification, originated by IBM, which relates to:

(i) Volume (Scale of data),

(ii) Velocity (Analysis of streaming data),

(iii) Variety (Different forms of data) and

(iv) Veracity (Uncertainty of data).

However, this classification seems too general to guide empirical nowcasting applications.


A second option is to focus on numerical data only, which can either be the original big data or the result of a transformation of unstructured data, and refer to the size of the dataset.

Unstructured data, such as credit card transactions or other data that describe the disaggregated actions or characteristics of many agents, are likely to need to be transformed into a two-dimensional panel structure where time is usually one dimension.


Following, e.g., Doornik and Hendry (2015), we can distinguish three main types of big data:

1 Tall

2 Fat

3 Huge


Introduction: Big Data Types “Tall”

“Tall”: not so many variables, N, but many observations, T, with T ≫ N.

This is for example the case with tick-by-tick data on selected financial transactions or search queries.

In this case T is indeed very large in the original time scale, say seconds, but it should be considered whether it is also large enough in the time scale of the target macroeconomic variable of the nowcasting exercise, say quarters.


Tall datasets at very high frequency are not easily obtained, as they are generally owned by private companies.

Moreover, they generally require substantial pre-treatment, as indicators typically present particular temporal structures (related, e.g., to market micro-structure) and other types of irregularities, such as outliers, jumps and missing observations.


Introduction: Big Data Types “Fat”

“Fat”: many variables, but not so many observations, N ≫ T.

Large cross-sectional databases fall into this category, which is not so interesting from an economic nowcasting point of view, unless either T is also large enough or the variables are homogeneous enough to allow proper model estimation (e.g., by means of panel methods) and nowcast evaluation.

However, Fat datasets can be of interest in many other applications of big data, both inside official statistics, e.g., for survey construction, and outside, e.g., in marketing or medical studies.

Also, as the collection of big data started only rather recently, Fat datasets are perhaps the most commonly available type.


Actually, statistical methods for big data are mainly meant for Fat datasets, e.g., those developed in the machine learning literature, as they only require a large cross-section of i.i.d. variables.

When a (limited) temporal dimension is also present, panel estimation methods are typically adopted in the economic literature, but factor-based methods can also be applied.

Classical estimation methods are not well suited, as their finite (T) sample properties are generally hardly known, while Bayesian estimation seems more promising, as it can easily handle a fixed T sample size and, with proper priors, also a large cross-sectional dimension.


Introduction: Big Data Types “Huge”

“Huge”: many variables and many observations, i.e., very large N and T.

This is perhaps the most interesting type of data in a nowcasting context even if, unfortunately, it is not so often available.

Big data collection started only recently, while collection of the target economic indicators started long ago, generally as far back as the 1950s or 1960s for many developed countries.

Google Trends, publicly available summaries of a huge number of specific search queries in Google, are perhaps the best example in this category, and not by chance the most commonly used indicators in economic nowcasting exercises.


Contrary to basic econometrics and statistics, in Huge datasets both T and N diverge, and proper techniques must be adopted to take this feature into account at the level of model specification, estimation and evaluation.

For example, in principle it is still possible to consider information criteria such as BIC or AIC for model specification (indicator selection for the target variable in the nowcasting equation), although modifications may be needed to account for the fact that N is comparable to or larger than T, as opposed to much smaller as assumed in the derivations of information criteria.


Further, in the case of a linear regression model with N regressors, 2^N alternative models should be compared, which is not computationally feasible when N is very large, so that efficient algorithms that only search specific subsets of the 2^N possible models have been developed.
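To make the combinatorial point concrete, the sketch below contrasts the exhaustive comparison with a greedy forward-selection rule that scores candidates by BIC and visits at most N models per step rather than all 2^N subsets. It is a minimal illustration on simulated data, not the specific algorithms referred to above.

```python
import numpy as np

def bic(y, X):
    """BIC of an OLS fit of y on X (X already includes an intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return n * np.log(resid @ resid / n) + k * np.log(n)

def forward_select(y, X, max_vars=10):
    """Greedy forward selection: repeatedly add the regressor that lowers BIC most."""
    n, N = X.shape
    selected, best = [], bic(y, np.ones((n, 1)))
    while len(selected) < max_vars:
        scores = {j: bic(y, np.column_stack([np.ones(n), X[:, selected + [j]]]))
                  for j in range(N) if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:      # stop when BIC no longer improves
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# toy data: N = 100 candidate indicators, T = 80 observations, 3 truly relevant
rng = np.random.default_rng(0)
T, N = 80, 100
X = rng.standard_normal((T, N))
y = X[:, [3, 17, 42]] @ np.array([1.0, -0.5, 0.8]) + 0.5 * rng.standard_normal(T)
print(forward_select(y, X))
```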

Moreover, the standard properties of the OLS estimator in the regression model are derived assuming that N is fixed and (much) smaller than T.

Some properties are preserved, under certain conditions, also when N diverges, but empirically the OLS estimator does not perform well due to collinearity problems that require a proper regularization of the second moment matrix of the regressors.


This is in turn relevant for nowcasting, as the parameter estimators are used to construct the nowcast (or forecast).

As a result of these problems for OLS, a number of regularisation, or penalisation, methods have been suggested.

An early approach, referred to as Ridge regression, uses shrinkage to ensure a well-behaved regressor sample second moment matrix.

More recently, other penalisation methods have been developed. A prominent example is LASSO, where a penalty is added to the OLS objective function in the form of the sum of the absolute values of the coefficients. Many related penalisation methods have since been proposed and analysed.
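As an illustration of these two penalisation schemes, the sketch below fits Ridge and LASSO regressions to a simulated fat dataset with scikit-learn; the penalty weights are hypothetical and would in practice be tuned, e.g., by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# simulated "fat" dataset: N = 200 indicators, T = 100 observations
rng = np.random.default_rng(1)
T, N = 100, 200
X = rng.standard_normal((T, N))
beta_true = np.zeros(N)
beta_true[:5] = [1.5, -1.0, 0.8, 0.6, -0.4]      # only 5 relevant indicators
y = X @ beta_true + 0.5 * rng.standard_normal(T)

# Ridge: L2 penalty, shrinks all coefficients towards zero but keeps them non-zero
ridge = Ridge(alpha=10.0).fit(X, y)

# LASSO: L1 penalty (sum of absolute coefficients added to the OLS objective),
# which sets many coefficients exactly to zero and therefore also selects variables
lasso = Lasso(alpha=0.1).fit(X, y)

print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))
print("non-zero LASSO coefficients:", np.sum(lasso.coef_ != 0))
```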


As an alternative to variable selection, the indicators could be summarized by means of principal components (or estimated factors) or related methods such as dynamic principal components or partial least squares.

However, standard principal component analysis is also problematic when N gets very large, but fixes are available, such as the use of sparse principal component analysis.
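A minimal sketch of the two summarisation routes, using scikit-learn's PCA and SparsePCA on simulated indicators; the number of components and the sparsity penalty are illustrative choices only.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
T, N, r = 120, 300, 3                    # T periods, N indicators, r latent factors
F = rng.standard_normal((T, r))          # latent factors
L = rng.standard_normal((N, r))          # factor loadings
X = StandardScaler().fit_transform(F @ L.T + rng.standard_normal((T, N)))

# standard principal components: each component loads on all N indicators
pcs = PCA(n_components=r).fit_transform(X)

# sparse principal components: many loadings are forced to exactly zero,
# which helps stability and interpretation when N is very large
spcs = SparsePCA(n_components=r, alpha=2.0, random_state=0).fit_transform(X)

print(pcs.shape, spcs.shape)             # (120, 3) (120, 3)
```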

Finally, rather than selecting or summarizing the indicators, they could all be inserted in the nowcasting regression, imposing tight priors on their associated coefficients, which leads to specific Bayesian estimators.
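A sketch of this third route: keep all N indicators in the regression but impose a tight Gaussian prior on the coefficients. With a prior β ~ N(0, τ²I) and Gaussian noise, the posterior mean is a ridge-type shrinkage estimator; the prior and noise scales below are illustrative assumptions.

```python
import numpy as np

def bayes_posterior_mean(y, X, tau2, sigma2):
    """Posterior mean of beta under beta ~ N(0, tau2 * I)
    and y = X beta + u with u ~ N(0, sigma2 * I)."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(N), X.T @ y)

rng = np.random.default_rng(3)
T, N = 100, 250                                   # more indicators than observations
X = rng.standard_normal((T, N))
beta = np.zeros(N)
beta[:4] = [1.0, -0.7, 0.5, 0.3]
y = X @ beta + 0.5 * rng.standard_normal(T)

loose = bayes_posterior_mean(y, X, tau2=10.0, sigma2=0.25)
tight = bayes_posterior_mean(y, X, tau2=0.01, sigma2=0.25)
print("average |coefficient|, loose prior:", np.mean(np.abs(loose)))
print("average |coefficient|, tight prior:", np.mean(np.abs(tight)))   # shrunk towards zero
```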


Introduction: Big Data Pros/Cons

In a nowcasting context, we think of big data as complements rather than substitutes for more common coincident and leading indicators.

Some of the problems discussed for internet-based big data also apply to large datasets of conventional indicators.

For example, collecting disaggregated macroeconomic and financial variables for an EU country easily leads to a few hundred indicators, and multiplying that for all the EU countries, e.g. in order to conduct a comparative analysis, leads to thousands of variables to be considered in formal econometric models, which can hardly be done with standard techniques.


Introduction: Big Data Issues

Data Availability. Most data pass through private providers and are related to personal aspects.

Hence, continuity of data provision may not be guaranteed.

For example, Google could stop providing Google Trends, or at least no longer make them available for free.

Or online retail stores could deny crawlers access to their websites for automatic price collection.

Or individuals could make wider use of software that prevents the tracking of their internet activities, or tracking could be more tightly regulated by law for privacy reasons.


Digital Divide. The fact that a sizable fraction of the population still has no or limited internet access.

This implies that the available data are subject to a sample selection bias, and this can matter for their use.

Suppose, for example, that we want to nowcast unemployment at a disaggregate level, either by age or by region.

Internet data relating to older people or people resident in poorer regions could lead to underestimation of their unemployment level, as they have relatively little access to internet-based search tools.


Changing Size & Quality. A third issue is that both the size and the quality of internet data keep changing over time, in general much faster than for standard data collection.

For example, applications such as Twitter or WhatsApp were not available just a few years ago, and the number of their users increased exponentially, in particular in the first period after their introduction.

Similarly, other applications can be gradually abandoned or put to different uses. For example, the fraction of goods sold by eBay through proper auctions is progressively declining over time, being replaced by other price formation mechanisms.


Bias in Answers. A fourth issue, again more relevant for digital than for standard data collection, is that individuals or businesses may not report truthfully their experiences, assessments and opinions.

For example, some newspapers and other sites conduct online surveys about the feelings of their readers (happy, tired, angry, etc.) and one could think of using them, for example, to predict election outcomes, as a large fraction of happy people should be good for the ruling political party.

But, if respondents are biased, the prediction could also be biased, and a large fraction of non-respondents could lead to substantial uncertainty.


Data Format. A fifth issue is that data may not be available in a numerical format, or not in a directly usable numerical format.

A similar issue emerges with standard surveys, for example on economic conditions, where discrete answers from a large number of respondents have to be somewhat summarized and transformed into a continuous index.

However, the problem is more common and relevant with internet data.


Irregularities. A final issue, again common also with standard data but more pervasive in internet data due to their high sampling frequency and broad collection set, relates to data irregularities:

outliers,

working days effects,

missing observations,

presence of seasonal / periodic patterns, etc.

all of which require properly de-noising and smoothing the data; a minimal pre-treatment sketch is given below.
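The sketch below shows one simple, hypothetical pre-treatment pipeline in pandas: interpolate missing observations, winsorise outliers, smooth out a weekly periodic pattern, and aggregate to the target frequency. Real applications would need more careful, series-specific treatment.

```python
import numpy as np
import pandas as pd

# hypothetical daily internet-based indicator with typical irregularities
rng = np.random.default_rng(4)
idx = pd.date_range("2015-01-01", periods=365, freq="D")
x = pd.Series(np.sin(2 * np.pi * idx.dayofweek / 7) + 0.3 * rng.standard_normal(365),
              index=idx)
x.iloc[50:55] = np.nan        # missing observations
x.iloc[200] = 15.0            # an outlier

x = x.interpolate()                                   # fill missing observations
lo, hi = x.quantile(0.01), x.quantile(0.99)
x = x.clip(lower=lo, upper=hi)                        # winsorise outliers
smooth = x.rolling(window=7, center=True).mean()      # de-noise / remove weekly pattern
monthly = smooth.resample("MS").mean()                # aggregate to the target frequency
print(monthly.head())
```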


Introduction: Big Data Advantages

Big data provide potentially relevant complementary information with respect to standard data, being based on rather different information sets.

Moreover, they are available in a timely manner and, generally, they are not subject to subsequent revisions, all relevant features for potential coincident and leading indicators of economic activity.

Finally, they could be helpful to provide a more granular perspective on the indicator of interest, both in the temporal and in the cross-sectional dimension.


In the temporal dimension, they can be used to update nowcasts at a given frequency, such as weekly or even daily, so that policy and decision makers can promptly update their actions according to the new and more precise estimates.

In the cross-sectional dimension, big data could provide relevant information on units, such as regions or sectors, not fully covered by traditional coincident and leading indicators.


Introduction: Summary

First, do we get any relevant insights? In other words, can we improve nowcast precision by using big data?

Second, do we get a big data hubris? Again, as anticipated, we think of big data based indicators as complements to existing soft and hard data-based indicators, and therefore we do not get a big data hubris.

Third, do we risk false positives? Namely, can we get some big data based indicators that nowcast well just due to data snooping?


Introduction: Issues Summary

Fourth, do we mistake correlations for causes?

Fifth, do we use the proper econometric methods?

Sixth, do we have instability due to Algorithm Dynamics or other causes (e.g., the financial crisis, more general institutional changes, the increasing use of the internet, discontinuity in data provision, etc.)?

Finally, do we allow for variable and model uncertainty?


Part 1: Literature Review


Literature Review

We review the most important research papers on big data in three distinct areas:

Big Data in macroeconomics,

Variable selection and dimensional reduction for big data in macroeconomics,

Nowcasting in macroeconomics.


The aim of the review is to try to answer the following questions:

1 What are possible big data sources in relation to the list of Eurostat’s Unit C1 macroeconomic indicators (i.e. GDP, inflation and producer prices, employment and unemployment, industrial production index and retail trade deflated turnover)?

2 What are the advantages and disadvantages of the previously analysed sources?

3 What are the main types of statistical methods used in the big data in macroeconomics literature?

4 What are the possible gains generated either by the use of big data or new statistical methods or both in comparison with existing practices in the field of nowcasting?


Based on the surveyed papers, we can say that the use of Google Trends has been dominant in studies using big data in macroeconomics.

There exist some papers based on Twitter data, also reviewed, but they are mainly in finance.

Webscraping and collection of online prices also offer some potential, especially for nowcasting inflation.

However, such datasets are very difficult to obtain (and possibly sustain), even more so when many countries and long enough samples are required.

A similar comment applies to credit card and financial transactions data, and to data summaries resulting from textual analysis.


From the literature it also emerges that the advantages of using data like Google Trends are:

more timely forecasts, not subject to data revision,

some improvements in forecast accuracy, even though these typically emerge with respect to simple benchmarks (AR models),

ease of data access and collection,

ease of data management and treatment,

expected good data quality,

reasonable likelihood that similar data will be available on a continuous basis and without major definitional changes.


There are also some disadvantages when using this data source, the main ones being:

a typical sole use of such data, which can lead to biased results (“big data hubris”),

the impossibility of accessing the raw data, and the lack of knowledge of the precise algorithms used to pre-treat and summarize them,

the possibility that free access will be discontinued by the (private) data provider, or limited due to the introduction of more stringent privacy laws.


Varian (2014) provides an intuitive introduction to big data management and manipulation.

Big data, possibly after transformation to some kind of numerical format, have to be stored in some sort of database, as they are difficult to deal with in spreadsheets.

A medium-sized dataset (i.e. about a million observations) could be stored and manipulated using a relational database, such as MySQL, whereas a dataset of several million observations could be efficiently stored and manipulated by NoSQL databases.

Sometimes, and depending on the nature of the research, a carefully selected subsample or summary of the data might be sufficient for analysis.


Hence, big data creation, storage and management typically require specific IT skills, software and hardware.

The associated costs should be taken into consideration when assessing the potential benefits of big data for nowcasting.


In terms of statistics and econometrics, data analysis is typically broken down into four categories:

pre-treatment and summarisation,

estimation,

hypothesis testing and

prediction.

Since a large amount of data is available, penalised regressions such as LASSO, LARS, and elastic nets can be used instead of the standard linear or logistic regression.

These techniques could also be used for variable selection.

Then, the choice of the final model should come from forecasting cross-validation so that the researcher makes sure the model has good out-of-sample predictive ability.
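A minimal sketch of this idea: tune the LASSO penalty by time-series cross-validation (expanding-window splits), so that the selected model is judged on pseudo out-of-sample predictive ability rather than in-sample fit. Data and settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(5)
T, N = 150, 80
X = rng.standard_normal((T, N))
beta = np.zeros(N)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + 0.5 * rng.standard_normal(T)

# expanding-window splits respect the time ordering of the observations,
# unlike the random splits used for i.i.d. data
cv = TimeSeriesSplit(n_splits=5)
model = LassoCV(cv=cv).fit(X, y)

print("chosen penalty:", model.alpha_)
print("selected indicators:", np.flatnonzero(model.coef_))
```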


Literature Review: Main Findings

The purpose of the literature review is to answer four questions:

1 What are possible big data sources in relation to the list of Eurostat’s Unit C1 macroeconomic indicators?

2 What are the advantages and disadvantages for each of the previously analysed sources?

3 What are the main types of statistical methods used in the big data in macroeconomics literature?

4 What are the possible gains generated either by the use of big data or new statistical methods or both in comparison with existing practices in the field of nowcasting?


As anticipated in the Introduction, after a careful examination of the most important papers in each area we can say that the majority of big data papers are based on Google Trends as predictors.

The advantages of using data like Google Trends include:

a the improved timeliness of the forecasts without need for data revision,

b the potential improvement of forecasts,

c open access to the data,

d easy data handling,

e good data quality,

f reasonable possibility that this sort of data will be released on a continuous basis.


However, one must be cautious when using data of this type as it is often associated with the following issues:

a the use of Google data as the only data input could lead to biased results (commonly known as “big data hubris”),

b the restriction that only the Google index, and not the raw data, can be accessed,

c the possibility that free access will be discontinued.

Therefore, we suggest the use of Google Trends for nowcasting the macroeconomic variables of interest for Eurostat.

We strongly believe that such data must be used as a supplement to current forecasting tools and not as a substitute.


Regarding questions (3) and (4), it turns out that most of the papers in the literature generate nowcasts based on mixed frequency versions of linear regressions, VARs, (dynamic) factor models or a combination of them, and adopt various strategies for variable selection in the presence of a large set of potential regressors.

While no clear-cut ranking of the alternative methodologies emerges, there seems to be consensus about the usefulness of big data for nowcasting variables such as unemployment, GDP, inflation and surveys, even though the gains are often computed with respect to (too) simple benchmarks.


Part 2: Big Data Modelling


Big Data Modelling

Let y_t, t = 1, ..., T, be the target variable and x_t = (x_{1t}, ..., x_{Nt})′ be a set of potential predictors, with N very large.

We do not assume a particular data generating process for y_t but simply posit the existence of a representation of the form

y_t = a + g(x_{1t}, ..., x_{Nt}) + u_t,    (1)

which implies that E(u_t | x_{1t}, ..., x_{Nt}) = 0.

While the potential nonlinearity in (1) might, in principle, be worth exploring, it is extremely difficult to model nonlinearities in the context of large datasets and no work is available on this in the big data literature.


As a result, we consider an approximating linear representation of the form,

y_t = a + Σ_{i=1}^{N} β_i x_{it} + u_t,    (2)

with u_t denoting a martingale difference process and where the set of x_{it}s can also contain products of the original indicators in order to provide a better approximation to (1).
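As a small illustration of the last remark, the regressor set can be augmented with pairwise products of the original indicators, so that the linear regression (2) approximates interaction effects in (1). The sketch below uses scikit-learn's PolynomialFeatures; in a genuine Big Data setting this would of course only be feasible for a pre-selected subset of indicators.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
T, N = 100, 10
X = rng.standard_normal((T, N))

# add all pairwise products x_i * x_j (interaction terms) to the regressor set
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False).fit_transform(X)
print(X.shape, "->", interactions.shape)   # (100, 10) -> (100, 55): 10 levels + 45 products
```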


Our main aim is to provide estimates for current and future values of y_t, where either no, or only a preliminary, value for y_t is available from official statistics.

To do so, we can rely on many approaches, which can be categorised into three main strands.

The first strand aims to provide estimates for β = (β_1, ..., β_N)′.

While ordinary least squares (OLS) is the benchmark method for doing so, it is clear that if N is large this is not optimal or even feasible (when N > T).


So other methods need to be used. We consider two classes of methods.

The first one is sparse regression, with origins in the machine learning literature. A main aim there is to stabilise the variability of the estimated β_i.

The second class considers the use of a variety of information criteria such as AIC or BIC to select a smaller subset of all the available predictors.


The second strand consists of reducing the dimension of x_t by producing a much smaller set of generated regressors, which can then be used to produce nowcasts and forecasts in standard ways.

The third strand suggests the use of a (possibly very large) set of small models, one for each available indicator or small subset of them, and then the combination of the resulting many nowcasts or forecasts.
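A minimal sketch of this third strand: estimate one small bivariate model per indicator and combine the resulting one-step-ahead forecasts with equal weights. The data and the lag structure below are a simplified illustration, not the specifications used in the empirical part.

```python
import numpy as np

rng = np.random.default_rng(7)
T, N = 120, 50
X = rng.standard_normal((T, N))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.5 * rng.standard_normal(T)

def small_model_forecast(y, x, x_next):
    """OLS of y on a constant and a single indicator x; forecast at x_next."""
    Z = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[0] + beta[1] * x_next

# one small model per indicator, estimated on t = 1, ..., T-1,
# each producing its own forecast of y at time T
forecasts = np.array([small_model_forecast(y[:-1], X[:-1, i], X[-1, i])
                      for i in range(N)])

combined = forecasts.mean()              # simple equal-weight forecast combination
print("combined forecast:", combined, " actual:", y[-1])
```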


Big Data Modelling: Main Findings

We feel that Multiple Testing (related to Boosting) and, potentially, some variant of LASSO could be the preferred approaches among the set of machine learning techniques.

Among the data reduction techniques, PCA and, possibly, PLS are promising.

It would also be worthwhile experimenting with Bayesian regression, with substantial shrinkage, and forecast combination, with simple equal weighting.


Finally, all these approaches should also be modified to take into account the possibility of a different timing for the target and the indicator variables.

We have surveyed a number of alternative methods to handle mixed frequencies, and it turns out that Unrestricted MIDAS and bridge modelling appear as the most promising approaches, as they preserve linearity and do not add an additional layer of computational complexity.
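A minimal sketch of the unrestricted MIDAS (U-MIDAS) idea: each quarterly observation of the target is regressed on the individual monthly values of an indicator within the quarter, with one unrestricted coefficient per monthly lag. The data and the number of lags below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
quarters = 80
x_m = rng.standard_normal(3 * quarters)                 # monthly indicator

# quarterly target driven by the three months of each quarter (unknown weights)
x_q = x_m.reshape(quarters, 3)                          # months within each quarter
y_q = x_q @ np.array([0.5, 0.3, 0.2]) + 0.3 * rng.standard_normal(quarters)

# U-MIDAS regression: one unrestricted coefficient per monthly lag, plus a constant
Z = np.column_stack([np.ones(quarters), x_q])
beta, *_ = np.linalg.lstsq(Z, y_q, rcond=None)
print("estimated monthly-lag coefficients:", beta[1:])  # close to [0.5, 0.3, 0.2]
```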


Part 3: Empirical Analysis


Empirical Analysis: Introduction

We consider nowcasting and short-term (one- to twelve-month-ahead) forecasting of three key macroeconomic variables:

1 inflation (measured by the growth rate in the Consumer Price Index),

2 growth in retail sales (measured by the growth rate of the Retail Trade Index), and

3 the Unemployment Rate.

The exercise is conducted recursively in a pseudo out-of-sample framework, using monthly data for three economies:

Germany (DE),

Italy (IT) and

the UK.


We assess the relative performance of:

Big Data (proxied via weekly Google Trends) and

standard indicators (based on a large set of weekly and monthly economic and financial variables)

In fact, as we have mentioned several times, we think of Big Data as providing complementary information, and we wish to assess how useful it is in a forecasting context relative to standard indicators.


We also evaluate the role of several econometric methods and alternative specifications for each of them (with or without big data), also capable of handling the frequency mismatch in our data.

Specifically, we consider:

Naive and autoregressive (AR) models as benchmarks

Dynamic Factor Analysis (DFA)

Partial Least Squares (PLS)

Bayesian Regression (BR) and

LASSO regression.

DFA and PLS are representatives of data reduction methods; BR and LASSO are representatives of, respectively, econometric and machine learning techniques.

In addition, we also consider model averaging. Overall, we have a total of 255 models and model combinations.


In general, we find that factors extracted using a combination of standard macroeconomic and Big Data predictors lead to a substantial improvement in the nowcasting and forecasting performance for all variables across the three economies, though the gains are small for unemployment, which is a very persistent variable over our evaluation sample, making a simple AR model difficult to beat, in particular at long horizons.

Furthermore, a data-driven automated model selection strategy, where the forecasts from a set of best-performing models over the recent past are pooled, performs particularly well, with Big Data present in about 45% of the pooled models.


Empirical Analysis: Conclusions

In general, we find that DFA and BR often provide accurate results with and without Google Trends.

Most of the suggested models and model averages perform better than the benchmark when nowcasting/forecasting the CPI and RTI; however, they fail for the unemployment rate.

This is due to the structure of the unemployment rate series, which is slowly moving in our 32-38 evaluation months.


A longer evaluation period could reverse this finding, and provide more reliable results, but unfortunately we are constrained by the availability of Google Trends, which start in 2004.

Our suggested data-driven automated strategy, which rotates models based on their forecasting performance in the past 1, 6 and 12 months, seems to work extremely well for CPI and RTI, and decently for unemployment, and therefore it provides a powerful tool for applied forecasting.
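The sketch below illustrates the rotation logic in a stylised way: given a matrix of past forecast errors for a set of candidate models, keep the models with the lowest mean squared error over the last 1, 6 and 12 months and average their current forecasts. The window lengths come from the text; the number of retained models and everything else is an illustrative assumption.

```python
import numpy as np

def rotate_and_pool(past_errors, current_forecasts, windows=(1, 6, 12), keep=3):
    """past_errors: (months, models) array of past forecast errors;
    current_forecasts: (models,) array of the models' current forecasts.
    Keep the `keep` best models per evaluation window and equally weight
    the union of their current forecasts."""
    chosen = set()
    for w in windows:
        mse = np.mean(past_errors[-w:, :] ** 2, axis=0)
        chosen.update(np.argsort(mse)[:keep])
    chosen = sorted(chosen)
    return np.mean(current_forecasts[chosen]), chosen

rng = np.random.default_rng(9)
n_models = 20
past_errors = rng.standard_normal((24, n_models)) * np.linspace(0.5, 2.0, n_models)
current_forecasts = rng.standard_normal(n_models)

pooled, used = rotate_and_pool(past_errors, current_forecasts)
print("pooled forecast:", pooled, "models used:", used)
```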


The detailed analysis of model rotation shows that, on average, 45% of the time the chosen best models for averaging, for the CPI and RTI target variables across all three economies, include Google Trends.

Hence, we conclude that Big Data, as proxied via Google Trends in our applications and combined with standard macroeconomic and financial indicators, can indeed improve the nowcasting and short-term forecasting performance of econometric models.


Conclusions


Overall Recommendations

Overall, our suggestion is to take a pragmatic approach that balances potential gains and costs from the use of Big Data for nowcasting macroeconomic indicators, in addition to standard indicators.

A preliminary step should be an a priori assessment of the potential usefulness of Big Data for a specific indicator of interest, such as GDP growth, inflation or unemployment.


This requires evaluating the quality of the existing nowcasts and whether any identified problems, such as bias, inefficiency or large errors in specific periods, can be fixed by adding the information potentially available in Big Data based indicators.

Similarly, it should be considered whether these additional indicators could improve the timeliness, frequency of release and extent of revision of the nowcasts.

Relevant information can be gathered by looking at existing empirical studies focusing on similar variables or countries, and in this respect the extensive literature review we presented can be quite helpful.


Once Big Data passes the “need check” in the preliminary step, the first proper step of the Big Data based nowcasting exercise is a careful search for the specific Big Data to be collected.

As we have seen, there are many potential providers, which can be grouped into Social Networks, Traditional Business Systems, and the Internet of Things.

Naturally, it is not possible to give general guidelines on a preferred data source, as its choice is heavily dependent on the target indicator of the nowcasting exercise.


Having identified the preferred source of Big Data, the second step requires assessing the availability and quality of the data.

A relevant issue is whether direct data collection is needed, which can be very costly, or whether a provider makes the data available.

In case a provider is available, its reliability (and cost) should be assessed, together with the availability of metadata, the likelihood that continuity of data provision is guaranteed, and the possibility of customization (e.g., making the data available at a higher frequency, with a particular disaggregation, for a longer sample, etc.).


All these aspects are particularly relevant in the context of applications in official statistical offices.

As the specific goal is nowcasting, it should also be carefully checked that the temporal dimension of the Big Data is long and homogeneous enough to allow for proper model estimation and evaluation of the resulting nowcasts.


The third step analyzes specific features of the collected Big Data.

A first issue that is sometimes neglected is the amount of the required storage space and the associated need for specific hardware and software for storing and handling the Big Data.

A second issue is the type of the Big Data, as they are often unstructured and may require a transformation into cross-sectional or time series observations.


Even when already available in numerical format, pre-treatment of the Big Data is often needed to remove deterministic patterns and deal with data irregularities, such as outliers and missing observations.

While standard methods can usually be applied, the size of the datasets suggests resorting to robust and computationally simple approaches, applied variable by variable.


The fourth step requires assessing the presence of a possible bias in the answers provided by the Big Data, due to the “digital divide” or the tendency of individuals and businesses not to report truthfully their experiences, assessments and opinions.

A related problem, particularly relevant for nowcasting, is the possible instability of the relationship with the target variable.

This is a common problem also with standard indicators, as the type and size of economic shocks that hit the economy vary over time. Both issues can, however, be tackled at the modelling and evaluation stages.


The fifth step when nowcasting with Big Data requires selecting the proper econometric technique.

Here, it is important to be systematic about the correspondence between the nature of the Big Data setting and use under investigation and the method that is used.

There are a number of dimensions along which we wish to differentiate.


1. The first choice is between the use of methods suited for large but not huge datasets, and therefore applied to summaries of the Big Data (such as Google Trends, commonly used in nowcasting applications), or of techniques specifically designed for Big Data.

For example, nowcasting with large datasets can be based on factor models, large BVARs, or shrinkage regressions.


Huge datasets can be handled by sparse principal components, linear models combined with heuristic optimization, or a variety of machine learning methods (which, though, are generally developed assuming i.i.d. variables).

It is difficult to provide an a priori ranking of all these techniques, and there are few empirical comparisons and even fewer in a nowcasting context, so that it may be appropriate to apply and compare a few of them for nowcasting the specific indicator of interest.


2. A second dimension is the frequency of the available data.

If this frequency is mixed, then specific techniques for mixed frequency data become relevant.

Chief among them is unrestricted MIDAS, which provides a very flexible framework of analysis and can be adapted to work together with most, if not all, Big Data methods, be they machine learning or econometric.


3. Yet another dimension relates to the purpose for which large datasets are considered.

Possibilities include model or indicator selection, forecasting or a more structural analysis.

Of course, each purpose is best served by different methods, and the choice of method crucially depends on the purpose.

Most methods can be used for forecasting, and so the choice has to be case dependent.


We recommend that as many methods as possible are evaluated in a forecasting context, although past experience suggests that factor analysis and shrinkage methods can be of great use.

For model or indicator selection, penalised regression and the MT methods seem to be appropriate and have also been reported to have good potential.

Finally, for more structural analysis, huge datasets are likely to be more difficult to accommodate.

In this case, system methods that analyse the whole or a large proportion of the available data simultaneously seem necessary for a satisfactory analytical outcome.

Bayesian VAR models stand out as an appropriate method in this context.


The final step consists of a critical and comprehensive assessment of the contribution of Big Data for nowcasting the indicator of interest.

In order to avoid, or at least reduce the extent of, data and model snooping, a cross-validation approach should be followed, whereby various models and indicators are estimated over a first sample and they are selected and/or pooled according to their performance, but then the performance of the preferred approaches is re-evaluated over a second sample.

This procedure, which we have implemented in the empirical evaluation, provides a reliable assessment of the gains in terms of enhanced nowcasting performance from the use of Big Data.
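A stylised sketch of this two-stage evaluation: candidate models are ranked on a first (selection) sample, and only the winners' performance on a second, held-out sample is reported. The error data and the number of retained models are placeholders.

```python
import numpy as np

rng = np.random.default_rng(10)
n_models, n_periods = 30, 60
# squared nowcast errors for each candidate model in each period (simulated)
errors = rng.standard_normal((n_periods, n_models)) ** 2

first, second = errors[:40], errors[40:]     # selection sample / evaluation sample

# stage 1: select the best-looking models on the first sample
rank = np.argsort(first.mean(axis=0))
selected = rank[:5]

# stage 2: report their performance on the second sample only,
# which guards against data and model snooping
print("selection-sample MSE of winners:", first.mean(axis=0)[selected].round(2))
print("held-out MSE of winners:        ", second.mean(axis=0)[selected].round(2))
```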


To conclude, we are very confident that Big Data are valuable also in a nowcasting context, not only to reduce nowcast errors but also to improve the timeliness, frequency of release and extent of revision.

We hope that the approach we have developed in this project will be useful for many users.
