
CONFIDENTIAL
File ref: 17/7/4/1/3

Concept Paper on Big Data

Prepared by: CCBG ICT Subcommittee

Date: June 2017

Annexure B


Table of Contents

1. Introduction
   1.1 Background
   1.2 Purpose of the document
   1.3 Big Data and Central Banks
2. What is big data
3. The importance of big data in Central Banks
4. Big Data Analytics and available solutions
5. Data Harmonization
6. Way forward for doing big data analytics
7. Challenges of Big Data
   7.1 Technical
   7.2 Legal
   7.3 Privacy and Security
8. Conclusions and Recommendations
9. References
10. Annexures


1. Introduction

1.1 Background

By deliberation of the CCBG in February 2017, this project was incorporated into the SIA Project. The project was defined as a joint project carried out by one team comprising representatives from the following central banks:

  • Banco de Moçambique (team leader)
  • Banco Nacional de Angola
  • Banque Centrale du Congo
  • Central Bank of Lesotho
  • South African Reserve Bank
  • Reserve Bank of Zimbabwe

1.2 Purpose of the document

The purpose of this document is to provide a conceptual introduction to Big Data and to explore its potential uses in the central banks of the SADC region, as well as the challenges and implications that come from handling data with high-volume, high-velocity and high-variety characteristics.

1.3 Big Data and Central Banks

The concern with handling huge volumes of data started seventy-three years ago, when Fremont Rider, a librarian at Wesleyan University, wrote a book about the challenges of managing American university libraries in the future, estimating that they were doubling in size every sixteen years [1]. With the huge volumes of digital data created today, both structured and unstructured, this concern is more pressing than ever, as it is now possible to analyse that data and convert it into new opportunities in ways never before possible.

According to the results of a survey conducted by the Irving Fisher Committee on Central Bank Statistics (IFC) [2], two thirds of central banks have a strong interest in big data, as it is perceived as a potentially effective tool in supporting macroeconomic and financial stability analyses. The structural shift toward the exploitation of big data by other economic agents has also increased the interest of central banks. However, even with this strong interest, the survey found that the actual involvement of central banks in the use of big data is currently limited.


Challenges in exploiting big data are related, among others, to technology complexity and to high costs in terms of investment in human capital and IT.

Using big data, central banks can benefit from a number of advantages, including making better informed decisions in the following areas:

  • Economic forecasting
  • Business cycle analysis
  • Financial stability analysis

2. What is big data

The term big data was first used by John Mashey, who gave a talk entitled “Big Data and the Next Wave of InfraStress” in 1998 [12], stressing the need for a new type of computing that takes into account new developments in technology.

The term “big data” is not clearly defined in the literature on this topic. One common view, however, is that big data refers to huge volumes of data, both structured and unstructured, that cannot be stored and processed using a traditional approach within a given time frame. These attributes, namely Volume, Velocity and Variety, characterize the big data paradigm. Veracity [5] and Value [6] were added as further attributes a few years later.

Figure 1: The 3 Vs of Big Data [7]

So, what does huge volume mean in terms of the size required to be classified as big data? The term big data can refer to gigabytes, terabytes, petabytes, exabytes or anything larger. However, data smaller than a gigabyte can also be called big data, depending on the context in which it is used. For example, current email systems commonly do not support an email with an attachment of 300 MB. As these email systems do not support attachments of this size, such data can be regarded as big data with respect to these systems. So, the term big data is not tied to a specific size of data but to the infeasibility of processing that data in a traditional computing environment.

On the other hand, popular social network sites such as Facebook, Twitter, LinkedIn, Google+ and YouTube each receive huge volumes of data on a daily basis. For example, LinkedIn alone receives tens of terabytes per day. As the number of users on these sites keeps growing, storing and processing this data becomes a challenging task. Since this data holds a lot of valuable information, it needs to be processed within a short span of time. The computing resources of traditional computing systems are not sufficient for processing and storing such huge volumes of data within a given time frame, so new techniques and tools are being developed and employed to analyse it. The term big data therefore refers not only to large data sets, but also to the frameworks, techniques and tools used to detect patterns in those data sets.

For central banks, the big data inventory is composed of macroeconomic data, survey data, financial institution data, third-party data, micro-level data and unstructured data (examiner reports, social media, etc.) [11].


Figure 2: Traditional and Newly Emerging Data Types Merging to Form Big Data for Central Banks [11].

3. The importance of big data in Central Banks

Historically, big data tools and techniques have had a bigger impact in certain sectors, such as the information and communications industry, than they have had in financial services. Nowadays, the situation is quite different, as researchers and the financial services sector are getting more and more involved in the use of big data for forecasting in economics and finance.

Although researchers in the field of economics are the major exploiters of big data for forecasting various economic variables, some banks around the world are also exploiting big data for the same purpose. Among various examples of forecasting with big data in economics, Choi and Varian [9] show how Google Trends variables can be used to predict economic indicators, outperforming models that exclude these predictors by 5% to 20%; Carriero et al. [10] used big data to obtain real-time GDP predictions for the US. In addition, some banks are using big data, among other things, to enhance the early detection of fraud [3] and to analyse banks' intraday liquidity management [4].
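The basic idea behind forecasting with auxiliary big data predictors can be sketched in a few lines. The example below uses entirely synthetic data and hypothetical variable names (it is not the specification from the cited papers): it fits a baseline autoregressive model and an augmented model that adds a stand-in for a search-intensity series, then compares their in-sample fit.

```python
import numpy as np

# Synthetic illustration of augmenting a simple autoregressive model
# with an auxiliary predictor (a stand-in for a Google Trends series).
# All data is simulated; this is a sketch, not the cited papers' models.
rng = np.random.default_rng(0)
n = 200
trends = rng.normal(size=n)        # hypothetical search-intensity series
y = np.zeros(n)
for t in range(1, n):
    # true process: AR(1) dynamics plus a contemporaneous trends effect
    y[t] = 0.5 * y[t - 1] + 0.8 * trends[t] + 0.1 * rng.normal()

target = y[1:]
X_base = np.column_stack([np.ones(n - 1), y[:-1]])             # y_t ~ 1 + y_{t-1}
X_aug = np.column_stack([np.ones(n - 1), y[:-1], trends[1:]])  # ... + trends_t

def in_sample_mse(X, t):
    # ordinary least squares fit, then mean squared residual
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    return float(np.mean((t - X @ beta) ** 2))

mse_base = in_sample_mse(X_base, target)
mse_aug = in_sample_mse(X_aug, target)
print(mse_aug < mse_base)  # True: the extra predictor improves the fit
```

In practice the comparison would be done out of sample (as in the nowcasting literature), since adding regressors always improves in-sample fit; the sketch only illustrates the mechanics of augmenting a model with a big data predictor.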


So it is important for central banks to understand how big data, including social media, news feeds and transaction-level trading data, can impact their strategies and operations, including making better informed decisions for macroeconomic and financial stability purposes [2], especially in the following areas:

  • Economic forecasting
    o Inflation
    o Housing prices
    o Unemployment
    o GDP
    o Industrial production
    o Retail sales
    o External sector developments
    o Tourism activity
  • Business cycle analysis
    o Sentiment indicators
    o Nowcasting techniques
  • Financial stability analysis
    o Construction of risk indicators
    o Assessment of investors’ behaviour
    o Identification of credit and market risk
    o Monitoring of capital flows
    o Supervisory tasks

4. Big Data Analytics and available solutions

The process of collecting, organizing and analysing huge volumes of data, both structured and unstructured, within a given time frame, with the aim of discovering patterns and other useful information, requires the use of specialized technologies. As stated before, the computing resources of traditional computing systems are not sufficient for processing and storing such huge volumes of data in a short time frame. Specialized technologies for big data comprise a wide variety of specialized tools and applications [13], summed up in the following table.


Predictive analytics: “Software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources to improve business performance or mitigate risk.” See Annex 2.

NoSQL databases: “Key-value, document, and graph databases.” A complementary addition to RDBMSs and SQL.

Search and knowledge discovery: “Tools and technologies to support self-service extraction of information and new insights from large repositories of unstructured and structured data that resides in multiple sources such as file systems, databases, streams, APIs, and other platforms and applications.”

Stream analytics: “Software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format.”

In-memory data fabric: “Provides low-latency access and processing of large quantities of data by distributing data across the dynamic random access memory (DRAM), Flash, or SSD of a distributed computer system.”

Distributed file stores: “A computer network where data is stored on more than one node, often in a replicated fashion, for redundancy and performance.”

Data virtualization: “A technology that delivers information from various data sources, including big data sources such as Hadoop and distributed data stores, in real-time and near-real time.”

Data integration: “Tools for data orchestration across solutions such as Amazon Elastic MapReduce (EMR), Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB.”

Data preparation: “Software that eases the burden of sourcing, shaping, cleansing, and sharing diverse and messy data sets to accelerate data’s usefulness for analytics.”

Data quality: “Products that conduct data cleansing and enrichment on large, high-velocity data sets, using parallel operations on distributed data stores and databases.”
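The stream analytics entry above can be made concrete with a minimal sketch: filter and aggregate a high-throughput feed incrementally, without keeping the whole stream in memory. The event values below are invented, and this is a single-process illustration of the idea, not a production streaming engine.

```python
# Minimal sketch of stream analytics: incrementally filter and aggregate
# a feed using fixed-size (tumbling) windows, so only the running window
# state is held in memory, never the full stream.

def tumbling_window_sums(events, window_size):
    """Sum each consecutive window of `window_size` valid events."""
    total, count, out = 0.0, 0, []
    for amount in events:
        if amount < 0:            # filter step: drop malformed records
            continue
        total += amount           # enrich/aggregate step: running sum
        count += 1
        if count == window_size:  # window closes: emit the aggregate
            out.append(total)
            total, count = 0.0, 0
    return out                    # any trailing partial window is dropped

stream = [10.0, 20.0, -1.0, 30.0, 5.0, 5.0, 50.0]  # hypothetical feed
print(tumbling_window_sums(stream, 3))  # [60.0, 60.0]
```

Real stream processors apply the same filter/aggregate pattern continuously over unbounded inputs and across many nodes; the sketch only shows the per-window logic.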


Most big data tools and technologies are open source. The Apache Software Foundation, a non-profit organization, has provided valuable open-source big data tools and technologies, among them Hadoop, the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce (see Annex 1). However, organizations have the option of using professional vendor support to help them get their big data platforms running. In addition, there are vendor-specific distributions (see Annex 2) based on open-source big data technologies, such as Cloudera, Hortonworks and MapR.
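To illustrate the MapReduce programming model mentioned above, here is a single-process word-count sketch of its three phases: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. This shows the programming model only, not Hadoop's distributed implementation.

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group emitted pairs by key, as the framework would do
    # between the map and reduce phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: aggregate each key's values (here, sum the counts)
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs new tools", "big data tools detect patterns"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["tools"])  # 2 2
```

In Hadoop, the map and reduce functions run in parallel on many nodes over data stored in HDFS, while the framework handles the shuffle, scheduling and fault tolerance.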

5. Data Harmonization

One of the biggest challenges in doing big data analytics is extracting insights from diverse information feeds from multiple, often unrelated sources. So before any insights are extracted from big data, there is a need to harmonise this diverse information to a common definition of granularity and naming conventions, enhancing in this way the quality and utility of the data.

Data harmonization in big data is a process that brings together a variety of data types, naming conventions and columns, and transforms them into one cohesive data set. Without it, the data tends to sit in separate and disparate units, making the process of gathering insights from the data much more difficult.
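As a small illustration of this process, the sketch below maps two differently structured reporting feeds onto one common schema, aligning field names, date formats and units before the records are combined. All field names, formats and figures here are invented for the example.

```python
from datetime import datetime

# Two hypothetical reporting sources with different field names,
# date formats and units (millions vs thousands of USD).
bank_a = [
    {"ReportDate": "2017-01-31", "TotalAssetsUSD_mn": 1200.0},
    {"ReportDate": "2017-02-28", "TotalAssetsUSD_mn": 1250.0},
]
bank_b = [
    {"date": "31/01/2017", "assets_usd_thousands": 900_000.0},
    {"date": "28/02/2017", "assets_usd_thousands": 910_000.0},
]

def harmonize_a(rec):
    # map source A onto the common schema
    return {
        "report_date": datetime.strptime(rec["ReportDate"], "%Y-%m-%d").date(),
        "total_assets_usd_mn": rec["TotalAssetsUSD_mn"],
        "source": "bank_a",
    }

def harmonize_b(rec):
    # map source B: different date format and a unit conversion
    return {
        "report_date": datetime.strptime(rec["date"], "%d/%m/%Y").date(),
        "total_assets_usd_mn": rec["assets_usd_thousands"] / 1000.0,
        "source": "bank_b",
    }

combined = [harmonize_a(r) for r in bank_a] + [harmonize_b(r) for r in bank_b]
print(len(combined), sorted(r["total_assets_usd_mn"] for r in combined))
# 4 [900.0, 910.0, 1200.0, 1250.0]
```

Once every source has been mapped onto the common schema, the combined data set can be analysed as one coherent unit instead of disparate silos.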

Another aspect of data harmonization concerns the financial firms that central banks regulate. Central banks have typically collected aggregate data from firms using reporting returns structured like standard financial statements. However, standard financial statements by themselves do not provide all the information needed if further questions arise. A way to overcome the gaps left by standard financial statements is to collect granular data once from financial firms, enabling the central bank to better spot systemic risk and manage it with macroprudential policy. In addition, to enhance the quality and utility of the data, there is a need to harmonise and enforce common definitions of granular data attributes, both across the organisation and with financial firms.


It should be noted that the ongoing SIA project is working on a data harmonization component, which has a lot in common with the data harmonization component of big data. A sensible way forward would be to extend the scope of SIA to include the data harmonization component of big data.

6. Way forward for doing big data analytics

Due to the great potential brought by big data, we believe that it is just a matter of time before big data exploitation takes place in the central banks. However, for this to happen smoothly, the central banks need a strong roadmap. The roadmap can be guided by the following steps [14]:

1. Strategic plan
   o Identify strategic priorities
2. Identify opportunities
   o Brainstorm and ask crunchy questions
3. Determine and assess data sources
   o Data and applications landscape, including archives
   o Analytics and BI capabilities, including skills
   o New technology adoption
   o IT strategy, priorities, policies, budget and investments
   o Current projects
   o Current data, analytics and BI problems
4. Identify/define use cases
   o Based on the assessments and business priorities, identify and prioritize big data use cases
5. Pilots and prototypes
   o Identify tools, technologies and processes for the use cases and implement pilots and prototypes
6. Adopt in production
   o Prioritize and implement successful high-value initiatives in production


7. Challenges of Big Data


7.1 Technical

1. Scarce availability of data scientists and people with expertise in mathematics, statistics, data engineering, pattern recognition, advanced computing, visualization and modelling to handle and analyse big data [8].

2. The majority of statisticians are familiar with traditional statistical techniques. It will therefore be a great challenge to retrain them to develop the skills required for big data.

3. Determining how to get value from big data. The nature of big data implies a high and increasing noise-to-signal ratio over time, distorting the accuracy of expected results. Because of this, retrieving useful information is more complex than with traditional statistical techniques.

4. Cost and effort associated with acquiring a new set of tools and technologies to interact with extreme volumes and varieties of data formats.

5. Cost associated with maintaining data quality with regard to completeness, validity, integrity, consistency, timeliness and accuracy.

6. Legacy frameworks and disparate and third-party systems that are difficult to incorporate pose additional challenges in implementing big data.

7. Information gaps when collecting financial statements from other firms. The lack of a framework that enforces the delivery of granular information from firms can undermine the results.

8. Lack of data harmonization can jeopardize the utility of the data, as the data will sit in separate, disparate units, which makes the process of gathering insights from the data much more difficult.

7.2 Legal

9. Re-evaluation of internal and external data policies and the regulatory environment.


10. The use of big data can be limited or complicated by laws that regulate the use of certain types of data.

7.3 Privacy and Security

11. Working with huge data volume, variety and complexity can increase the risk of data breaches and can thereby pose a threat to privacy. To make things worse, traditional security mechanisms such as firewalls and demilitarized zones may not be suitable for protecting the big data environment.

8. Conclusions and Recommendations

Big data can be leveraged to provide central banks with better informed decisions for macroeconomic and financial stability purposes. Because of this, we believe that it is just a matter of time before big data exploitation takes place in the central banks. The central banks of SADC have an interest in clarifying the benefits of using big data and the process of collecting, organizing and analysing huge volumes of data, and in managing all the associated challenges that go along with it. Cooperation between the central banks of SADC can respond to these needs by taking into account the following actions:

  • Creation of a strong big data roadmap in order to establish the foundation and structure for the successful usage and exploitation of big data.

  • Creation of a data lab with all the necessary resources (computers, software, IT experts) as a way to help the organization and/or counterparties collect, organise and analyse huge volumes of data, both structured and unstructured. The lab can also serve to create small-scale big data prototypes that meet the business goals, which can later move into full-fledged big data solutions.

  • Participation and training in big data workshops, mainly focused on the use of big data in central banks.

  • Establishment of a data harmonization framework that applies across the organization, central banks and financial firms.

9. References


1. Press, G. (2013). A Very Short History of Big Data. Forbes, web page, available at https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/#51bd5bb865a1

2. Tissot, B., Hülagü, T., Nymand-Andersen, P., & Suarez, L. C. (2015). Central banks’ use of and interest in “big data”. Bank for International Settlements.

3. Davenport, T. (2014). Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Harvard Business Review Press, Boston.

4. Merrouche, O., & Schanz, J. (2009). Banks’ intraday liquidity management during operational outages: theory and evidence from the UK payment system. Bank of England Working Paper No. 370, available at http://www.bankofengland.co.uk/research/Documents/workingpapers/2009/wp370.pdf

5. IBM (2016). The Four V’s of Big Data. Web page, available at http://www.ibmbigdatahub.com/infographic/four-vs-big-data

6. BBVA (2017). The Five V’s of Big Data. Web page, available at https://www.bbva.com/en/five-vs-big-data/

7. EUCLID. Chapter 6: Scaling up Linked Data. Web page, available at http://euclid-project.eu/modules/chapter6.html

8. Poynter, R. (2013). Big data successes and limitations: what researchers and marketers need to know. Web page, available at http://www.visioncritical.com/blog/big-data-successes-and-limitations. Accessed 14 Jul 2017.

9. Choi, H., & Varian, H. (2011). Predicting the Present with Google Trends. Available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.2435&rep=rep1&type=pdf

10. Carriero, A., Clark, T. E., & Marcellino, M. (2012). Real-time nowcasting with a Bayesian mixed frequency model with stochastic volatility. Working Paper No. 1227, Federal Reserve Bank of Cleveland, available at https://www.ecb.europa.eu/events/pdf/conferences/140407/MarcellinoReal_TimeNowcastingWithABayesianMixedFrequencyModelWithStochasticVolatility.pdf?1349ebe7044626a4f953406dac102015

11. Casey, M. (2014). Emerging Opportunities and Challenges with Central Bank Data. Presentation slides, available at https://www.ecb.europa.eu/events/pdf/conferences/141015/presentations/Emerging_opportunities_and_chalenges_with_Central_Bank_data-presentation.pdf?6074ecbc2e58152dd41a9543b1442849

12. Lohr, S. (2013). The Origins of ‘Big Data’: An Etymological Detective Story. Web page, available at https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/

13. Press, G. (2016). Top 10 Hot Big Data Technologies. Web page, available at https://www.forbes.com/sites/gilpress/2016/03/14/top-10-hot-big-data-technologies/#750957af65d7

14. Deloitte (2013). Big Data Challenges and Success Factors. Presentation slides, available at https://www2.deloitte.com/content/dam/Deloitte/it/Documents/deloitte-analytics/bigdata_challenges_success_factors.pdf


10. Annexures

Annex 1 – Hadoop Components & Ecosystem


Annex 2 – Oracle Integrated Solution for big data and Oracle Big Data Platform
