The use of Big Data Analytics to protect

Critical Information Infrastructures from

Cyber-attacks

Thomas Oseku-Afful

Information Security, masters level

2016

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering

Abstract

Unfortunately, cyber-attacks, which are a consequence of our increasing dependence on

digital technology, are a phenomenon that we have to live with today. As technology has become

more advanced and complex, so have the types of malware that are used in these cyber-attacks.

Currently, targeted cyber-attacks directed at CIIs such as financial institutions and

telecom companies are on the rise. A particular group of malware known as APTs, which is

used for targeted attacks, is very difficult to detect and prevent due to its sophisticated

and stealthy nature. Such malware is able to attack and wreak havoc (in the targeted

system) within a matter of seconds; this is very worrying because traditional cyber security

defence systems cannot handle these attacks. The solution, as proposed by some in the

industry, is the use of BDA systems. However, whilst it appears that BDA has achieved greater

success at large companies, little is known about success at smaller companies. Also, there is

a scarcity of research addressing how BDA is deployed for the purpose of detecting and

preventing cyber-attacks on CII. This research examines and discusses the effectiveness of the

use of BDA for detecting cyber-attacks and also describes how such a system is deployed. To

establish the effectiveness of using a BDA, a survey by questionnaire was conducted. The

target audience of the survey consisted of large corporations that were likely to use such systems for

cyber security. The research concludes that a BDA system is indeed a powerful and effective

tool, and currently the best method for protecting CIIs against the range of stealthy cyber-

attacks. Also, a description of how such a system is deployed is abstracted into a model of

meaningful practice.

Tom Oseku-Afful/ MSc Information Security: Thesis

LTU/Department of Computer Science and Space Engineering 1

Table of Contents

1.0 Introduction

2.0 Background theories and concepts

2.1 CII versus CI

2.2 Interdependency

2.3 Big Data

2.4 Big data analytics

2.5 Big data technologies

2.6 The cyber-attack landscape

2.7 Threat detection with BDA

3.0 The Literature review

3.1 Scope

3.2 Conceptualisation of research topic

3.3 The review method

3.4 Review analysis and synthesis

3.4.1 Part I: big data analytics for cyber security

3.4.2 Part II: models for protecting CIIs

3.4.3 Part III: big data analytics for cyber security in CII

3.5 Literature review conclusion

3.5.1 Research gaps

3.5.2 Research question(s)

4.0 Research Methodology

4.1 Justification

4.2 Questionnaire design

4.2.1 Target audience and sample size

4.2.2 Data collection

5.0 The Results

5.1 Analysis

5.2 The model

6.0 Conclusion

6.1 Research limitations

6.2 Future research

7.0 References



Acknowledgements

I would like to thank Ahmed Elragal, my supervisor, for his expert feedback, patience and advice.

This thesis would not have been possible without his astute guidance.

I have thoroughly enjoyed the entire course of information security and this is down to the

experience and expert knowledge of all the lecturers but especially, Devinder Thapa, Dan

Harnesk, Tero Paivarinta and Todd Booth. Please keep inspiring your students with your wit,

intelligence and experience.

Mum, your interest in and questions about the various topics of this course, as well as

your encouragement kept me going. Thanks so much for your support.

To my loving and lovely wife Nana, you know I could not have done this without your tireless

support and encouragement. You are my inspiration.

And to Reuben my dear son, this work is dedicated to you and your baby brother.

Your Grace and Mercy is indeed what has brought me through. Thank you LORD!



Acronyms and abbreviations

AI

Artificial Intelligence

APT

Advanced Persistent Threats

BDA

Big Data Analytics

CI

Critical Infrastructure

CII

Critical Information Infrastructure

DiD

Defence in Depth

DOJ

Department of Justice

FIM

File Integrity Monitoring

HDFS

Hadoop Distributed File System

IDS

Intrusion Detection System

IDPS

Intrusion Detection and Prevention System

IPS

Intrusion Prevention System

IoT

Internet of Things

SIEM

Security Information and Event Management



List of tables and figures

Tables

Table 1 - journal database search

Table 2 - selected research papers for review

Table 3 - parameters of the questionnaire with justification

Table 4 - Frequency analysis for Q1

Table 10 - Frequency analysis for Q7

Table 13 - Frequency analysis for Q10







Figures

Figure 1 - the three V's of big data compared to traditional data

Figure 2 - big data classification into five categories

Figure 3 - 'mining' valuable information using a big data technology

Figure 4 - method 1: making existing detection system intelligent with big data

Figure 5 - method 2: using internal and external data for the analytic process

Figure 6 - method 3: streaming data from external sources for the analytics

Figure 7 - mathematical equation for calculating margin of error (ME)

Figure 8 - mathematical equation for calculating sample size n, derived from the ME equation

Figure 9 - a derived model showing how a BDA security system can be deployed



Charts

Chart 1: this chart shows the percentage distribution of the responses to question 1.

Chart 10: this chart shows the percentage distribution of the responses to question 10.









1.0 Introduction

In 2014, the initial discovery of a successful cyber-attack against a major US bank, JPMorgan

Chase, sent waves of fear and worry across Wall Street and triggered an investigation by the

FBI (Silver-Greenberg, Goldstein and Perlroth, 2016). In the news article, it was mentioned

that the hack-attack, which was launched from overseas, was not discovered until after a

month had passed; this was in spite of the fact that the bank had at that time fortified its

defences and employed the services of a top cyber security firm. Also, the article mentioned

that given the level of sophistication of the attack, and the apparent lack of profit (no

customer financial information was stolen) from the hackers, it is likely to have been

sponsored by a government; as a matter of fact, law enforcement officials and cyber security

experts believed that it may have been sponsored by elements of a particular foreign country.

This story raises a question: why did the attack cause the whole of Wall Street to worry?

An answer to this can be found in another news article by Rushton (2014). Quoting statements

from the superintendent of the New York State Department of Financial services, and the

head of policy of the City of London, Rushton discusses the possibility of a cyber-attack

capable of triggering a global financial crisis. She paints a picture of a national bank

“disappearing” as a result of customer savings being wiped out, and accounts being deleted in

the event of a cyber-attack. Examining the JPMorgan hack-attack, it is quite startling to see

that such a scenario (as described by Rushton) could have easily been the case. It is not hard

to imagine that an attack (from another country) such as this (where a national bank is

brought to its knees) would not only cause a financial crisis but would also be an act of war.

Perhaps, this is the reason why the country that launched the attack refrained from causing

further harm.

Earlier this year, BT, the telecoms company that owns and maintains the physical

infrastructure that makes up the UK’s broadband network, experienced an outage of part

of its broadband services, which resulted in hundreds of thousands of customers (including

businesses) losing their Internet and phone connections for about two

hours (Williams, 2016). According to the news article, this was the biggest and most

widespread network failure in years. Although the company denies it, and blames this

network failure on a faulty router, it has been suggested that it was the result of a cyber-

attack. Regardless of whether BT’s reason (of a faulty router) for the outage is true or not, the

point is that it is clearly possible to launch a cyber-attack that can take down an organisation’s

infrastructure. Suppose this was indeed a cyber-attack, and it affected more routers and

endured for days and not hours; can one imagine the negative impact it would have had on

the nation’s economy, not to mention the lives it might affect, as emergency services lose

communication?

In both of these scenarios (the JPMorgan and BT cases), the operations of the organisations

are so intertwined with other organisations – in their respective nations – that their failure

will inevitably trigger a domino effect causing these other organisations also to fail. This



makes the running of the technology infrastructure that supports these organisations critical.

Therefore, the protection of such infrastructures, commonly referred to as critical

information infrastructures, is considered a matter of national security.

Today, to say that cyber-attacks are widespread is a major understatement. With society’s

increasing dependence on technology, and the advent of the Internet of Things (IoT), cyber-

attacks will get worse and remain an unfortunate consequence of human activities. Just as

human viruses and bacteria are organisms that coexist with humans, so malware coexists with

computer systems. And just as there is a concerted effort to handle the threat of

drug-resistant bacteria, so the cyber security community must strive to find solutions to the

increasing threat of the dangerous malware that is being propagated in the cybersphere.

In the JPMorgan hack-attack, it took about a whole month for the attack to be detected. This

problem is not an isolated one. In the highly publicised Sony cyber-attack, which happened in

the same year as the JPMorgan case, it also took a while for the breach to be discovered. In

fact, it is still unclear as to how long it took, but some sources say it may have lasted for as

long as a year (Zetter, 2014). Another high-profile attack that took the organisation

quite a while to notice is the hacking of the retail giant Target. The remarkable element about

this cyber-attack is that although the attack was detected by the security detection systems

the organisation had in place, the personnel somehow missed it and only reacted when they

were alerted by the Department of Justice (Finkle and Heavy, 2014).

The fact that these attacks were not discovered until it was too late shows the level of

sophistication of the malware out there. According to a Sophos Security Threat Report

(Eschelbeck, 2014), modern malware has evolved into something more sinister,

sophisticated and stealthy, such as that used for Advanced Persistent Threat (APT)

attacks; and these forms of attack have been on the rise in recent years. The report also mentions

that APTs are perpetrated by highly knowledgeable, skilful and motivated individuals who are

very organized and well-funded and, therefore, it is becoming increasingly hard to detect

and defend against such attacks.

Another report points out that cyber-attacks against critical infrastructure systems are on the

rise, and in some cases the objective is to destroy rather than steal data (Trend Micro, 2015).

Although the report focuses on the Americas, a case can be made for the fact that this is a

global phenomenon. The report also mentions that the nature of the malware used in these

attacks shows a noticeable increase in the knowledge of critical infrastructure systems

(especially SCADA systems). In other words, the cyber criminals are becoming more

knowledgeable and sophisticated, and this shows in the lethal malware they are using to

attack organisations. Also, the attacks are directed against specific critical information

organisations.

So how does one defend against such attacks? Could big data analytics (BDA) be the answer?

In a press release in 2014, a prediction was made that by 2016, large corporations would turn to

big data analytics for cyber detection and defence (Rivera, 2014). In the press release, it is



mentioned that big data analytics will allow the organisations that use it to see a bigger and

clearer picture of threats, and therefore be able to detect and prevent these attacks from

happening. Also, it is mentioned (in the press release) that the technology (i.e. big data

analytics) is still in its infancy, and its effectiveness has barely been proven.

Hence, in this research, a study of the effectiveness of big data analytics – for cyber-attack

detection – will be carried out. This will be done by examining the success rate (via a survey

by questionnaire) in using the technology for the detection of sophisticated and stealthy

cyber-attacks such as APTs. Given that stealthy malware is designed to go undetected

and an attack can compromise a computer system in a matter of seconds (Brewer, 2015), the

word “effectiveness” is defined here by two criteria: (1) a speed of detection within seconds, minutes

or hours, but no more than a day, as even a day might be too late; and (2) the ability to detect stealth

attacks significantly more often than not, that is, a success rate of at least 75%.
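
The two criteria in this definition can be expressed as a simple check. The following is a minimal sketch only: the function name, its parameters and the way the success rate is computed (detected attacks divided by attempted attacks) are assumptions made for illustration and are not part of the survey instrument.

```python
from datetime import timedelta

# Thresholds taken from the definition of "effectiveness" in the text.
MAX_DETECTION_TIME = timedelta(days=1)  # detection must take no more than a day
MIN_SUCCESS_RATE = 0.75                 # at least 75% of stealth attacks detected


def is_effective(detection_time: timedelta, detected: int, attempted: int) -> bool:
    """Return True if both criteria hold (hypothetical helper for illustration)."""
    if attempted <= 0:
        raise ValueError("attempted must be a positive number of attacks")
    success_rate = detected / attempted
    return detection_time <= MAX_DETECTION_TIME and success_rate >= MIN_SUCCESS_RATE


if __name__ == "__main__":
    # Detected within minutes, 8 of 10 stealth attacks caught: meets both criteria.
    print(is_effective(timedelta(minutes=5), detected=8, attempted=10))
    # Detected after a month (as in the JPMorgan case): fails the speed criterion.
    print(is_effective(timedelta(days=30), detected=9, attempted=10))
```

Under this reading, a system must satisfy both conditions at once; a high detection rate does not compensate for detection that arrives after a day has passed.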

Also, a model of meaningful practice of how BDA is deployed (for cyber security) will be

developed. This model will be based on how various organisations have used big data

analytics to achieve success in detecting and quelling cyber-attacks. In other words, questions

about the kind of big data analytics operation these organisations use for the detection of

stealthy attacks will be asked: perhaps their success arises because the big data

analytics tool is combined with the organisation’s traditional cyber security detection

systems; or maybe the traditional systems are completely replaced with the big data analytics

tool.

The main contribution of this research is to shed more light, albeit a little, on the relatively

complex area of applying big data analytics in cyber security. In addition, this model of

meaningful practice can serve as an initial guide (of how a BDA system can be deployed) for

new organisations that wish to embrace this technology for cyber security.



2.0 Background theories and concepts

This section presents a discussion of the concepts, technologies and cyber threats that

underpin this research. Although the study is primarily about the detection and prevention

of cyber-attacks using big data analytics, it specifically addresses how big data analytics can

be used for the protection of critical information infrastructure (CII) organisations from cyber-

attacks. There are different types of cyber threats that these organisations face, but the main

threat of concern is the use of stealthy malware, which is notoriously difficult to detect and

prevent. Therefore, explaining what these background concepts are is necessary and relevant.

2.1 CII versus CI

A search (in any journal database or indeed a search engine such as Google) for critical

information infrastructures (CII) will invariably churn out results about critical infrastructures

(CI) as well. In fact, CIs and SCADA systems dominate the search results. It seems the two

concepts (CI and CII) are treated as one and the same. Therefore, a clarification will be useful.

In other words, what is the difference between a CII and a CI?

To address this apparent confusion of the two concepts, the explanation offered by Cavelty

and Suter (2012) makes sense. They explain that a CI consists of all critical sectors of a nation’s

infrastructure of which a CII is a subset. They describe a critical information infrastructure as

a system which is part of a global or national information infrastructure that is essential for

the continuity of critical services, which includes banking and finance, hospitals and utility

services. There are two sides to it: a physical component side which includes equipment such

as high speed networks, satellites, wireless communication networks, televisions, radios,

phones and computers; and an immaterial side which is the information and content that is

stored on and flows through it (the physical component). Typical examples of a CII are a

datacentre that serves millions of people and a financial system such as a stock market.

This research will focus on CII and not necessarily CI, although it must be mentioned that some

strategies for protecting CIIs will definitely overlap with the protection of CIs as well. It must

also be mentioned that the protection of a CII is considered a matter of national security.

2.2 Interdependency

Given the definition of a critical information infrastructure, it is quite obvious that a CII for

any nation serves an important purpose for that nation’s development, but what is the big

deal if say, a high speed network or a satellite is taken off line by accident or design? After all,

people probably will not die and though it will be expensive, these infrastructures can be

rebuilt, one might say. So what makes their (CII) existence so critical that it must be protected



as a matter of highest priority? In other words, why is their protection considered a matter of

national security?

To answer this question fully, one must understand the concept of interdependency.

Essentially, this means that critical information infrastructures do not exist in a vacuum but

they are connected to a range of other important infrastructures and services. This means

that a failure in one of them affects the others. Rinaldi, Peerenboom and Kelly (2001) explain

this idea of interdependency by looking at the case of the Galaxy 4 telecommunications

satellite failure in 1998, which resulted in a loss of pager service to about 45 million customers

(Rosenbush, 1998). At the time, people (especially in the US) relied on pagers for

communication. Rinaldi, Peerenboom and Kelly add that the event caused many

disruptions including the disruption of important financial transactions such as card

purchases, and also threatened lives by disrupting communications between doctors and

emergency services. This shows that the failure of one CII can cause a domino effect to other

CIIs and CIs. If the failure of a relatively simple communication device such as a pager can

cause such serious disruptions (including threats to human lives), can one imagine the scale

of the disruptions that can arise if the CII (as in the BT case mentioned in the introduction)

that supports the use of current communication devices such as smartphones, tablets and a

host of other systems is taken down? Even worse, the scale of these potential disruptions

expands exponentially when one factors in emerging technologies such as the Internet of

Things (IoT), which is fostering more dependency on technology, by both individuals and

organisations, for their respective daily activities and operations.

It must be mentioned here that the increasing dependence on the use of technology (by

people and organisations), and the pervasiveness of the IoT have added more fuel to the

phenomenon called Big Data.

2.3 Big Data

According to Cisco, global IP traffic will exceed 1000 exabytes (i.e. 1 zettabyte) by the year

2016 (Cisco, 2015). To put the quantity of data being discussed here into perspective, 1

zettabyte of data is the same volume as the Great Wall of China (Arthur, 2011). This explosion

of data is what is referred to as big data.

However, big data is not just about volume. It is also about variety and velocity. Variety refers

to a range of different types or formats of data such as video, audio, images, text messages

and email, as well as sensor and machine generated data. Velocity refers to the speed

(including real time) by which these data are generated, processed and transmitted.

Therefore, though there are other characteristics, big data is essentially characterized by the

so-called “three Vs” – volume, variety and velocity (Gartner, 2012).

Due to the nature of big data, it cannot be easily categorised and organised into a traditional

database. Figure 1 (Cloud Security Alliance, 2013) gives a good illustration of the three Vs of



big data by contrasting it with data that is traditionally used and processed in relational

database management systems.

[Figure 1: three panels contrasting traditional data with big data. Volume: traditional data versus big data. Velocity: traditional data versus big data. Variety: traditional data (homogeneous and structured) versus big data (heterogeneous and unstructured).]

Figure 1 - the three V's of big data compared to traditional data



Figure 2 - big data classification into five categories

A further explanation of the nature of big data is presented by Hashem et al. (2015). They

explain that big data can be classified along five aspects: data sources,

content format, data stores, data staging and data processing. This categorisation is aptly

depicted in a diagram (Hashem et al., 2015), and an adaptation and more visual version of

this diagram is presented in Figure 2.

Today, big data has become very important to organizations because of the wealth of

information that can be gleaned from it. The “mining” of big data for relevant information is

referred to as big data analytics.

2.4 Big data analytics

Big data analytics is quickly becoming an indispensable tool in our increasingly digitised

society. It is used not only by big corporations to aid decision making, but also in many other

disciplines including artificial intelligence (AI), health related research and information

security.

McAfee and Brynjolfsson (2012) explain the importance of big data analytics by giving an

example in which real-time location data from users’ smartphones were used to determine

how many shoppers were in Macy’s parking lots on Black Friday, at the start of the Christmas

shopping season in the US. This data then allowed analysts to estimate the retailer’s sales

even before the actual sales had been recorded.



Also, combined with machine-learning algorithms, big data has been used to create artificial

intelligence (AI) systems that perform tasks which in the past only humans

could do. A typical example is IBM’s Watson, which beat the best minds at the game of

Jeopardy! in 2011 (Ferrucci, 2012). Another example of machine learning is driverless cars;

although these cars have not yet surpassed humans, tests (on certain selected roads) show

that these machines have mastered the complex art of driving (Gibbs, 2014).

The point here is that big data analytics can provide a powerful tool for organisations to make

smarter and better decisions because it can give a better picture of any particular event even

before it happens. This makes big data analytics a perfect and potent tool for the detection

of cyber-attacks. Tankard (2012) suggested how this application - of using big data analytics to

detect cyber-attacks - can be done when he discussed the advantages of big data. He explains

that organizations can mine the huge amounts of data they have been collecting for potential

cyber security events such as malware and phishing attempts.

To conduct big data analytics, a range of technologies is used. The general

consensus amongst most experts in this field is that the whole big data phenomenon is still in its early

days, and it seems this is supported by the raft of changing technologies – including storage

applications, machine-learning algorithms for analytics and user interfaces – that are

appearing on the market today.

2.5 Big data technologies

In previous sections, it was established that, like any form of analytics, big data analytics

involves the careful examination of the (big) data in order to obtain meaningful and useful

information. However, given its nature, it is obvious that any kind of examination of big data

will be neither easy nor straightforward. Therefore, innovative solutions have been required

to make the mining of meaningful information from big data less challenging and as easy as

possible. Given the rise in the number of technologies (for big data analytics) on the market

today, it seems a lot of progress is being made.

In this section, the general structure of a big data technology or system is described. Just as

big data is complex and diverse, so are its current technologies. Elgendy and Elragal (2014)

discuss a framework – dubbed B-DAD – which makes it easier for one to understand the

general structure of big data technologies. This framework consists of three main areas:

storage and architecture; data and analytic processing; and the analyses (results).

Like Elgendy and Elragal (2014), Hu et al. (2014) use a similar three-layered architecture

model to describe big data systems, which consists of an infrastructure layer, a computing

layer, and an application layer. The infrastructure layer consists of a network of storage

systems enabled by cloud computing and virtualization. In other words, the infrastructure

layer consists of a system of distributed hardware that is used to store the big data. This

means that the data might not necessarily be found in one particular location or system, but



spread across multiple servers in different locations. The computing layer, also known as the

middleware layer with respect to this three-layered model, consists of the software tools that

are used for the management and integration of the data. The application layer consists of

the application or software that is used for implementing the required data analytics. A more

graphical variation of the illustration used by Hu et al. (2014), to describe their model, is

presented in Figure 3.

Figure 3 - 'mining' valuable information using a big data technology

Although the infrastructure layer is indispensable and big data analytics is possible today

because of advancements in storage capacity and micro-processors, big data technologies

usually comprise the computing and application layers. At the application layer, a big data

system can be classified into two main groups; one may recall that one of the five categories

of big data (as shown in Figure 2) is about data processing of which there are two main types,

batch and real time (or stream). Therefore, based on these two types of processing, big data

technologies are grouped into batch and stream. Batch processing is when the analytics is

performed on data at rest; and stream processing is when the analytics is carried out on data

in motion (Cloud Security Alliance, 2013).
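
The distinction between data at rest and data in motion can be illustrated with a small sketch. The example below is not taken from any of the cited technologies; the event layout (source address, whether a login failed), the threshold of three failures and both function names are assumptions chosen purely for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical event record: (source address, login_failed).
Event = Tuple[str, bool]
THRESHOLD = 3  # illustrative number of failed logins that triggers an alert


def batch_analyse(stored_events: Iterable[Event]) -> Set[str]:
    """Batch: the whole stored data set (data at rest) is analysed in one pass."""
    failures = Counter(src for src, failed in stored_events if failed)
    return {src for src, count in failures.items() if count >= THRESHOLD}


def stream_analyse(incoming: Iterable[Event]) -> Iterator[str]:
    """Stream: each event (data in motion) is analysed as soon as it arrives."""
    failures: Counter = Counter()
    for src, failed in incoming:
        if failed:
            failures[src] += 1
            if failures[src] == THRESHOLD:
                yield src  # alert immediately, without waiting for the rest


if __name__ == "__main__":
    events = [
        ("10.0.0.5", True), ("10.0.0.9", True), ("10.0.0.5", True),
        ("10.0.0.5", False), ("10.0.0.5", True), ("10.0.0.9", False),
    ]
    print(batch_analyse(events))          # result available only after all data is stored
    print(list(stream_analyse(events)))   # alert raised at the third failure
```

Both functions flag the same source here; the difference lies in when the result becomes available, which is the property that matters for attacks that unfold within seconds.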

As in any traditional batch processing system, time is not necessarily of the essence in batch

analytics. Thus, in batch analytics, data is stored over a period of time before it is analysed. A

typical example of a batch processing big data technology is Hadoop (Cloud Security Alliance,

2013). Hadoop (or Apache Hadoop software library, as it is formally called) is a framework

that enables the processing of large volumes of data across clusters of distributed systems in

a simple manner. It can operate on a single server or scale horizontally to cover thousands of

servers, with each of them providing their own local storage and processing. The framework

or ecosystem consists of four modules including the Hadoop Distributed File System (HDFS)



and Hadoop MapReduce. HDFS is used for the storage and management of application data,

and MapReduce is used for parallel processing of large volumes of data. The other two

modules are Hadoop Common and Hadoop YARN (Hadoop.Apache.org, 2016).
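The MapReduce idea of splitting work into a map step and a reduce step can be sketched in plain Python. The sketch below is an illustration only: it does not use Hadoop's own API, it runs in a single process rather than across a cluster, and the log format (`<source-ip> <action>`) is a hypothetical example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical log lines in the form "<source-ip> <action>".
LOG_LINES = [
    "10.0.0.1 login_failed",
    "10.0.0.2 login_ok",
    "10.0.0.1 login_failed",
    "10.0.0.3 login_failed",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (source-ip, 1) for every failed login in one input line."""
    ip, action = line.split()
    if action == "login_failed":
        yield ip, 1


def shuffle_phase(pairs: Iterator[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


mapped = (pair for line in LOG_LINES for pair in map_phase(line))
failed_per_ip = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
```

In a real cluster the map and reduce calls would be distributed over many nodes, each working on its own portion of the data; here they simply run one after the other.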

Other technologies that take advantage of the Hadoop framework to perform different tasks

include Pig, Mahout, Spark, Hive, HBase and Cassandra. Pig is a high-level language used for

performing data analysis; Mahout is used for generating machine learning and data mining

algorithms; the key functionalities of Spark are machine-learning, stream processing and

graph computation; and like HDFS, Hive, HBase and Cassandra are used for storing large

volumes of data (Hadoop.Apache.org, 2016).

Unlike batch analytics, time is absolutely important in stream analytics. This is based on the

idea that the value of certain data (such as instant messages, for example) is bound by its

velocity. Hence, in stream analytics, data is analysed as soon as it arrives. Thus, another name

for stream analytics is real-time. Stream analytics is usually used for online applications where

data arrives within seconds and/or milliseconds (Hu et al., 2014). One of the most talked

about technologies for stream analytics is Apache Storm. It is an open source real-time

computation system, which is easy to use, reliable, and compatible with any programming

language (Storm.Apache.org, 2015). Other big data technologies for stream analytics are

Apache Spark and Apache Kafka, which are quickly becoming the analytics engine of choice

for most organisations.

It must be mentioned here that a key problem of big data technologies, that still exists, is the

pre-processing stage, where the data in its raw state is transformed into a more

understandable format, using techniques such as cleansing, integration, transformation and

reduction. Therefore, the effectiveness of a big data technology can be determined by the

pre-processing techniques it uses (Tsai et al., 2015).
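As a rough illustration of the four pre-processing techniques named above, the following Python sketch applies cleansing, integration, transformation and reduction to a few log records. The record fields, the asset lookup table and the function names are assumptions made for this example and are not drawn from the cited paper.

```python
from typing import Dict, List, Optional

# Hypothetical raw records from two sources; None marks a missing value.
FIREWALL_LOG: List[Dict[str, Optional[str]]] = [
    {"ip": "10.0.0.1", "bytes": "1200"},
    {"ip": "10.0.0.2", "bytes": None},  # incomplete record
    {"ip": "10.0.0.1", "bytes": "800"},
]
ASSET_DB: Dict[str, str] = {"10.0.0.1": "payment-server"}


def cleanse(records):
    """Cleansing: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(records, assets):
    """Integration: enrich each record with data from a second source."""
    return [{**r, "host": assets.get(r["ip"], "unknown")} for r in records]


def transform(records):
    """Transformation: convert text fields into analysable types."""
    return [{**r, "bytes": int(r["bytes"])} for r in records]


def reduce_volume(records):
    """Reduction: aggregate the records to one total per host."""
    totals: Dict[str, int] = {}
    for r in records:
        totals[r["host"]] = totals.get(r["host"], 0) + r["bytes"]
    return totals


bytes_per_host = reduce_volume(transform(integrate(cleanse(FIREWALL_LOG), ASSET_DB)))
```

Each step is deliberately trivial; the point is only that the quality of the final result depends on every stage that precedes the analytics.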

2.6 The cyber-attack landscape

During a speech at a cyber security conference in 2012, the FBI Director made quite an

unusual statement, “I am convinced that there are only two types of companies: those that

have been hacked, and those that will be” (Mueller, 2012). Unfortunately, this is the sad

reality in today’s digitised world. Cyber-attacks have become so mundane that organisations

are not even surprised when they happen. According to a security report (Internet Security

Threat Report, 2016), more than 430 million new pieces of malware were discovered in 2015,

and what is even more remarkable is that this finding came as no surprise to the

researchers. The security report goes on to explain that targeted, sophisticated and persistent

attacks against government organisations and businesses of all sizes are on the rise and pose

a serious threat to national security and economy.

Cyber-attacks come in different shapes and forms – from common viruses to highly

sophisticated malware such as cyber-weapons. van Kessel and Allan (2014) provide a list of



the most likely sources of cyber-attacks and it includes the usual suspects of disgruntled

employees, hacktivists and criminal syndicates. In the information security field, it is almost

taken for granted, with good reason and evidence, that internal threats (from employee

activities) pose the greatest risk to the security of information in any organisation, but

this report (van Kessel and Allan, 2014) makes a startling statement which suggests that this

is changing. The report explains that the combined activities of external attackers are now

significantly more likely as a threat source than internal threats.

External attackers include criminal syndicates, state sponsored hackers, hacktivists and lone-

wolf hackers. The modus operandi of these attackers include the use of highly sophisticated

malware, which cannot be easily detected even with equally highly sophisticated security

systems (van Kessel and Allan, 2014). For example, in the Sony attack, it was reported that

the malware used could have slipped past most of the network defences out there today (Lee,

2014).

Since the advent of Stuxnet – a rather insidious and stealthy piece of malware that was used to attack

the Iranian nuclear facility – similar types of malware have emerged in cyberspace.

Known examples of these types of stealthy malware, apart from Stuxnet, include Duqu, Flame

and Red October, and are collectively called APTs (Virvilis and Gritzalis, 2013).

Virvilis and Gritzalis (2013) describe an APT as having the following general features: they are

usually directed at specific, high-value targets and are therefore built for particular operating

systems or platforms; they usually have an initial attack vector such as malicious office

documents or removable drives; they are equipped with a list of evasion techniques in order

to bypass anti-virus software and intrusion detection systems (IDS) using command and

control techniques; part of their evasive techniques include encryption of their network

traffic; and they use stolen but legitimate digital certificates, which fool the targeted systems into

treating them as safe.

These features (of an APT) make it very difficult for even hardened and sophisticated systems

to detect them. Thus their detection relies heavily on conducting manual investigations and

the expertise of human analysts (Cloud Security Alliance, 2013).

2.7 Threat detection with BDA

Traditionally, the range of systems for detecting and preventing cyber-attacks can be grouped

as follows: antivirus programs; network IDS/IPS; host IDS/IPS; network device events; logging;

FIM and whitelisting; and SIEM. Although these systems are useful in many ways, they are

proving to be largely ineffective against current types of stealthy cyber-attacks. This is

because, apart from operating independently from each other, these systems generate a

huge amount of data which is difficult and time consuming to analyse without the correct

tool; thus it is easy to miss key cyber-attack events (Shackleford, 2016).

This suggests that with the correct deployment of the right tool (such as a BDA technology),

which can sift through the data much quicker, these disparate systems can be made more



efficient and effective. Laitan (2014) suggests, in an example, that using a big data analytics

system, an organisation that employed about 35 staff to monitor 135,000 data loss prevention

(DLP) alerts per day, managed to reduce it considerably using a handful of staff.

So how does an organisation detect cyber-attacks or threats using a BDA system? Laitan

(2014) discusses three main approaches to deploying a big data analytics tool for cyber

security, which are based on the source of the data and the analytics setting (canned or ad

hoc). An adaptation of the illustration used for explaining these approaches or methods, as

presented in the report, is shown in figures 4, 5 and 6.

The first method, as described in the report (Laitan, 2014), involves making existing systems

– such as SIEM, DLP and DAP – more intelligent and less noisy so that only the most dangerous

cyber-attacks (e.g. APTs) are flagged and isolated. This means that the analytics setting of the

big data system used is usually canned. Also, the data to be used will be derived internally

from the organisation’s databases, servers and applications. This method is illustrated in

figure 4.

Figure 4 - method 1: making existing detection system intelligent with big data

In the second method, the data (for the analytics) is sourced from internal and external

sources (such as online and mobile activities), and the analytics setting is customised (or ad

hoc). This means the organisations can set their own search criteria, and in some of the big

data analytics systems used, searches for malicious activities can be performed ‘in google-like

fashion’. This method is illustrated in figure 5.



Figure 5 - method 2: using internal and external data for the analytic process

Figure 6 - method 3: streaming data from external sources for the analytics

In the third method (see figure 6), the analytics is performed mainly on external data about

threats and the activities of various bad actors. This means that the big data analytics system

is designed to comb through the Internet (both dark and public) for malicious activities against

organisations. Perhaps this is the reason why in the Target cyber-attack (see section 1), the

DOJ – using systems that search the Internet for cyber security events – was able to

detect the cyber-attack before Target became aware of it.

In essence, a big data technology for cyber security can be described as having the following

key features (Sullivan, 2016):

• It must have the ability to scale, as smoothly as possible, to accommodate the increasing

size of the security data being collected (from both internal and external sources) without

losing performance in its functionalities. This means that the analytics engine must be able

to handle the data as it scales horizontally across distributed storage systems. Also, the

storage systems it uses must be persistent with low data latency. In other words, the

database must be capable of keeping copies of the original data even after it has been

modified, and data access must be quick.



• It must have a reporting and visualization function which will allow the information (after

the analytics – canned or ad hoc) to be presented in a way that will be useful and

meaningful to security analysts.

• The source of the data (for the analytics) must be in context. In other words, analysing

weather data for cyber-attack events might not be a good idea. Using just any data might

result in higher than necessary false positives, or even worse, it might result in false

negatives. This means that the source of the data for the analytics is of absolute

importance.



3.0 The Literature review

In order to ensure that the review of literature for this research is rigorous and thorough, a

systematic approach was adopted; this is the framework of literature review proposed by

Vom Brocke et al. (2009). The framework suggests that a rigorous literature review should

consist of the following phases: scope definition; conceptualisation of the topic; literature

search; literature analysis and synthesis; and conclusion.

3.1 Scope

The scope for this research is based on the relevant topics associated with the study, and the

time frame for the selection of past research papers. The relevant topics are identified as “big

data analytics”, “cyber-attacks”, and “critical information infrastructure protection”; and the

time period is determined to be spanning the past five years. The choice of the time span (of

the past “five years”) is due to the fact that one of the key topics, big data analytics, is a

relatively new phenomenon and therefore, it is not likely for one to find any study (beyond

this period) that will be relevant (to this research).

3.2 Conceptualisation of research topic

Considering the research topic, the key terms for the basis for this research are identified to

be: big data analytics for cyber-attacks detection; models or methods for critical information

infrastructure protection; and using big data analytics for the detection of cyber-attacks in

critical information infrastructures.

3.3 The review method

The main approach to the literature search was keyword searches, using the key terms

identified during the “conceptualisation” phase, in a range of selected but relevant journal

databases including Elsevier Science Direct, IEEE Xplore, Emerald Journals and ACM Digital. In

other words, these key phrases were typed directly into these journal databases. The result

of this search is summarised in Table 1. One can observe (from Table 1) that this approach

alone did not yield enough relevant results. Therefore, to support this approach, public

databases such as Google Scholar proved to be very useful in that, given the relevant

keywords, it returned a list of research papers and their corresponding journals; this made it

easier to identify where to obtain the research paper of interest. For example, entering the

keywords, “Big data analytics for cyber security”, in Google Scholar yielded a list of research

papers accompanied by a brief description. This allowed for a quick read about the contents

of each paper, and then after selecting a paper of interest (e.g. “Big Data Analytics for cyber

security: a review of trends, techniques and tools”), the relevant journal database (IEEE) was



used to obtain it (when it was not readily available in Google Scholar). After obtaining the

particular paper, a backward and forward search was then conducted in order to obtain other

relevant paper(s).

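Database | Keywords | Fields | Hits | Reviewed
IEEE Xplore | Big data analytics for cyber security and/or cyber-attack detection | Any | 20 | 1
IEEE Xplore | Big data analytics for cyber security and/or cyber-attack detection | All | 0 | 0
IEEE Xplore | Models or methods for protecting critical information infrastructures | Any | 0 | 0
IEEE Xplore | Models or methods for protecting critical information infrastructures | All | 0 | 0
IEEE Xplore | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | Any | 40 | 1
IEEE Xplore | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | All | 0 | 0
Elsevier Science Direct | Big data analytics for cyber security and/or cyber-attack detection | Any | 75 | 2
Elsevier Science Direct | Big data analytics for cyber security and/or cyber-attack detection | All | 0 | 0
Elsevier Science Direct | Models or methods for protecting critical information infrastructures | Any | 150 | 0
Elsevier Science Direct | Models or methods for protecting critical information infrastructures | All | 0 | 0
Elsevier Science Direct | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | Any | 110 | 3
Elsevier Science Direct | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | All | 0 | 0
Emerald Journals | Big data analytics for cyber security and/or cyber-attack detection | Any | 1 | 0
Emerald Journals | Big data analytics for cyber security and/or cyber-attack detection | All | 0 | 0
Emerald Journals | Models or methods for protecting critical information infrastructures | Any | >1000 | 0
Emerald Journals | Models or methods for protecting critical information infrastructures | All | 0 | 0
Emerald Journals | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | Any | 0 | 0
Emerald Journals | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | All | 0 | 0
ACM Digital | Big data analytics for cyber security and/or cyber-attack detection | Any | 0 | 0
ACM Digital | Big data analytics for cyber security and/or cyber-attack detection | All | 0 | 0
ACM Digital | Models or methods for protecting critical information infrastructures | Any | 3 | 0
ACM Digital | Models or methods for protecting critical information infrastructures | All | 0 | 0
ACM Digital | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | Any | 0 | 0
ACM Digital | Using big data analytics for the detection of cyber-attacks in critical information infrastructures | All | 0 | 0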

Table 1 - journal database search

Note: the “Reviewed” column (in Table 1) does not mean that only that number of papers was

examined. It simply indicates the number of papers (from the respective database searches)

that were reviewed for this research. The final list of the relevant research papers for review

is presented in Table 2. They are grouped according to the concepts discussed in this

research.

Papers Concepts

A B C

Ahn, Kim and Chung (2013) x x

Gandomi and Haider (2014) x

Ghazal et al (2012) x

Constantine (2014) x

Hurst, Merabti and Fergus (2014) x

Dunn-Cavelty and Suter (2009) x

Mouton and Ellefsen (2013) x

Curry et al (2013) x

Tsai et al (2015) x

Laitan (2014) x

Ulltveit-Moe (2013) x

McLaughlin et al (2014) x

Ma, Smith and Skopik (2013) x

Tsegaye and Flowerday (2014) x

Cardenas, Manadhata and Rajan (2013) x

Ragupathi and Ragupathi (2014) x

Everett (2015) x

Tankard (2012) x

Chen, Chiang and Storey (2012) x

Slavakis, Giannakis and Mateos (2014) x x

Mahmood and Afzal (2013) x

Kambatla et al (2014) x

Puri and Dukatz (2015) x

Hipgrave (2013) x

Tannahill and Jamshidi (2013) x

Smith and Watson (2013) x x

Tasi et al (2015) x

Aniello et al (2014) x

Table 2 - selected research papers for review

Table 2 Key:

Concept A: big data analytics for cyber security and/or cyber-attack detection

Concept B: models or methods for protecting critical information infrastructures

Concept C: using big data analytics for the detection of cyber-attacks in critical information

infrastructures.



3.4 Review analysis and synthesis

This literature analysis focuses on the relevance of big data analytics to information security,

how it has been applied for the detection of cyber-attacks in information systems in general,

and in CIIs in particular. It is based on the three concepts (A, B, C) outlined in Table 2.

3.4.1 Part I: big data analytics for cyber security

The Information Security Forum (ISF) compiled a report about the potential positive impact

and improvement big data analytics can have on information security. The report states that

organisations need to move away from reacting to cyber-attack incidents and move towards

detecting and preventing such incidents (Information Security Forum, 2012). The report

concludes that though current big data analytics can be used to improve information security

by reducing risks and increasing agility, the technology is not quite mature in the information

security industry.

Tankard (2012) agrees with the observations made by the ISF (2012) and in his article about

big data and security he explains that, of the many advantages of big data analytics, the most

compelling is operational efficiency (for commercial organisations). One can argue that this

“operational efficiency” includes cyber-attack detection because keeping a system secure is

part of an organisation’s operations. He goes on to explain that apart from commercial

organisations, big data analytics can be useful to governments for the detection of threats

from foreign countries, terrorists, hacktivists and criminal elements in the real world and in

cyberspace. Essentially, Tankard’s argument is that because the information obtained from

big data analytics is of high value (to the organisation that has obtained it), it will inevitably

be a target for cyber-attacks, but the big data analytics itself can also be used to prevent such

attacks. In other words, big data analytics can be used by organisations to increase

productivity and at the same time make their systems more secure. He adds that, for big data

analytics to be effective, the security access controls should be moved away from the network

perimeter and closer to the data asset that needs protection.

Like Tankard (2012), Cardenas, Manadhata and Rajan (2013) extol the use of big data

analytics, but they focus more on its uses for cyber security. They explain that the idea of data

analysis for cyber-attack detection is not new: the information security community

has been monitoring network traffic and analysing system logs and other sources of data in

order to detect threats and malicious activities for more than a decade. However, they argue that big

data analytics has overcome many of the challenges that faced traditional data

analysis (for security) of network traffic, security logs, etc. One of these challenges

is the inability to perform long term and large scale analytics because it was not economically

feasible to keep large volumes of data for a long period. They explain that one of the main

impacts of big data technologies is the facilitation of the development of affordable

infrastructures – such as storage and maintenance - for security monitoring by various



industries, thus making it possible for large scale analytics to be carried out. However, they

argue that despite the significant promise of big data analytics for security, there are several

challenges, such as privacy laws, that can prevent this development from realising its true

potential if not addressed. They also caution that big data analytics is not a panacea for cyber

security and therefore, security specialists will have to continue researching new ways to curb

sophisticated attacks.

Examining the potential of big data analytics, Curry et al. (2013) predicted that it will change

the status quo of most information security products (i.e. network monitoring,

authentication, fraud detection, IDS, etc.) and will evolve to have advanced predictive and

real time features. They state that just as big data analytics has transformed the competitive

dynamics of commercial organisations, it will also make the information security sector

better. They explain that big data analytics is especially relevant because a system better than

the traditional ones is needed to defend against cyber attackers who are becoming more

sophisticated and are able to stage highly targeted and complex attacks. Also, attack surfaces

are becoming broader and more diverse as organisations dissolve network boundaries by

allowing data and application access through cloud services and mobile devices. Therefore,

they argue that only a system that is more agile and uses dynamic risk assessments and

analysis of big data will be enough to handle this situation.

Discussing the time taken for cyber-attacks to happen and the response times by

organisations, Brewer (2015) echoes this idea, by Curry et al. (2013), that the network

perimeter (for organisations) is becoming wider. Brewer goes as far as explaining that the

network perimeter does not exist at all in the current climate of computing, and therefore

insisting on blocking attacks at the perimeter is bound to fail. He also adds that the nature of

the cyber-attacks is stealthy and happens very rapidly (in minutes and seconds), making it

difficult to be prevented. He explains that many organisations operate in a mode where it

takes weeks or months to detect cyber-attacks. Therefore, he proposes a “fundamental shift”

from prevention to detection by using big data analytics to detect and respond to cyber-

attacks quickly (i.e. as they happen).

Mahmood and Afzal (2013) are so convinced that big data analytics is the solution to the

growing threat of cyber-attacks faced by organisations that, not only do they encourage

organisations to embrace it, they prescribe how it can be implemented. They, like Cardenas,

Manadhata and Rajan (2013), argue that the traditional security solutions are simply not good

enough, especially in this era of big data (where data can arrive and disappear in seconds) and

the ever increasing attack skills of cyber criminals. They explain that for a big data analytics

solution to be effective, the data (for the analytics) will have to be pooled from diverse

sources; the analytics engine must be sophisticated and cutting edge; and the user interface

must be interactive.

Ahn, Kim and Chung (2013) discuss the reason why traditional security solutions are simply

not good enough, and explain that these (traditional) security solutions, such as firewalls, anti-



viruses, intrusion detection and prevention systems, were based on signatures and

characteristics of known malware. Therefore, new malware (especially APTs) which had no

signatures were impossible to detect using these systems. To overcome this problem, they

propose a big data analytics-based model. Essentially, this model (they propose) consists of

three main features: data collection; data processing or analytics; and an alert and report

service. The data collection feature works by pooling data from traditional sources (including

existing databases and security logs) into one place. This data is then mined, using current

data mining techniques, for attack or abnormal behaviours; if any of these are detected, the

result is passed on to the alert and report service. The problem one may find with this solution

is that the data pool is still traditional – it does not seem to be “big”. Apart from using smart

analytics, a key aspect of big data analytics is that the data must be “big”; this means that the

data must be as diverse as it is extensive and also include ephemeral data.
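The three features of the model described above (data collection, analytics, and an alert and report service) can be sketched as a small Python pipeline. This is a minimal sketch under stated assumptions and not the authors' implementation: the event fields, the signature set and the simple z-score rule used to flag abnormal behaviour are all hypothetical stand-ins for the data mining techniques the paper refers to.

```python
from statistics import mean, pstdev
from typing import Dict, List

KNOWN_SIGNATURES = {"deadbeef"}  # hypothetical hashes of known malware


def collect(*sources: List[Dict]) -> List[Dict]:
    """Data collection: pool events from several sources into one list."""
    return [event for source in sources for event in source]


def analyse(events: List[Dict], z_limit: float = 2.0) -> List[Dict]:
    """Analytics: flag known signatures and statistically unusual volumes."""
    volumes = [e["bytes_out"] for e in events]
    mu, sigma = mean(volumes), pstdev(volumes)
    findings = []
    for e in events:
        if e["file_hash"] in KNOWN_SIGNATURES:
            findings.append({**e, "reason": "signature"})
        elif sigma > 0 and (e["bytes_out"] - mu) / sigma > z_limit:
            findings.append({**e, "reason": "anomaly"})
    return findings


def alert(findings: List[Dict]) -> List[str]:
    """Alert and report service: turn findings into readable messages."""
    return [f"{f['host']}: {f['reason']}" for f in findings]


# Hypothetical pooled data: six ordinary hosts, one known-bad hash,
# and one host with an unknown hash but unusually high outbound traffic.
proxy_log = [{"host": f"pc{i}", "file_hash": "aaaa", "bytes_out": 100} for i in range(6)]
endpoint_log = [
    {"host": "pc6", "file_hash": "deadbeef", "bytes_out": 100},
    {"host": "pc7", "file_hash": "bbbb", "bytes_out": 5000},
]
messages = alert(analyse(collect(proxy_log, endpoint_log)))
```

The host `pc7` carries no known signature, so a purely signature-based check would miss it; it is flagged here only because its behaviour deviates from the rest of the pooled data.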

Laitan (2014) makes the case for the use of big data analytics by explaining that the value of

information needed to detect cyber-attacks is time dependent and in certain situations, a

second might be too long. Therefore, a system such as big data analytics that can provide

faster access to an organisation’s own information as well as relevant external information is

much desired. She explains that in the past, cyber-attackers would conduct a careful and

extensive reconnaissance of a target before attacking it, but they are not doing that anymore

because of the stiff resistance that has been mounted by most organizations. So to overcome

these resistances, the cyber-attackers attack directly without studying their target for too

long. Also, cyber-attackers are becoming more knowledgeable and there are more of them

attacking organisations.

To deal with these threats, organisations have in the past used several isolated monitoring

and detection systems that have been tuned for different scenarios such as user access, data

loss and financial fraud. This (traditional) approach generated a lot of “noise” and false alarms

in the systems. With big data analytics, Laitan (2014) explains that organisations can: reduce

the noise and false alarms in the existing monitoring systems by supplying them with relevant

data and smarter analytics; pool their internal data and relevant external data into one place

so that they can look for known patterns of cyber-attacks and unusual activities; and remain

agile and stay ahead of the cyber-attackers. Laitan adds that despite the benefits of big data

analytics, it is still early days and most organisations do not have the skills and abilities to

adopt the technology.
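One way to picture how pooling data from isolated monitoring systems might reduce noise is the following Python sketch. It is an illustrative assumption rather than a method described in the report: alerts from separate systems are pooled, and a user is escalated only when at least two distinct systems have flagged them. The system names, users and the `min_systems` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical alerts as (monitoring system, user) pairs.
ALERTS: List[Tuple[str, str]] = [
    ("user_access", "alice"),
    ("data_loss", "bob"),
    ("data_loss", "alice"),
    ("financial_fraud", "carol"),
    ("user_access", "alice"),
]


def correlate(alerts: List[Tuple[str, str]], min_systems: int = 2) -> List[str]:
    """Escalate only users flagged by at least `min_systems` distinct systems."""
    systems_per_user: Dict[str, Set[str]] = defaultdict(set)
    for system, user in alerts:
        systems_per_user[user].add(system)
    return sorted(u for u, s in systems_per_user.items() if len(s) >= min_systems)


escalated = correlate(ALERTS)
```

Handled in isolation, the five alerts would each require attention; once pooled, only one user stands out across more than one system.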

In her discussion about the benefits of big data analytics, Everett (2015) focuses on cyber

security and poses the question as to whether big data analytics is really the future “saviour”

of cyber security as some are saying, or whether it is the latest threat. To answer this question,

she presents quotes from interviews with several key security experts from various

organisations, and the consensus amongst these experts is that big data analytics is indeed

the way forward and not a threat to cyber security. The responses from these experts make

quite a compelling case for the relevance of big data analytics for the prevention of cyber-attacks.

Everett makes the point (through one of the experts interviewed) that the result of



an organisation’s data being spread everywhere – from the cloud to personal devices – means

it is difficult to identify where the network perimeter is and how to defend it. Therefore, the

challenge today is more about preventing access to the data itself and not necessarily about

preventing access to the network and the devices on it; this is what Tankard (2012) meant

when he said that the security access controls of big data analytics should be moved away

from the network perimeter and closer to the data asset. Everett adds that identifying breaches

to the network using traditional means is proving to be very difficult because determined

cyber-criminals are more likely to be using legitimate user credentials (probably obtained

through social engineering), which is why big data analytics is of immense benefit. Everett

goes on to explain that the benefit of big data is that it is not only capable of detecting threats

and behavioural inconsistencies in real time, but could also be used in the future to obtain the

required intelligence to trigger automatic responses to cyber-attacks. However, she adds that

despite the huge potential of big data analytics, it remains the choice of only a small number

of large organisations with a sophisticated security posture and until it is packaged as a

commodity product like anti-virus programs, smaller firms are unlikely to take it up. Like

Cardenas, Manadhata and Rajan (2013), Everett cautions that big data analytics must not be

seen as a silver bullet because the technology is still in its infancy and the expertise or

knowledge in this field is currently limited. A reason for this could be the fact that

organisations that are already using big data analytics are unwilling to share their experiences

because they are worried it might expose their vulnerabilities, she says.

Although the main objective of the paper by Verma et al. (2015) is to explain why cyber security

professionals and students should study data analytics, they raise some very important and

relevant points about the use of big data analytics for cyber security and believe that it is the

right tool needed to handle the cyber security challenges of today. They make a similar point,

to Curry et al (2013), about the cleverness of current cyber-attackers and the sophistication

of the malware being released. They explain that malicious adversaries who are clever at

hiding their attacks are the difference between applying big data analytics and traditional data

analytics. In other words, traditional data analytics are simply not enough to handle these

attackers. Another point they make, which resonates with Laitan (2014), is that

an attack can be very quick (“fraction of a second”) and so the defender will need to react

quickly, and therefore, systems (such as big data analytics) that can deal with such time scales

are needed.

3.4.2 Part II: models for protecting CIIs

With regard to protecting CIIs, Ulltveit-Moe et al (2013) propose the concept of information

sharing and best practices, between computer emergency response teams in Europe, as

proposed by the EU – the so called European Information Sharing and Alert System (EISAS).

They argue that the sharing of security information such as cyber-attacks and vulnerabilities

between these organisations will not only improve the security of the respective



organisations, but will also reduce the cost of protecting one’s own system. They identify

some challenges to this approach (of information sharing) – such as trust, lack of awareness,

and lack of standards to enforce the privacy of information – and propose methods of how to

address these issues. The method they propose to overcome these challenges is based on the

PRECYSE (Prevention, Protection and Reaction to cyber-attacks to Critical Infrastructures)

project, which is essentially the development of a methodology, architecture and latest

technology and tools (PRECYSE, 2012). To be clear, Ulltveit-Moe et al (2013) are not proposing

PRECYSE as a method for protecting CIIs (although, this is what the PRECYSE project is all

about), but rather adapting its principles as a way of overcoming the challenges (of their

solution) of information sharing. They rightly point out that traditional methods of protecting

CIIs against cyber-attacks such as APTs have been ineffective, and therefore these

(traditional) security systems must be treated as fundamentally insecure. However,

their proposal involves quite a few traditional approaches. For example, they propose the

use of “privacy metrics” in Intrusion Detection Systems (IDS) to reduce the occurrence of false

alarms, but it seems these IDSs are based on traditional data sources.

Another approach to protecting CIIs, which is also based on the PRECYSE project, is offered

by McLaughlin et al. (2014). They make a rather interesting claim that, ultimately the

information security posture of CII organisations (especially small-to-medium size ones) will

be based on money and not necessarily on how powerful or sophisticated the cyber-attacks

that the organisations face are. They explain that given the current types of attacks (APTs

such as Stuxnet, Duqu, etc.) experienced by organisations, it is only a matter of time before an

“all powerful” adversary will defeat security systems no matter how secure. In other words,

having a “defence-at-all-cost” system does not make a good business case, especially for

small-to-medium size CII organisations. They have rightly identified that the basic strategy

employed by most security systems is perimeter defence, and therefore malicious activities

that have breached the perimeter, and are inside the system, are not likely to be detected.

Hence, the solution they propose, which is essentially an architecture for the detection of

cyber-attacks, consists of an integrated system that includes features such as interior anomaly

detection as well as a perimeter defence, and a countermeasure management system. With

regards to the interior anomaly detection feature, the source of the data to be used for the

detection (of anomalous behaviour) is obtained from both the corporate ICT system (or

network) and the Industrial Control System (ICS). They mention that similar solutions (to what

they are proposing) from commercial organisations employ big data analytics but do not

readily combine data from both the corporate network and ICS systems. From the description

given, it seems that although the data used for security analysis is pooled from the two

systems, it is static rather than dynamic. One would not be wrong to think that this makes

their proposed system quite vulnerable, since the speed of an attack can be less than a

second. Also, it seems a lot of manpower and different tools will be required in order to be

successful. Perhaps a big data analytics system (that can pool data from both systems as well

as from external sources) can make such a system more secure in that it will be able to



monitor dynamic (real-time) as well as static data in much less time and with much less

manpower and resources.

Unlike Ulltveit-Moe et al. (2013) and McLaughlin et al. (2014), who discussed technical

solutions for the protection of CIIs, Ma, Smith and Skopik (2013) identify the security

analysis exercise as a key element for the protection of CIIs. This is highly relevant because

regardless of how sophisticated the security system may be, it will be ineffective if it is not

deployed properly; and to be able to do so, a security or risk analysis exercise is needed to

reveal where the vulnerabilities may be. Therefore, in their research article, Ma, Smith and

Skopik (2013) present a thorough description of a CII in the form of an architectural model.

The idea is that in order to identify vulnerabilities in any system, one must understand how

the various components are connected together. The description they present truly simplifies

the complexity of a CII, and although the model is in relation to an industrial system, it does

not conflict with other standalone CIIs such as banking systems.

Although the main focus of their article is about the protection of power grids from cyber-

attacks such as APTs, the solution presented by Skopik, Friedberg and Fiedler (2014) can be

adopted for other CIIs. They mention that the effort to modernise and digitise existing power

grids in order to make them more efficient has widened the attack surface for these

infrastructures. They explain that the reason why traditional systems, such as IDSs and

security information and event management (SIEM), fail to detect APT attacks is that

they are based on signatures of known malware, whereas these APTs are designed to exploit

zero-day vulnerabilities. Therefore, the solution they propose avoids signature based

solutions, and relies on statistical analysis of data about system behaviour stored in the

system log files. In other words, data analytics is used in order to uncover unusual behaviour.

Since the data (to be analysed) comes from internal activities, one wonders whether

this will be enough to uncover an attack, since a system can be breached within a second.
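As a rough illustration of what a statistical (rather than signature-based) analysis of log data can look like, the sketch below flags time windows whose event counts deviate strongly from the average. It is not the method of Skopik, Friedberg and Fiedler (2014); the function name `anomalous_windows`, the threshold and the sample counts are assumptions made purely for illustration.

```python
from statistics import mean, pstdev

def anomalous_windows(event_counts, threshold=2.0):
    """Return indices of windows whose count lies more than `threshold`
    population standard deviations away from the mean count."""
    mu = mean(event_counts)
    sigma = pstdev(event_counts)
    if sigma == 0:
        # All windows are identical, so nothing stands out.
        return []
    return [i for i, count in enumerate(event_counts)
            if abs(count - mu) / sigma > threshold]

# Hypothetical number of log events per time window (illustrative only).
counts = [12, 11, 13, 12, 10, 11, 12, 95, 11, 12]
print(anomalous_windows(counts))  # the burst in window 7 is flagged
```

A real deployment would need a far richer behavioural model than a single count per window, but the principle is the same: no prior signature of the attack is required.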

The solution and subsequent model for the protection of a CII as proposed by Tsegaye and

Flowerday (2014) are sound but are based on traditional methods of protecting information

systems, which are currently proving to be ineffective against zero day attacks as observed by

Skopik, Friedberg and Fiedler (2014). The solution consists of three main controls: preventive,

detective and corrective. For the preventive, they describe instruments or tools such as

policies, firewalls, anti-virus software. Detective tools include anti-virus software and IDS, and

the corrective tools also include anti-virus software and a disaster recovery plan. With this

approach, one cannot help but wonder how the CII will be able to withstand the types of

cyber-attacks organisations are currently facing.

3.4.3 Part III: big data analytics for cyber security in CII

With respect to using big data analytics to protect CIIs, Hurst, Merabti and Fergus (2014)

present a method which involves the use of behavioural observation and big data analysis.



Although this method is not specifically for a CII, it is indirectly included since CIIs are a subset

of CIs. This method, which is actually a type of technology, is designed to monitor the

behaviour of the CI in terms of its data processing and to detect any abnormalities, which

might be a cyber-attack. The system, which was tested with data sets from a simulator, and

analysed with mathematical classification techniques, allowed the researchers to

demonstrate the effectiveness of the method. The system (or method) they propose does not

present a big data analytics tool for the protection of a CI per se, but rather it is about how

extra observations of anomalous behaviour (in the data operations of a CI) can be performed

using big data analytics techniques in order to add to the defence in depth. They also make

the point that, the nature of the current threats combined with the fact that the traditional

methods of defence are not up to the task means that new and original methods of protecting

CIs are needed.

3.5 Literature review conclusion

As observed by McLaughlin et al. (2014), it is only a matter of time before malware that can

defeat any security system is created. This statement might seem a bit extreme, but

considering the nature of current cyber-attacks and malware (see section 2.6) that have been

experienced, and the fact that cyber-attacks are designed for specific targets, it will not be

amiss to say that CIIs face very dangerous, knowledgeable and well-resourced attackers.

From the review analysis and synthesis, it was quite obvious that, although they are still

relevant, traditional security systems are no match for cyber-attacks launched with APTs. This

is because, as explained by Cardenas, Manadhata and Rajan (2013), traditional security

systems are based on traditional analytics which are based on limited storage, slow speed,

and a specific data type. Therefore, there is a need for a more dynamic system.

Also, in the era of cloud computing and BYOD, the traditional network perimeter does not

really exist since employees can connect to an organisation’s system from any remote

location. Thus, traditional systems, which are effectively designed for network perimeters and

are ‘prevention-centric’ are no longer effective. Consequently, ‘detection-centric’ systems

will be more suitable for the current cyber-attack climate (Brewer, 2015).

The undeniable theme that emerged from this review is that a dynamic system, that is based

on big data analytics, is the solution that can handle the current cyber-attack threat

landscape. However, because big data analytics is still in its infancy, not a lot is known about

its power, especially in the way it has been used for cyber security. From the review, here is

what was determined about big data analytics for cyber security:

1. Although all the researchers reviewed extolled the “greatness” of big data analytics as a

tool for cyber security, there was not much evidence about the successes (or failures).

According to Everett (2015), this (lack of evidence) is attributed to the fact that



organisations are unwilling to share their experiences for fear that it might expose their

vulnerabilities. Also, not many organisations have the expertise and resources to utilise it.

2. Research about how big data analytics is or can be used to protect CIIs is limited. Entering

the key phrase, “using big data analytics for the detection of cyber-attacks in critical

information infrastructures”, in the relevant journal databases mentioned above yielded

almost no research papers, as can be observed in Tables 1 and 2. Some of the solutions

presented for the protection of CIIs were in the form of models and frameworks, but there

was certainly no (standardised) model of meaningful practice about how big data analytics

can be deployed to protect CIIs.

3.5.1 Research gaps

From the literature review, one can determine that although big data analytics is seen as a

very powerful and relevant technology for the detection and prevention of cyber-attacks,

there is limited documentation of exactly how effective it is.

Also, there is no abstract model of meaningful practice about how big data analytics is used

for cyber-attack detection. Such a model might help new organisations that seek to employ

the technology, to avoid “trial and error” situations, which could prevent potential attacks

and might perhaps save them money.

3.5.2 Research question(s)

Given the research gaps outlined above, the main research questions for this study are:

• How effective is big data analytics when used for the detection and prevention of cyber-

attacks?

• In using a BDA technology, what is the overall strategy for deploying the system; is there

a specific methodology (or model) that is followed?

Other related questions are:

• What is the nature of the threat landscape as far as CII organisations are concerned?

• What are the sources of data for the big data analytics technology being used?



4.0 Research Methodology

The selected methodology for this study is quantitative research. With respect to the nature

of the information to be obtained in this research, the data analysis will be conducted by

means of descriptive statistical techniques such as frequency tables; to facilitate this process,

statistical applications such as SPSS will be used. Details of this analysis approach are provided

in the results section (i.e. section 5).

In this section, the justification for selecting this particular methodology as well as the data

collection process, including questionnaire design, are discussed.

4.1 Justification

Quantitative research consists of experimental designs as well as non-experimental designs

such as surveys, but in either case, the main objective is to test or measure the impact of a

treatment, an intervention, a trend or an opinion (Creswell, 2014). The result will provide a

numeric description of the entity (i.e. treatment, trend or opinion) being measured, which

will make it easy to establish its effectiveness (or lack of it).

In this research, the main objective is to measure the effectiveness of big data analytics for

the detection and prevention of cyber-attacks, and the best way to establish this is to ask the

people (or organisations) that use them. Thus, the best methodology for collecting the

relevant data should be quantitative. To be specific, a survey by questionnaire will be

employed in this study.

A subsidiary objective of this study is to develop a model of best practice in the use of big data

analytics for cyber security. It is highly conceivable that organisations that achieve success will

deploy their (big data) technology in similar ways. Based on the similarities of their

operations, a model will be developed. Therefore, since a model (or an artefact) is to be

created, another research methodology considered was Design Research (DR). However, it

was decided that DR will not be suitable for the purposes of this research. This is because one

of the key aspects of the DR methodology is that, there must be a clear understanding of the

design problem. This research study is more exploratory in nature as one of the key parts is

to establish the effectiveness of big data analytics for cyber security. In other words, this

research is not intended to solve any known problem but rather to establish the success of an

operation and to describe a generic method (in the form of a model) of how it is done.

A key assumption of this research is that using big data analytics for the detection of cyber-

attacks (especially stealthy attacks) is very effective. However, there is a chance that this

might not be the case, and that this research will reveal a problem (in the method of how the big

data technology is deployed); should that be the case, future work or research could use DR

to solve that problem.



4.2 Questionnaire design

In its basic form, quantitative research by survey is simply a matter of asking the relevant

target audience question(s) about the issue one’s research is about. For example, if a teacher

wants to judge if her lesson was fun, she can simply pose the question, “Did you find this

lesson fun?”, and count the number of “yes” responses. However, if she wants to judge the

effectiveness of her lecture, just asking such a direct question might not yield accurate results.

First, she may have to define what “effectiveness” actually means; is it about whether the

students found the lesson fun or whether they understood the main points of the lesson, or

both of these points. Then, she will have to frame the questions in a way that will allow her

to obtain the information she needs without bias.

This shows that, depending on the nature of the research, question design for a survey

(especially by questionnaire) can be a complex but very important process. The parameters

or variables – which may be defined as the questions or pieces of information that are collected

from the target audience (Open.edu., 2014) – of the main research question must be carefully

thought through and the wording of the questions must be clear, devoid of ambiguity and

must reflect the parameters. In other words, the questions one asks are as important

as how they are asked.

With regards to deciding what questions to ask, Leung (2001) explains that there are three

main types of information to be obtained:

• The first type is the main information one is seeking to obtain from the chosen target

audience. This is known as the dependent parameters.

• The second type is the information that might bring more meaning to the main information (i.e.

the dependent parameters). These are known as the independent parameters.

• The third type refers to external factors that might affect or distort the final results. These

are called confounding parameters.

In order to determine the parameters (or questions) of this research, the context for the word

“effectiveness” in the research question (see section 3.5.2) was first established. Considering

the fact that the stealth attacks are the most difficult, if not impossible, to detect by

traditional methods, it was decided that (based on the nature of an APT and the speed of an

attack), an effective BDA system should have the following attributes:

• Able to detect stealth attacks significantly more often than not – at least 75% of the time.

• The speed of detection must be within seconds, minutes or hours, but no more than a

day.
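The two attributes above can be expressed as a simple check. The following is a minimal sketch only; the function name `is_effective` and the way detection time is represented are assumptions made for illustration, while the two thresholds are the ones stated above.

```python
from datetime import timedelta

# Thresholds taken from the two attributes listed above.
MIN_DETECTION_RATE = 0.75               # at least 75% of stealth attacks detected
MAX_DETECTION_TIME = timedelta(days=1)  # detected within a day at most

def is_effective(detection_rate: float, detection_time: timedelta) -> bool:
    """Return True only if both effectiveness attributes are satisfied."""
    return (detection_rate >= MIN_DETECTION_RATE
            and detection_time <= MAX_DETECTION_TIME)

# Hypothetical example: 80% of stealth attacks detected, typically in 5 minutes.
print(is_effective(0.80, timedelta(minutes=5)))  # True
```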

Having established the context by which an effective BDA system will be determined, an

outline of the dependent parameters, as well as independent parameters, for the

questionnaire with their corresponding justification are presented in Table 3. For the actual

questionnaire, including details of the options for each question, check Appendix i.



No. | Parameter | Rationale
1 | What is the size of your company? | Given that currently, only large global companies use BDA for cyber security, this question will serve two purposes: confirm that BDA for cyber security is indeed exclusive to large organisations; establish the legitimacy of the respondent in terms of whether he/she has indeed used a BDA system for cyber security.
2 | What is your job role? | This question is also to establish the legitimacy of the respondent. If they have no cyber security or data science credentials, it is highly unlikely that they will have any knowledge or experience about BDA systems, given that the technology is still in its infancy.
3 | How long have you been using a big data analytics system for cyber security? | BDA technology is still in its infancy so it is highly unlikely that anyone will have more than 5 years of experience in its use.
4 | Do you use your big data analytics system together with traditional cyber security systems (such as intrusion detection systems, intrusion prevention systems, anti-virus programs, firewalls, etc.)? | The purpose of this question is to determine how organisations deploy a BDA system for cyber security.
5 | Have you completely replaced your traditional cyber security systems with a big data analytics system? | This question is a follow up on the previous one, and it is to ensure that the respondent is taking the survey seriously and not just selecting random answers. For example, ‘yes’ and ‘yes’ responses for questions 4 and 5 respectively signal that the respondent might not be taking the survey seriously because one cannot have the two situations occurring at the same time.
6 | What type of big data analytics processing do you perform? | The purpose of this question is to determine the nature of the BDA system that is being used for cyber security.
7 | Where do you source your data from, for your security analytics? | The purpose of this question is to determine how organisations deploy a BDA system for cyber security.
8 | What type of big data analytics technology do you use? | The purpose of this question is to determine the type of the BDA system that is being used for cyber security. Perhaps a particular one of choice for the range of organisations surveyed might indicate that it is more effective than the others.
9 | Which of these big data storage systems or databases do you use? | The type of storage system being used can affect the speed of the analytics process. Perhaps a particular one of choice for the range of organisations surveyed might indicate that it is more effective than the others.
10 | What percentage of the cyber-attacks you have experienced do you consider as targeted? | This is to establish the level of targeted attacks against the organisation.
11 | What percentage of these targeted attacks do you consider as stealth attacks (e.g. advanced persistent threats)? | Targeted attacks can be in the form of DDoS, spear phishing, etc. so the purpose of this question is to get an idea of how much of the targeted attacks are stealthy.
12 | Are you able to detect these stealth attacks with your big data analytics system? | This is to establish the effectiveness of the BDA system.
13 | About what percentage of these stealth attacks are you able to detect with your big [...] | This is to establish the effectiveness and efficiency of the BDA system.
14 | How long does it usually take you to detect these stealth attacks? | [...] system.
15 | Overall, using a big data analytics system has enhanced your ability to detect stealth attacks. | [...] system.
16 | Overall, using a big data analytics system has made your cyber security operations more efficient. | [...] system.

Table 3 - parameters of the questionnaire with justification
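The consistency check described in Table 3 for questions 4 and 5 (a respondent cannot both use BDA together with traditional systems and have completely replaced those systems) could be applied to the collected responses along the following lines. This is a sketch under assumptions: the function names and the representation of each response as a pair of booleans are hypothetical, and the sample data is invented for illustration.

```python
def is_consistent(q4_uses_both: bool, q5_replaced: bool) -> bool:
    """A 'yes' to both Q4 and Q5 is contradictory; every other pair is accepted."""
    return not (q4_uses_both and q5_replaced)

def flag_inconsistent(responses):
    """Return the positions of respondents whose Q4/Q5 answers contradict each other."""
    return [i for i, (q4, q5) in enumerate(responses)
            if not is_consistent(q4, q5)]

# Hypothetical (Q4, Q5) answers for four respondents (True means 'yes').
sample = [(True, False), (True, True), (False, True), (False, False)]
print(flag_inconsistent(sample))  # [1]
```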

4.2.1 Target audience and sample size

According to a Gartner report, only large global organisations can currently afford to use BDA

for cyber security (Laitan, 2014). Therefore, the target audience for this research is large

organisations whose operations are critical in nature to a country’s economy; specifically, the

questionnaire will be directed at cyber security professionals that work for these large

organizations.

Also, given that only an estimated 25 percent of these large global companies are likely to use

BDA, the sample size for this research is likely to be relatively small. However, it is important

that an accurate sample size is used and in order to do so, several methods can be used.

Bartlett, Kotrlik and Higgins (2001) discuss a range of these methods for determining the

sample size, but it seems the best approach involves the use of two main variables (or

statistics), namely, confidence interval (or margin of error) and confidence level.

The confidence interval is the acceptable range for which one’s estimated value (or result)

will be considered accurate (Open.edu.,2014). For example, if one sets the margin of error to

be 10% and the estimated value after the research is 80%, then the accuracy of this value will

be plus or minus 10%, which works out to be in the range, 70% to 90%.

The confidence level expresses how likely it is that the result falls within the confidence interval. A

common value that is often used is 95%, which means that there is a 5% chance that the result

will be outside the confidence interval (Open.edu.,2014).

Another statistic which is used in determining the sample size is the z-score, which is

determined by the chosen confidence level. To put it simply, it gives the number of standard

deviations from the value determined (from the result) that corresponds to that level. One can

determine the z-score for a given confidence level by looking at a standard chart.

The formula for calculating the sample size, that involves these two variables, is derived from

the formula for calculating the margin of error, as shown in Figures 7 and 8.



$$ME = Z\sqrt{\frac{P(1-P)}{n}}$$

Figure 7 - mathematical equation for calculating margin of error (ME)

$$n = \frac{Z^{2}\,P(1-P)}{ME^{2}}$$

Figure 8 - mathematical equation for calculating sample size n, derived from the ME equation

In both of these equations (shown in Figures 7 and 8), ME is the margin of error; Z is the z-

score; P is the estimate of the proportion of large global organisations that use a BDA system

for cyber security; and n is the sample size to be determined.

For this research, the ME is set to 10% (i.e. ME is 0.1) due to the fact that the number of large

global companies that use BDA for cyber security is not exactly known, and the figure given is

only a prediction (albeit from a reputable organisation). Also, although the questionnaire will

be directed at large global organisations, they will not be contacted directly, since they are

not known to this researcher. For this same reason, the confidence level for this research

was chosen to be 90%. Looking at a z-score chart (http://www.stat.ufl.edu/, n.d), the z-score

for this value is 1.29.

As already mentioned earlier, it is projected that about 25 percent of large global companies

will adopt big data technologies for cyber security or fraud detection (Rivera, 2014). This

means that the value of P (for the equation shown in Figure 8) for our research is 0.25.

Applying these values (of ME, Z and P) to the equation in Figure 8, the sample size n for this

research works out to be 31.
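The calculation can be reproduced with a few lines of code. The sketch below uses the values stated in the text (Z = 1.29, P = 0.25, ME = 0.1); the function names are chosen here for illustration only.

```python
import math

def margin_of_error(z: float, p: float, n: float) -> float:
    """ME = Z * sqrt(P(1 - P) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(z: float, p: float, me: float) -> float:
    """n = Z^2 * P(1 - P) / ME^2, obtained by solving the ME equation for n."""
    return (z ** 2) * p * (1 - p) / (me ** 2)

n = sample_size(z=1.29, p=0.25, me=0.1)
print(round(n))  # 31 (the unrounded value is about 31.2)
```

The z-score is left as a parameter so that other table values can be substituted; for instance, two-sided tables commonly list 1.645 for a 90% confidence level, which would give a larger sample size.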

4.2.2 Data collection

After the questionnaire had been designed and created, it was tested with the relevant target

audience. To do this, the questionnaire was sent to five cybersecurity professionals who, to

provide feedback, were asked to respond to a few

directed questions about the questionnaire. Although three of them confessed that they were

not really familiar with the subject content (of big data analytics), they all agreed that the

questions were clear and easy to understand. Also, they all agreed that the use of closed

questions made the survey more user friendly and easy to complete.



To collect the data (from the identified target audience), the questionnaire was deployed on the online survey

service platform, where it was produced. This platform had a facility which allowed a specific

target audience to be selected. For the purposes of this research, the audience selected

included people that worked in the technology, telecommunications and Internet industries.

Also, different information security groups on the LinkedIn social media platform were

targeted. Finally, to ensure that every option (for obtaining feedback from the target

audience) was explored, emails were also sent to top security firms. The outcome and analysis

of the data collected are presented in “The Results” section (i.e. section 5).



5.0 The Results

Since the parameters of the questionnaire consist of groups of categories (e.g. less than a

year; between 1 to 3 years; between 3 to 5 years), the data collected can be summarised and

described by determining how many times a category occurs. This technique for

analysing quantitative data is called frequency analysis.

Therefore, in this section, the data collected in the survey is analysed using frequency analysis.

This means that, for each question, the number of times a category occurs (in the responses)

is counted and calculated as a relative percentage. These are presented in a frequency table

as well as in a pie chart. Also, the meaning and conclusions drawn from this analysis are

discussed; this includes the limitations and accuracy of the survey as well as the consequent

model of meaningful practice of using BDA for cyber security.
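The frequency analysis described above amounts to counting each category and expressing the count as a share of all responses to that question. The following is a minimal sketch of that procedure; the function name and the sample answers are hypothetical and are not taken from the survey data (the actual analysis was carried out with a statistical application).

```python
from collections import Counter

def frequency_table(responses):
    """Map each category to (count, percentage of all responses)."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {category: (count, 100 * count / total)
            for category, count in counts.items()}

# Hypothetical answers to a single closed question (illustrative only).
answers = ["Less than a year", "Less than a year",
           "Between 1 to 3 years", "Between 3 to 5 years"]

for category, (count, pct) in frequency_table(answers).items():
    print(f"{category}: {count} ({pct:.0f}%)")
```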

5.1 Analysis

For each question, the findings of the survey are analysed and their implications discussed.

Q1: What is the size of your company?

Answer choices No. of responses Percentage

1 – 4 employees 9 21%



20 – 99 employees 4 10%

100 – 499 employees 7 16%

500 – 599 employees 4 10%

1000 – 4999 employees 2 5%

5000 – 9999 employees 6 14%

10,000 + employees 7 16%

Total 43 100%

Table 4 - Frequency analysis for Q1



Chart 1: this chart shows the percentage distribution of the responses to question 1.

The rationale behind question 1 was to confirm that BDA for cyber security is indeed exclusive

to large organisations. Looking at the responses to this question (see Chart 1 and frequency

table, Table 4), it seems BDA is not exclusive to large organisations; only 35% (i.e. >500

employees) work for large organisations.

However, the outcome of question 2 reveals that about 35% (see Table 5 and Chart 2) of the

respondents are cyber security professionals. This might be an indication that these cyber

security professionals are the same 35% that indicated that they work for a large organisation.

If so, BDA is indeed exclusive to large organisations.

Q2: What is your job role?


Information security officer 2 5%

Information security analyst 5 12%

Information security manager 0 0%

Information security auditor 1 2%

Information security consultant 4 9%

IT/Network manager 1 2%

IT/Network administrator 0 0%

Chief information security officer (CISO) 0 0%

Chief information officer (CIO) 2 5%

Other (please specify) 28 65%

Total 43 100%





In Chart 2 and Table 5, one can observe that the percentage of respondents that chose “other”

is quite significant and it might be interesting to find out what a further investigation might

reveal; there is a chance that the background of these respondents (that chose “other”) might

suggest no knowledge or experience in using BDA systems for cyber security.

Q3: How long have you been using a big data analytics system for cyber security?


Less than a year 10 23%

Between 1 to 3 years 9 21%

Between 3 to 5 years 1 2%

More than 5 years 1 2%


Total 43 100%





The chart and frequency table for Q3 (Chart 3 and Table 6) show that more than a third of

the respondents have been using BDA systems for 5 years or less. It seems the same

respondents that selected “other” for question 2, also selected “other” for this question. This

might indicate that these respondents have little or no knowledge of BDA technologies.

Q4: Do you use your big data analytics system together with traditional cyber security

systems (such as intrusion detection systems, intrusion prevention systems, anti-virus

programs, firewalls, etc.)?


Yes 19 48%

No 21 52%

Total 40 100%





Comments on the responses to Q4 are combined with those on Q5 below, since the two

questions are connected.

Q5: Have you completely replaced your traditional cyber security systems with a big data

analytics system?


Yes 5 12%

No 35 88%

Total 40 100%





Combining the responses for questions 4 and 5, one can observe (from Charts 4 and 5 as well

as Tables 7 and 8) that most organisations deploy their BDA cyber security systems in

conjunction with traditional cyber security systems. This suggests that this may be the

best way to deploy them. This makes sense because the more security event data one can analyse,

the more likely one is to detect inconsistencies and threats. It must be mentioned here that

the speed of the analytics is also essential.

Q6: What type of big data analytics processing do you perform?


Batch 6 17%

Stream/Real-time 7 20%

Both (Batch and Stream) 22 63%

Total 35 100%





Comments on the responses to Q6 are combined with those on Q7 below, since the two

questions are connected.

Q7: Where do you source your data from, for your security analytics?


Internally – from traditional systems such as SIEM 8 22%

Externally – from cyber security events on the Internet 4 11%

Both internal and external sources 24 67%

Total 36 100%





The responses for questions 6 and 7 (see Charts 6 & 7 as well as Tables 9 & 10) suggest that

the meaningful practice is to use data from both internal and external sources for performing

both batch and real-time analytics. In other words, both batch and real-time analytics are

necessary for detecting cyber-attacks, and also data should be obtained from any source, so

long as it is relevant.

Q8: What type of big data analytics technology do you use?


S4 1 3%

Hama 0 0%

Hadoop 8 22%

Storm 4 11%

Spark 7 19%


Total 36 100%





Comments on the responses to Q8 are combined with those on Q9, since the two questions

are connected.

Q9: Which of these big data storage systems or databases do you use?


CouchDB 4 0%

MongoDB 3 21.4%

HBase 8 42.9%

Cassandra 2 0%

Giraph 0 0%

CouchBase 0 0%

Riak 2 0%

Redis 0 0%

Neo4j 0 0%

OrientDB 1 0%

Other (please specify) 18 35.7%

Total 38 100%





The rationale behind questions 8 and 9 (see Charts 8 & 9 as well as Tables 11 & 12) was to

establish which technology is most widely used, and therefore most efficient and user

friendly. The results are inconclusive; perhaps these questions were not necessary since an

organisation’s choice of BDA technology will depend on its needs.

Q10: What percentage of the cyber-attacks you have experienced do you consider as

targeted?


Less than 10% 18 42%

Between 10% and 20% 7 16%




More than 50% 1 2%


Total 43 100%





Looking at the frequency table (Table 13) and chart (Chart 10) for question 10, it

seems different organisations experienced different levels of targeted attacks, although the

largest group (42%) considered less than 10% of the attacks to be targeted.

Q11: What percentage of these targeted attacks do you consider as stealth attacks (e.g.

advanced persistent threats)?







More than 50% 2 5%


Total 43 100%





The outcome of this analysis suggests that stealthy attacks may not be as common as some

might think.

Q12: Are you able to detect these stealth attacks with your big data analytics system?


Yes 20 47%

No 23 53%

Total 43 100%





This analysis seems to suggest that BDA systems are able to detect stealthy attacks only about 50%

of the time (as shown in Chart 12 and Table 15).

Q13: About what percentage of these stealth attacks are you able to detect with your big








More than 50% 12 31%


Total 39 100%





For a BDA system for cyber security to be effective, it was determined that it should be able to

detect stealthy attacks at least 75% of the time (see section 4.2). This result seems to suggest

that the success rate is more than 50%. With hindsight, this question should have

had the choice of “More than 75%”. Since “More than 75%” falls into the “More than 50%”

set, one can conclude that a BDA system is effective.

Q14: How long does it usually take you to detect these stealth attacks?


Seconds 7 16%

Minutes 13 30%

Hours 6 14%

Days 2 5%

Weeks 1 2%

Months 1 2%


Total 43 100%





Another measure of the effectiveness of a BDA system for cyber security was determined to be
that it should be able to detect stealthy attacks within seconds, minutes or hours (see
section 4.2). This result (as shown in Chart 14 and Table 17) suggests that it most commonly
takes a matter of minutes to detect stealth cyber-attacks using a BDA system. In other words,
this result suggests that a BDA system for cyber security is effective in detecting stealthy
cyber-attacks.
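The arithmetic behind this reading of Q14 can be checked with a short sketch. It uses only the frequencies listed above for Q14; the variable names are illustrative, and the listed rows account for 30 of the 43 responses, so the share is shown against both bases.

```python
# Frequencies listed for Q14 (time taken to detect stealth attacks).
# Only the rows visible in the table above are used; names are illustrative.
q14 = {"Seconds": 7, "Minutes": 13, "Hours": 6, "Days": 2, "Weeks": 1, "Months": 1}
total_respondents = 43  # total reported in the table

# Section 4.2 treats detection within seconds, minutes or hours as effective.
fast_categories = ("Seconds", "Minutes", "Hours")
within_hours = sum(q14[name] for name in fast_categories)
listed = sum(q14.values())  # responses covered by the listed rows

share_of_all = within_hours / total_respondents
share_of_listed = within_hours / listed

print(f"Detected within hours: {within_hours} responses")
print(f"Share of all {total_respondents} respondents: {share_of_all:.0%}")
print(f"Share of the {listed} listed responses: {share_of_listed:.0%}")
```

Seen this way, the "minutes" category is the single most frequent answer, while the combined seconds-to-hours group is the figure that corresponds to the effectiveness criterion.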

Q15: Overall, using a big data analytics system has enhanced your ability to detect stealth

attacks.


Yes 23 53%

No 20 47%

Total 43 100%





Comments on the responses to Q15 are combined with those on Q16, since the two questions
are connected.

Q16: Overall, using a big data analytics system has made your cyber security operations

more efficient.


Yes 21 49%

No 22 51%

Total 43 100%





The rationale behind questions 15 and 16 was to establish the effectiveness of BDA systems
for cyber security in a direct manner. The responses to these questions (as shown in Charts
21 & 22 as well as Tables 18 & 19) show a roughly 50-50 split: 53% to 47% for Q15 and 49% to
51% for Q16. In other words, about half of the respondents agree that a BDA system is
effective in dealing with stealth attacks whilst the other half disagree. Perhaps the reason
for this split outcome is inexperience with the technology, or perhaps some of the
respondents were simply guessing because they have no knowledge of it.
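One way to see why a split like this is hard to interpret is to put an approximate confidence interval around each "Yes" proportion. The sketch below is not an analysis performed in this study; it assumes a simple random sample and uses the normal approximation with z = 1.96 (a 95% level), and the function name is hypothetical.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    z = 1.96 corresponds to a 95% level; this is an assumption of the sketch.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# "Yes" counts taken from the Q15 and Q16 frequency tables above.
for label, yes, n in (("Q15", 23, 43), ("Q16", 21, 43)):
    p, low, high = proportion_ci(yes, n)
    print(f"{label}: {p:.0%} yes, approx. 95% interval {low:.0%} to {high:.0%}")
```

With 43 respondents both intervals are wide and contain 50%, which is consistent with treating the responses to these two questions as inconclusive.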

5.2 The model

The outcome of the survey, especially questions 4 and 5, shows that BDA systems are always
used in conjunction with traditional security systems. Combining this outcome with
information from the literature (see section 2.0), a model of how a BDA security system can be
deployed was developed; this is presented in Figure 9, below.

In this model, the concept of defence in depth (DiD) is applied whereby traditional

information security tools are used as the first level of defence. At this level, data generated

by using traditional security detection systems such as NIDS/NIDPS, Antivirus, firewalls, etc.

can be collated with a SIEM tool and used as a source of (internal) data for the BDA security

system which forms the second level of defence. Since the BDA security system is more

powerful in that it can crunch much more data and at a much faster rate than traditional

systems, it is important that the BDA system is placed closer to the data asset to be protected.

This way, if a stealth cyber-attack is missed by the first level of defence (i.e. SIEM tool) due to
lack of processing time (as attacks can occur in a matter of seconds) and relevant data, there
is a higher probability that the BDA security system will be able to detect it, since it has access to
more relevant data.



In a nutshell, this model can be described as follows: common cyber-attacks, such as known
malware and general hack attacks, can be detected by the hardened traditional systems; if a
sophisticated and stealthy attack manages to bypass this first level of defence, that is the
traditional system, it will have to deal with the BDA security system, which is much more
formidable. In a worst-case scenario where the attack still manages to breach the BDA security
system, this system should still be able to detect the attack before it is too late (a day
might be too late) and before it causes too much damage.

Figure 9 – a derived model showing how a BDA security system can be deployed
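The two-level flow described for this model can be illustrated with a minimal sketch. It is not the implementation of any product or of the model itself: the event fields, the signature list and the threshold are hypothetical, a signature lookup stands in for the traditional tools (antivirus, NIDS, firewall), and a simple per-user baseline check stands in for the much richer analytics a BDA system would run.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Event:
    user: str
    signature: str   # hypothetical identifier, e.g. a file hash
    bytes_out: int   # outbound traffic volume for this event


KNOWN_BAD = {"sig-malware-001"}  # hypothetical signature list


def traditional_layer(event):
    """Level 1: signature match, a stand-in for AV/NIDS/firewall checks."""
    return event.signature in KNOWN_BAD


def bda_layer(event, history, threshold=3.0):
    """Level 2: flag traffic far above the user's stored baseline."""
    past = history.get(event.user, [])
    if len(past) < 2:
        return False  # not enough stored data to judge
    mu, sigma = mean(past), pstdev(past)
    if sigma == 0:
        return event.bytes_out > mu
    return (event.bytes_out - mu) / sigma > threshold


def classify(event, history):
    """Apply the two levels in order, as in a defence-in-depth arrangement."""
    if traditional_layer(event):
        return "blocked by level 1"
    if bda_layer(event, history):
        return "flagged by level 2"
    return "allowed"


# Stored (batch) data: past outbound volumes per user, assumed for illustration.
history = {"alice": [100, 120, 110, 90, 105]}
events = [
    Event("alice", "sig-malware-001", 100),  # known malware
    Event("alice", "sig-unknown-42", 5000),  # unknown signature, unusual volume
    Event("alice", "sig-unknown-43", 108),   # ordinary activity
]
results = [classify(e, history) for e in events]
print(results)
```

The second event is the case the model is concerned with: it carries no known signature, so the first level lets it through, and it is only caught because the second level compares it against historical data.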



6.0 Conclusion

Targeted cyber-attacks against CIIs such as international financial systems and national

communication systems are on the rise, and a breach of such a system could be devastating

to a country’s economy or even trigger a war. The level of sophistication and stealthy nature

of these targeted attacks as well as the success rate (one breach is too many) demonstrate

that traditional cyber security systems are proving to be ineffective. Thus more innovative

solutions for the detection and prevention of stealthy and targeted cyber-attacks are needed.

This research explored one such solution, which is the use of BDA systems.

In the literature review, it was established that BDA systems are capable of meeting this

challenge, and perhaps they provide the best solution for protecting CIIs. However, it was not

clear just how effective these BDA systems are (in detecting and preventing cyber-attacks),

since it is still early days for this technology. Also, the literature review did not reveal any

established method or strategy of using a BDA system for the detection and prevention of

cyber-attacks. Therefore, this study sought to answer these particular questions:

1. How effective are BDA systems for detecting and preventing cyber-attacks?

2. Is there a model of best practice for deploying a BDA system for cyber security?

With respect to the first question, the survey results were inconclusive with about 50% of the

respondents agreeing that the use of BDA systems for detecting and preventing cyber-attacks

is effective (as shown in the analysis of Q15 and Q16 in section 5) and the other 50%

disagreeing. Also, going by the analysis of Q13 and Q14 (see section 5), a BDA security system

is an effective tool for detecting stealthy attacks. Therefore, one can conclude that although

the jury is still out on the effectiveness of BDA security systems, compared to traditional

security systems, they are definitely an effective tool for detecting targeted and stealthy

cyber-attacks and must be used for the protection of CIIs.

For the second question, the research (literature review and survey) revealed that the
meaningful practice is to apply the concept of DiD by using existing traditional security
systems in conjunction with the BDA security system. In other words, the best defence
strategy for protecting CIIs is to have a BDA system that is capable of performing both batch
and real-time analytics. The idea is that the data generated by users (authorised or
not) should be stored and combined with real-time and static data from external sources. This
large and diverse volume of data can then be mined quickly (using a BDA system) for security
events. A model (see Figure 9) depicting this process is presented in section 5. This model
shows that the BDA system must be placed closest to the data item to be protected, followed
by the traditional security systems. The traditional security systems could also be organised
into multiple layers.



6.1 Research limitations

The main limitation of this study was a lack of resources in terms of finance, clout and time
during the data collection process. As already mentioned, the survey was created and
deployed with an online survey service. This online survey service provides a range of
audiences based on their backgrounds. The target audience selected on this
platform was technology, networking, telecoms and media professionals, but this process
proved to be rather challenging and yielded very few responses. The idea was to tap into the
audience database (for the general public) possessed by this online survey company, hoping
to attract only the people with the relevant knowledge (to complete the survey). However, despite
the backgrounds of the respondents, their responses showed limited knowledge and
experience of big data analytics.

To get around this problem of limited responses and lack of relevant knowledge among the
respondents, specific audiences of security professionals were targeted. To do this, web links to
the survey were posted in specialised information security groups on a business-oriented
social networking platform (i.e. LinkedIn). Also, emails (with the web links to the survey) were
sent to about 50 specific cyber security firms. With these two approaches, we managed to
collect data from more respondents than the required sample size.

Due to the nature of this research, which involves large organisations and a technology still in
its infancy, it was anticipated that such problems of lack of knowledge and limited responses
would occur. Therefore, although there is not much one can do to get more responses (especially
if one lacks resources), to ensure a certain level of accuracy, the questions were designed in
a way that would inform the researcher as to whether responses should be trusted or not.

For example, looking at the responses for Q3 (see Table 8), a substantial number of
respondents have ‘other’ credentials, so it was necessary to examine what their responses were.
Upon examination, it turned out that some of them had a background that could indicate some
knowledge of BDA systems, but the rest of their responses showed limited
knowledge of this field.

Hence, given these limitations, one can conclude that the results could have been more
accurate and more conclusive if the targeted communities of BDA and security
professionals had been approached directly, that is, by getting them to complete the
questionnaires through direct interviews.

6.2 Future research

This research, at the very least, has laid out the theoretical foundation of how one can
measure the success rate, and thus the effectiveness, of using BDA for security. Also, a model (or
method) of meaningful practice for deploying BDA systems as part of a DiD strategy has been
produced. However, due to the limitations of the survey (see section 6.1), a future study could
focus on improving the data collection process, thus establishing unequivocally the
effectiveness of using BDA systems for cyber security.

Also in this research, it was determined that a range of BDA technologies have been and are
being developed for cyber security. Future research could take an in-depth look at these
technologies, measure and rank their effectiveness, and describe the kind of
organisation each is best suited for. In other words, a comparative study of the various
technologies can be conducted. This will make it easier for organisations to know what to do
if they wish to upgrade their security systems to BDA systems.

Recently, MIT scientists have developed an AI-enabled BDA system that is capable of

detecting (with some human help) cyber-attacks three times more effectively than current

systems (Shead, 2016). Another study could look more closely at this AI system in order to

determine how it is deployed, and whether it is cost effective. The study could seek to answer

questions such as: what kind of cyber-attacks does this system detect? Is it able to detect

stealthy attacks? What is the speed of detection? Is this the only system of its kind or are

there others? Also, another issue, related to this concept of AI, that can be explored is the

practicality and implications of fully automating cyber-security operations.



7.0 References

1. Bartlett, J., Kotrlik, J. and Higgins, C. (2001). Organizational Research: Determining

Appropriate Sample Size in Survey Research. Information Technology, Learning, and

Performance Journal, 19(1).

2. Brewer, R. (2015). Cyber threats: reducing the time to detection and response. Network

Security, 2015(5), pp.5-8.

3. Cardenas, A., Manadhata, P. and Rajan, S. (2013). Big Data Analytics for Security. IEEE

Security & Privacy, 11(6), pp.74-76.

4. Cavelty, M. and Suter, M. (2012). The Art of CIIP Strategy: Taking Stock of Content and

Processes. Centre for security studies, pp.27 - 36.

5. Cloud Security Alliance, (2013). Big Data Analytics for Security Intelligence. [online]

Cloud Security Alliance. Available at:

https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_S

ecurity_Intelligence.pdf [Accessed 24 Apr. 2016].

6. Creswell, J. (2014). Research design: Qualitative, quantitative, and mixed methods

approaches. 4th ed. California: Sage, pp.3 - 24, 155 - 182.

7. Dugan, K. (2014). Regulator sees cyber attacks on banks causing ‘Armageddon’. [online]

New York Post. Available at: http://nypost.com/2014/09/22/regulator-sees-cyber-

attacks-on-banks-causing-armageddon/ [Accessed 5 Mar. 2016].

8. Elgendy, N. and Elragal, A. (2014). Big Data Analytics: A Literature Review Paper.

Springer, 8557, pp.214 - 227.

9. Eschelbeck, G. (2014). Smarter, Shadier, Stealthier Malware. Security Threat Report.

[online] Sophos. Available at: https://www.sophos.com/en-

us/medialibrary/PDFs/other/sophos-security-threat-report-2014.pdf [Accessed 21 May

2016].

10. Everett, C. (2015). Big data – the future of cyber-security or its latest threat?. Computer

Fraud & Security, 2015(9), pp.14-17.

11. Hadoop.apache.org. (2016). Welcome to Apache™ Hadoop®!. [online] Available at:

http://hadoop.apache.org/ [Accessed 9 May 2016].

12. Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li, (2014). Toward Scalable Systems

for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, pp.652-687.



13. Hashem, I., Yaqoob, I., Anuar, N., Mokhtar, S., Gani, A. and Ullah Khan, S. (2015). The rise

of “big data” on cloud computing: Review and open research issues. Information

Systems, 47, pp.98-115.

14. http://www.stat.ufl.edu/. (n.d.). Standard Normal Probabilities. [online] Available at:

http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf [Accessed 19 Jun. 2016].

15. Hurst, W., Merabti, M. and Fergus, P. (2014). Big Data Analysis Techniques for Cyber-

threat Detection in Critical Infrastructures. 2014 28th International Conference on

Advanced Information Networking and Applications Workshops.

16. Information Security Forum, (2012). Data Analytics for Information Security: From

hindsight to insight. London: Information Security Forum Ltd, pp.1 - 3.

17. Internet Security Threat Report. (2016). ISTR. [online] California: Symantec. Available at:

https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf

[Accessed 14 May 2016].

18. Krishnan, R. (2016). NSA Data Center Experiencing 300 Million Hacking Attempts Per

Day. [online] The Hacker News. Available at: http://thehackernews.com/2016/02/nsa-

utah-data-center.html [Accessed 1 Mar. 2016].

19. Lee, T. (2014). The Sony hack: how it happened, who is responsible, and what we've

learned. [online] Vox. Available at: http://www.vox.com/2014/12/14/7387945/sony-

hack-explained [Accessed 19 May 2016].

20. Leung, W. (2001). How to design a questionnaire. Student BMJ, [online] 9. Available at:

http://www.cochrane.es/files/Recursos/How_to_design_a_questionnaire.pdf [Accessed

5 Jun. 2016].

21. McLaughlin, K., Sezer, S., Smith, P., Ma, Z. and Skopik, F. (2014). PRECYSE: Cyber-attack

Detection and Response for Industrial Control Systems. [online] Available at:

http://precyse.eu/downloads/ [Accessed 7 Apr. 2016].

22. Mueller, R. (2012). Combating Threats in the Cyber World: Outsmarting Terrorists,

Hackers, and Spies. [online] FBI. Available at:

https://www.fbi.gov/news/speeches/combating-threats-in-the-cyber-world-

outsmarting-terrorists-hackers-and-spies [Accessed 21 May 2016].

23. Open.edu. (2014). [online] Available at:

http://www.open.edu/openlearnworks/mod/resource/view.php?id=52658 [Accessed 5

Jun. 2016].



24. Polak, K. (2016). Keeping European datacentres safe from cyber attacks. [online]

ComputerWeekly. Available at: http://www.computerweekly.com/feature/Keeping-

European-datacentres-safe-from-cyber-attacks [Accessed 1 Mar. 2016].

25. PRECYSE. (2012). [online] Available at: http://precyse.eu/overview/ [Accessed 7 Apr.

2016].

26. Rivera, J. (2014). By 2016, 25 Percent of Large Global Companies Will Have Adopted Big

Data Analytics For At Least One Security or Fraud Detection Use Case. [online]

Gartner.com. Available at: http://www.gartner.com/newsroom/id/2663015 [Accessed

23 Mar. 2016].

27. Rushton, K. (2014). Cyber-criminals could spark next financial crisis. [online]

Telegraph.co.uk. Available at:

http://www.telegraph.co.uk/finance/newsbysector/banksandfinance/11156260/Cyber-

criminals-could-spark-next-financial-crisis.html [Accessed 5 Mar. 2016].

28. Shackleford, D. (2016). Using Analytics to Predict Future Attacks and Breaches. [online]

SANS Institute. Available at:

http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/sans-using-analytics-

to-predict-future-attacks-breaches-108130.pdf [Accessed 28 May 2016].

29. Shead, S. (2016). MIT scientists have built an AI that can detect 85% of cyber attacks —

but it still needs human help. [online] Business Insider. Available at:

http://uk.businessinsider.com/mit-scientists-build-ai-that-can-detect-85-of-cyber-

attacks-2016-4 [Accessed 27 Jul. 2016].

30. Silver-Greenberg, J., Goldstein, M. and Perlroth, N. (2016). JPMorgan Chase Hacking

Affects 76 Million Households. [online] DealBook. Available at:

http://dealbook.nytimes.com/2014/10/02/jpmorgan-discovers-further-cyber-security-

issues/?_php=true&_type=blogs&_r=1 [Accessed 5 Mar. 2016].

31. Skopik, F., Friedberg, I. and Fiedler, R. (2014). Dealing with advanced persistent threats

in smart grid ICT networks. ISGT 2014.

32. Storm.apache.org. (2015). Apache Storm. [online] Available at: http://storm.apache.org/

[Accessed 14 May 2016].

33. Sullivan, D. (2016). Introduction to big data security analytics in the enterprise. [online]

SearchSecurity. Available at: http://searchsecurity.techtarget.com/feature/Introduction-

to-big-data-security-analytics-in-the-enterprise [Accessed 4 Jun. 2016].



34. Tankard, C. (2012). Big data security. Network Security, 2012(7), pp.5-8.

35. Trend Micro, (2015). Report on Cybersecurity and Critical Infrastructure in the Americas.

[online] Trend Micro Inc. Available at: http://www.trendmicro.com/cloud-

content/us/pdfs/security-intelligence/reports/critical-infrastructures-west-

hemisphere.pdf [Accessed 23 Apr. 2016].

36. Tsai, C., Lai, C., Chao, H. and Vasilakos, A. (2015). Big data analytics: a survey. Journal of

Big Data, 2(1).

37. Ullveit-Moe, N., Gjosaeter, T., Assev, S., Koien, G. and Oleshchuk, V. (2013). Privacy

Handling for Critical Information Infrastructures. [online] Available at:

http://precyse.eu/downloads/ [Accessed 7 Apr. 2016].

38. van Kessel, P. and Allan, K. (2014). Get ahead of cybercrime. EY's Global Information

Security Survey. [online] EYGM. Available at:

http://www.ey.com/Publication/vwLUAssets/EY-global-information-security-survey-

2014/$FILE/EY-global-information-security-survey-2014.pdf [Accessed 19 May 2016].

39. Virvilis, N. and Gritzalis, D. (2013). The Big Four - What We Did Wrong in Advanced

Persistent Threat Detection?. 2013 International Conference on Availability, Reliability

and Security.

40. Vom Brocke, J., Simons, A., Niehaves, B., Riemer, K., Plattfaut, R. and Cleven, A. (2009).

RECONSTRUCTING THE GIANT: ON THE IMPORTANCE OF RIGOUR IN DOCUMENTING THE

LITERATURE SEARCH PROCESS. ECIS, 161, pp.1 - 14.

41. Williams, R. (2016). BT broadband suffers major outage across UK. [online] The

Telegraph. Available at: http://www.telegraph.co.uk/technology/2016/02/02/bt-

broadband-suffers-major-outage-across-uk/ [Accessed 16 Apr. 2016].

42. Zetter, K. (2016). Sony Got Hacked Hard: What We Know and Don’t Know So Far. [online]

WIRED. Available at: http://www.wired.com/2014/12/sony-hack-what-we-know/

[Accessed 16 Apr. 2016].
