DEVELOPING A FRAMEWORK TO MITIGATE THE GROWING INCIDENTS OF CYBER-SECURITY THREATS ON PROCESS CONTROL NETWORKS (PCN): A CASE STUDY OF
THE PETROCHEMICAL INDUSTRY
A Dissertation Presented to
The Engineering Institute of Technology
by
Abimbola Ogunlade
In Partial Fulfillment of the Requirements for the Degree
Master of Engineering in INDUSTRIAL AUTOMATION
JUNE 2017
COPYRIGHT © 2017 BY ABIMBOLA OGUNLADE
DEDICATION
This thesis is dedicated to my loving wife, parents and my three sons: Tobi,
Ayo and Kunle Ogunlade.
ACKNOWLEDGMENT
First and foremost, I would like to thank God Almighty, the Creator of
Heavens and Earth, for without Him, none of this would be possible.
I would like to express my profound gratitude to my supervisor Dr. Hadi for
giving me the opportunity to work under his professional supervision. His
motivation and guidance were admirable and inspiring. I am sincerely
grateful for the effort and time he took in providing valuable advice and comments
on the entire thesis, particularly the subject of machine learning which is an area of
interest I may consider for my PhD research.
Finally, I must express my very profound appreciation for my loving wife Dr
Prudence Ogunlade for providing me with unfailing support and continuous
encouragement throughout my years of study. This achievement would not have been
possible without her emotional support. Thank you.
SUMMARY
Cyber security threats are emerging at a more rapid rate than ever. Cyber
exploits and incidents relating to Process Control Networks (PCN) are becoming
more prominent and more sophisticated in the way they are orchestrated.
Many petrochemical organizations simply cannot keep up with the pace of cyber
threats, as attack defenses appear obsolete almost as soon as they are implemented.
A framework to mitigate the increasing incidents of cyber security exploits in
the petrochemical industry is proposed. The framework
involves a hybrid approach of using machine learning predictors together with
a traditional risk management strategy, where the machine learning-based intrusion
detectors play a central role. Several machine learning
algorithms are reviewed and conceptualized. Using the WEKA machine learning
software to analyze captured data, three classifiers were applied, namely the
Random Tree, Multilayer Perceptron and Naïve Bayes algorithms.
The analysis identified the key reference features in the captured data. The
“protocol” and “length” data packet features were
used as reference indicators, and the Random Tree model achieved a classification
(detection) accuracy of about 98% under both 10-fold cross-validation and
percentage split analysis. Meanwhile, the Multilayer Perceptron and Naïve
Bayes models obtained classification accuracies of 97% and 94%, respectively, when the
analyses were conducted on the same set of captured data.
Evidently, the use of machine learning has demonstrated the considerable
potential of active learning, in which machine learning works in
collaboration with a PCN administrator, for the detection and analysis of data traffic
within the PCN environment.
TABLE OF CONTENTS
Dedication
Acknowledgment
List of Abbreviations
List of Figures
List of Tables
Chapter 1. Research Introduction
1.1 Introduction
1.2 Problem Statement and Substantiation
1.3 Research Aims and Objectives
1.4 Expected Outcomes and Deliverables
1.5 Research Methodology
1.6 Provisional Table of Contents
Chapter 2. Cyber Security: A growing threat in Oil and Gas Industry
2.1 Introduction
2.2 Understanding Cyber risks in the Oil and Gas
2.3 Cyber incidents targeting Energy sector
2.3.1 Survey of Cyber Incidents on PCN Infrastructure
2.3.1.1 Incident 1: GAZPROM pipeline Incident
2.3.1.2 Incident 2: Cyber-Attack on Saudi Aramco
2.3.1.3 Incident 3: The Stuxnet incident at Iranian Nuclear Plant
2.3.1.4 Incident 4: Davis-Besse Nuclear power plant incident
2.4 Growing trends of Cyber-attack
2.4.1 Cyber-Attack Methodology
2.4.1.1 Advanced Persistent Threats
2.4.2 Methods of Operation
2.5 Cyber Security Risks Frameworks
Chapter 3. Machine Learning: A solution to mitigate cyber threats
3.1 Introduction to ML
3.2 Basic ML Methods
3.3 ML in Cyber security
3.4 Understanding ML Models
3.4.1 Overview of Decision Tree
3.4.2 Overview of Neural Networks
3.4.3 Overview of Naïve Bayes
Chapter 4. Empirical Investigation
4.1 Introduction
4.2 Creating the data sets
4.3 Feature extraction of data (CSV format)
4.4 Combine and convert data (ARFF format)
4.5 Interpretation of results and findings
4.5.1 Verification and Validation
4.6 Using active learning method of machine learning for Cyber security intrusion detection for PCN application
Chapter 5. Conclusion
5.1 Conclusion and Recommendations
Appendix A. Naïve Bayes (Cross Validation)
Appendix B. Naïve Bayes (Percentage Split)
Appendix C. Multilayer Perceptron (Cross Validation)
Appendix D. Multilayer Perceptron (Percentage Split)
Appendix E. Random Tree (Cross Validation)
Appendix F. Random Tree (Percentage Split)
References
LIST OF ABBREVIATIONS
PCN Process Control Networks
SCADA Supervisory Control and Data Acquisition
WEKA Waikato Environment for Knowledge Analysis
ICT Information and Communication Technologies
IoT Internet of Things
MBR Master Boot Records
PPC Plant Process Computer
SPDS Safety Parameter Display System
APT Advanced Persistent Threat
ICS Industrial Control Systems
NIST National Institute of Standards and Technology
NISCC National Infrastructure Security Coordination Center
ML Machine Learning
AI Artificial Intelligence
UEBA User and Entity Behavioral Analytics
AV Antivirus
DLP Data Loss Prevention
API Application Programming Interface
LAN Local Area Network
LIST OF FIGURES
Figure 2.1 Cyber Incidents by Sector: Fiscal Year 2012
Figure 2.2 Intruder knowledge versus attack sophistication
Figure 2.3 Attack tree for a MODBUS-based SCADA system
Figure 2.4 An attack step
Figure 2.5 Fiscal Year 2014 incidents reported by access vector (245 total) (The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT), 2014)
Figure 2.6 The ISO 31000:2009 risk management process
Figure 2.7 Generic SCADA hardware architecture
Figure 3.1 The four layers of Data mining and Machine learning
Figure 4.1 Based on the generated data ML models are created that can classify packets.
Figure 4.2 Wireshark screenshot during packet capture
Figure 4.3 Combine and convert data (ARFF format)
Figure 4.4 Active Learning Process
Figure 4.5 Flowchart for Active Learning classification
LIST OF TABLES
Table 4.1 Class Categorization
Table 4.2 WEKA output results – Classifiers comparison analysis
Table 4.3 WEKA output results – Confusion matrix analysis
CHAPTER 1. RESEARCH INTRODUCTION
Chapter 1 introduces the research study. The research context is described to
provide background for the research. The goals of the research, and the research
methodology adopted are presented. Finally, the thesis organization is outlined.
1.1 Introduction
Petrochemical industries are not new to uncertainty and risk. Their increasing
dependence on technology and web-based communication has opened the door to
cyber security threats, particularly in the oil and gas industry. These are significant
threats, such as terrorism against hydrocarbon installations, which can cause plant shutdowns
resulting from sabotage and interruption of utilities. With the oil and gas sector
driving every aspect of our daily life, the protection of this critical infrastructure has
never been more important. The consequences of attacks on the operations and
systems that power such a lifestyle cannot be underestimated.
In recent years, industrial cyber security threats have grown from the esoteric
practice of a few specialists to a problem of general concern. All stakeholders now
have a new responsibility in promoting the safety, reliability, and stability of critical
industrial infrastructure. With the rising threat of malware in today’s open computing
platforms, the typical Process Control Network (PCN) is increasingly vulnerable to
outside modification and exploitation. Cyber-attacks on plant-automation systems
have not only increased, but have also grown more sophisticated in recent years. From
targeted information gathering and theft to elimination of crucial data, these intrusions
represent a real and present danger to plant productivity, reliability, and safety.
Companies in the oil and gas, refining, petrochemical, and power-generation
industries, among others, must avert and mitigate cyber security threats that expose
their production operations, including risks to plant infrastructure, equipment,
personnel, and the environment. This includes taking certain proactive steps to protect
critical facilities from cyber intrusion. Taking those steps requires an understanding of
current and future cyber security risks, past incidents in process sectors, and
knowledge of ever-changing security challenges. Because attacks on PCNs are now
more frequent and increasingly sophisticated, defensive strategies must evolve to keep
up. This research explores effective methods with a view to developing a
comprehensive framework that will mitigate the increasing incidents of
cyber security exploits in the petrochemical industry.
This chapter introduces the research investigation. To set the background for
the research study, the research context is explored. The goals of the research and
research methodology adopted are then presented. Finally, the thesis organization is
outlined.
1.2 Problem Statement and Substantiation
The number of cyber incidents targeting energy and petrochemical
infrastructures has significantly increased over the last few years. Technically, the
cyberspace and its underlying infrastructure are vulnerable to a wide range of risk
emanating from both physical and cyber threats and hazards. Between December
2011 and June 2012, 23 gas pipeline companies were targeted by cyber spies where
confidential data were compromised and stolen. It should be noted that the stolen data
is crucial and sensitive, which could be used to sabotage gas pipelines. Per US
Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) 2013
2
data, 56% of 257 recorded cyber incidents targeted energy infrastructures, whereas
this number was 40% in 2012 [1]. Sophisticated cyber actors and nation-states exploit
vulnerabilities to steal information and money, armed with capabilities to disrupt,
destroy, or threaten the supply of essential services. Moreover, a growing concern is
the cyber threat to critical petrochemical infrastructure, which is increasingly subject
to sophisticated cyber intrusions that pose new risks. In 2014, the energy and
petrochemical sectors were identified as a main target for sophisticated threat actors
for a variety of reasons [2]. It has been estimated that about 80% of oil and gas
companies saw an increase in the number of successful cyber-attacks in 2015 alone.
Allied Business Intelligence (ABI) Research calculates that cyber security spending
on the oil and gas critical infrastructure will reach $1.87 billion by 2018 [3].
As information technology becomes increasingly incorporated with physical
infrastructure operations, there is increased risk for wide-scale or high-consequence
events that could cause harm or disrupt services upon which the economy and
the daily lives of millions of people depend. The PCN, which is a communications
network that is used to transmit instructions and data between control and
measurement units and Supervisory Control and Data Acquisition (SCADA)
equipment, is highly vulnerable to these risks. According to [4], cyberspace is
particularly difficult to secure due to a number of factors: the ability of malicious
actors to operate from anywhere in the world, the relationships between cyberspace
and physical systems, and the difficulty of reducing vulnerabilities and consequences
in complex cyber networks.
1.3 Research Aims and Objectives
The aim of this research is to analyze and assess the growing trend of cyber
security threats on process control networks (PCNs) with a view to formulating a
framework that can mitigate these threats. This research will be carried out with the
objectives of:
a. Assessing the rising incidents of cyber security exploits on PCNs in the
petrochemical industry.
b. Developing and formulating a framework that will mitigate these rising
occurrences using a hybrid approach.
Therefore, the above objectives present an opportunity to answer the following
research questions:
• What are the various cyber exploits possible in a PCN environment?
• How can these vulnerabilities be detected and prevented?
• Are the detection and prevention methods capable of adapting to the growing
sophistication of cyber intrusions?
1.4 Expected Outcomes and Deliverables
The number of attacks on industrial networks, particularly petrochemical
plants, is growing rapidly [5]. Attackers are gaining new skills that allow them to
bypass defenses that would probably have been effective just a few years ago.
Defensive strategies have also evolved, helping users keep their plants safe and their
critical information under control. However, these palliatives alone are not enough to
keep cyber intrusion at bay. Therefore, a comprehensive framework must be
developed as a matter of priority. This framework will include strategies that are
capable of mitigating ever-rising cyber incidents in the petrochemical industries.
This research will propose effective methods with a view to developing a
comprehensive framework that is capable of mitigating the
increasing incidents of cyber security exploits in the petrochemical industry. The
framework adopts a hybrid approach of using machine learning predictors
together with a traditional risk management strategy, where the machine learning-based
intrusion detectors play a central role.
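To make the idea of a machine learning-based intrusion detector more concrete, the following is a minimal sketch of a Naïve Bayes packet classifier evaluated with k-fold cross-validation. It is an illustration only and not the WEKA workflow used in this research: the packet records are synthetic, the traffic profiles, the "short"/"long" length buckets and all function names are assumptions introduced here, and only the Python standard library is used.

```python
import math
import random
from collections import Counter, defaultdict


def make_synthetic_packets(n=300, seed=7):
    """Generate hypothetical (protocol, length_bucket, label) rows; not real PCN data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.7:
            # Assumed "normal" profile: mostly MODBUS, short frames.
            proto = rng.choice(["MODBUS", "MODBUS", "TCP"])
            length = rng.randint(60, 120)
            label = "normal"
        else:
            # Assumed "attack" profile: mostly ICMP/UDP, longer frames.
            proto = rng.choice(["ICMP", "UDP", "TCP"])
            length = rng.randint(100, 1500)
            label = "attack"
        rows.append((proto, "short" if length < 128 else "long", label))
    return rows


def train(rows):
    """Count class priors and per-feature value frequencies."""
    priors = Counter(label for *_, label in rows)
    counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)      # feature index -> distinct values seen
    for *features, label in rows:
        for i, v in enumerate(features):
            counts[(i, label)][v] += 1
            values[i].add(v)
    return priors, counts, values


def predict(model, features):
    """Return the label with the highest Laplace-smoothed log-posterior."""
    priors, counts, values = model
    total = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label, n_label in priors.items():
        score = math.log(n_label / total)
        for i, v in enumerate(features):
            score += math.log((counts[(i, label)][v] + 1) / (n_label + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(rows, k=10, seed=0):
    """Pooled accuracy over k folds; every row is held out exactly once."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    correct = 0
    for fold in range(k):
        test_rows = rows[fold::k]
        train_rows = [r for i, r in enumerate(rows) if i % k != fold]
        model = train(train_rows)
        correct += sum(predict(model, r[:-1]) == r[-1] for r in test_rows)
    return correct / len(rows)


if __name__ == "__main__":
    print(cross_validate(make_synthetic_packets(), k=10))
```

The accuracy printed by this sketch reflects only how separable the synthetic profiles were made, so it should not be compared with the results reported for the captured data.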
1.5 Research Methodology
Analysis of literature and sources of information
This research work consulted the following sources for information, which
were used to carry out a comprehensive literature survey:
• Library sources (related books, etc.)
• Journal articles and publications.
• Newspapers, magazines, and reports (Oil and Gas or Petrochemical
proceedings, conferences, etc.)
• Theses and dissertations (reviewed related work and findings in this field of
study).
• Internet sources (search / metasearch engines: Lycos, Alta Vista, Copernic,
MetaCrawler and financial databases: Macgregor’s, etc.).
• Project reports (benchmarked the reports of similar projects implemented on
cyber security and machine learning).
• Waikato Environment for Knowledge Analysis (WEKA): machine learning
Group website at the University of Waikato [6].
• A collection of machine learning algorithms for data mining tasks was adopted,
with WEKA as the data mining software. However, due to legal and
confidentiality implications, it was impracticable to conduct this research in an
operational petrochemical facility. Therefore, Wireshark packet analyzer
software was used to provide a simulation of an actual PCN and to model the
different architecture layers of a PCN. Different exploit scenarios were
run to verify and validate the research outcome.
• Write outcomes and findings.
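The hand-off from a packet capture to WEKA can be sketched as a small conversion from CSV text to WEKA's ARFF format. This is a minimal sketch under stated assumptions: the column names `Protocol`, `Length` and `Class`, the relation name `pcn_packets` and the three sample rows are hypothetical, and the actual feature set and labelling used in this research may differ.

```python
import csv
import io


def csv_to_arff(csv_text, relation="pcn_packets"):
    """Convert rows with Protocol, Length and Class columns into ARFF text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Nominal attributes list their allowed values inside braces.
    protocols = sorted({r["Protocol"] for r in rows})
    classes = sorted({r["Class"] for r in rows})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute protocol {" + ",".join(protocols) + "}",
        "@attribute length numeric",
        "@attribute class {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for r in rows:
        lines.append(f"{r['Protocol']},{int(r['Length'])},{r['Class']}")
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, used only to exercise the conversion.
sample = """Protocol,Length,Class
TCP,66,normal
UDP,78,normal
ICMP,1514,attack
"""
arff = csv_to_arff(sample)
```

Writing `arff` to a file with an `.arff` extension gives a data set that WEKA can open; the class attribute is placed last, which is the position WEKA uses for the class by default.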
1.6 Provisional Table of Contents
The thesis is organized into the following chapters:
Summary: This section presents a short overview of the entire research
work.
Preface and acknowledgment: This section has been used for personal
comments about the conditions in which my research was conducted and about
persons, institutions and organizations that provided assistance.
Chapter 1 – Introduction: This section presents a comprehensive
background for my research work.
Chapters 2 and 3 – Literature Review: These chapters contain a detailed review of
the information gathered from different sources during my research. They cover
a detailed analysis of literature, journals, articles, textbooks and other related
or similar works.
Chapter 4 – Empirical Investigation/Interpretation of Results and
Findings: This section introduces the research concepts and methodology. The
research design and approaches used in the research have also been addressed. The
results from my empirical investigation are also stated and explained here.
Chapter 5 – Conclusion: Conclusions and recommendations are made based
on the results achieved from the entire research work.
References: This section lists all the relevant information sources used
throughout the research work.
Glossary and definitions of terms
Appendices
CHAPTER 2. CYBER SECURITY: A GROWING THREAT IN
OIL AND GAS INDUSTRY
This chapter explores the growing trend of Cyber security threats in the Oil
and Gas Industry and investigates the level of sophistication. Various impacts of these
incidents are also reported.
2.1 Introduction
The energy sector is diverse, including renewable energies, coal,
electrical and nuclear power, and oil and gas, and securing this infrastructure is
a daunting task. Although some specific considerations may apply to
the broader energy sector, different solutions are still used in the various sub-sectors.
This report essentially looks at the cyber security landscape for the oil and gas
industry. However, the discussion may also be relevant to the broader energy industry.
This chapter covers the following:
• Understanding Cyber risk in the Oil and Gas
• Cyber incidents targeting the Energy sector
• Growing trends of Cyber-attack
• Why PCN?
2.2 Understanding Cyber risks in the Oil and Gas Industry
According to [7], the oil and gas industry can be referred to as the exploration,
extraction, refining, processing, transport, distribution, and sale of petroleum and gas
products. Petroleum products may include fuel oil and gasoline, while natural gas is a
major source of electricity generation [7]. The installations and infrastructure to
support the oil and gas processes are often distributed across geographic areas. This is
often characterized by a high demand for distributed control systems used for remote
monitoring and control capabilities with exchange of real-time data. Although the
energy companies have complex industrial environments, their infrastructures are
underpinned by legacy control systems. These systems are vulnerable, and can be
highly susceptible to cyber-attacks if connected to modern information and
communication technologies (ICT) [7]. In my view, cyber security has become a
growing concern in the past decade with the development of sophisticated malware
targeting critical infrastructure. While the brunt of these threats has primarily fallen
on government, military, and financial institutions, the energy sector has not been
spared either. Oil and gas companies have been the target of widespread cyber
infiltration in the past few years, with hostile agents successfully stealing intellectual
property assets and valuable confidential information [8]. Perhaps more
disconcertingly, industrial control systems (ICS) in oil and gas installations are
increasingly coming under siege from cyber-attacks. Consequently, protection of the
energy infrastructure is not only urgent but fundamental. While the security of
physical structures has been well established for some time, a rising concern is with
cyber security. The energy industry is not exempt from the increasing connectivity of
modern organizations. Connecting online is not a choice for businesses today; it is a
requirement. According to [9], today’s oil and gas industry has evolved to become a
technologically-complex one. More and more processes are being digitized; data
mining and analytical programs are being used more frequently, and sensors are
everywhere. This may lead to more efficiency, but it also makes systems more
vulnerable to cyber-attacks. As competition in the industry intensifies and the
backlash of the economic downturn continues, energy companies are investing in the
latest technology to help cut costs and improve efficiency [7].
Since the advent of the Internet of Things (IoT), connectivity seems to make
SCADA and ICS vulnerable to cyber-attacks [10]. According to [10], with the growing use
of smart grid technology, more new energy systems are connected to the
so-called IoT, which in turn opens up new security vulnerabilities due to the sheer
number of connected systems and the low or nonexistent security often placed around
simple devices. Large energy producers and power plants typically employ
Supervisory Control and Data Acquisition (SCADA) systems in their networks. However,
SCADA systems seem to be the easiest targets for cyber hackers/terrorists. In the past, the
Industrial Control System (ICS) was isolated from the rest of the world. However,
the advent of advanced versions of Operating Systems (OS) and the internet seems to
have made information sharing and connectivity a necessary evil [10]. Hackers are
using tools such as ‘Metasploit’, which can assist in hacking anything from a small
webcam to a turbine control system or a tank management system. Phishing emails
are now easily reaching the computers of corporate executives and employees alike.
Human errors are exploited by hackers who are looking at ways and means to hack
and steal sensitive information.
2.3 Cyber incidents targeting the Energy sector
According to [10], energy companies are most vulnerable to cyber-attacks
during mergers and acquisitions. Mergers and acquisitions require complex
integration of information technology systems that may become susceptible to data
breaches and cyber exposures. Cyber security is often ignored in a merger
or acquisition, and as a result the companies involved may become susceptible to data
breaches and other cyber risks in future. International law firm Freshfields Bruckhaus
Deringer found in a survey shared with information security that 90% of respondents
believe cyber-breaches would result in a reduction in deal value; and 83% of
dealmakers believe a deal could be abandoned if cyber security breaches are
identified during a deal due diligence or mid-transaction [10]. Dealmakers’ top
concerns include targets suffering cyber-attacks during deal discussions, the target
being a proven victim of data or intellectual property (IP) theft by cyber-attack, and
evidence of a target not handling a past breach effectively (leading to fines, damage to
reputation, etc.). Interestingly, acquirers (30%) are most concerned about cyber
security issues derailing transactions, whereas 81% of sellers are unconcerned or only
slightly concerned about the risk of derailment. One of the biggest threats could be
unauthorized access to critical and proprietary information by malicious insiders
and/or outsiders. Data breaches could be another problem, since both companies would
hold large volumes of data and information which must be integrated.
2.3.1 Survey of Cyber Incidents on PCN Infrastructure
According to [11], the North American Electric Reliability Corporation
Critical Infrastructure Protection (NERC CIP) guideline 001-1 considers
“Disturbances or unusual occurrences, suspected or determined to be caused by
sabotage” as reportable incidents. This is necessary to provide useful guidelines for
this research work to determine what could be considered as PCN security incidents.
The following are survey analyses of several critical infrastructure cyber security
incidents that were reported in the energy sector. Each incident has been explained
with description and summary of the root causes. However, it must be noted that not
all the incidents identified were due to external threats.
2.3.1.1 Incident 1: GAZPROM pipeline Incident
Gazprom, a major Russian gas and oil producing
and transportation company, suffered a cyber-attack from hackers in 1991.
According to [12], the attack was carried out in collaboration with a Gazprom insider
(a disgruntled employee), who evidently helped a group of hackers to gain
access to and control of the computer systems of Gazprom [11]. The hackers were said
to have gained control of the central switchboard that controls gas flow in pipelines
by using a Trojan-Horse [11] [12]. A major part of these systems is responsible for
the transportation of gas through several pipelines across Europe. It is noteworthy that
these pipelines are of great importance and as such have been the subject of several
international disputes [11]. Below is the summary of the root cause findings:
• Ineffective or weak malware protection system;
• Interconnectivity between the PCN and the corporate business network;
• Inappropriate firewall rules for filtering of network traffic; and
• Remote access capability to critical PCN infrastructure.
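Two of the root causes above, the interconnection between the PCN and the corporate network and the inappropriate firewall rules, can be illustrated with a default-deny allowlist check. This is a minimal sketch and not a description of any real firewall product or of Gazprom's network: the PCN subnet, the single permitted flow and the function name `is_allowed` are hypothetical.

```python
import ipaddress

# Hypothetical PCN subnet; a real zone boundary would come from the site design.
PCN = ipaddress.ip_network("192.168.50.0/24")

# Explicit allowlist of (source network, destination host, TCP port).
# Anything destined for the PCN that is not listed here is denied.
ALLOW = [
    (ipaddress.ip_network("10.10.5.20/32"), ipaddress.ip_address("192.168.50.10"), 443),
]


def is_allowed(src, dst, port):
    """Default-deny check for traffic entering the PCN zone."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip not in PCN:
        return True  # this sketch only polices traffic destined for the PCN
    return any(
        src_ip in net and dst_ip == host and port == allowed_port
        for net, host, allowed_port in ALLOW
    )
```

For example, `is_allowed("10.10.5.20", "192.168.50.10", 443)` is permitted, while any other corporate host reaching a PCN address on any port is rejected because no rule names it.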
2.3.1.2 Incident 2: Cyber-Attack on Saudi Aramco
On 15 August 2012, the computer network of Saudi Aramco was struck by a
self-replicating virus that infected about 30,000 of its Windows-based computers [13].
Despite its vast resources as Saudi Arabia’s national oil and gas firm, Aramco,
according to [13], took almost two weeks to recover from this incident. According to
[13], viruses frequently appear on the networks of multinational firms, but it is
shocking that a cyber-attack of this scale was carried out against a company
infrastructure so critical to global energy markets. The virus, later identified as
Shamoon, caused considerable disruption to the world’s largest oil producer.
Shamoon is designed to indiscriminately delete critical data from computer
hard drives including the Master Boot Records (MBR), making the computer difficult
to boot up. According to [14], a group known as the “Cutting Sword of Justice” took
credit for the Saudi Aramco attack by posting a Pastebin message on the day of the
attack in 2012, and justified the attack as a measure against the Saudi monarchy.
Although this did not result in an oil spill, explosion or other major fault in Aramco
operations, the attack affected the business processes of the company, and it is
possible that some drilling and production information was also lost in this incident
[13]. According to many reports, Shamoon was alleged to have also spread to the
networks of other oil and gas firms, including that of RasGas [14]. The incident
came after years of advisories and warnings about the risk of cyber security attacks
against companies’ critical energy and economic infrastructure.
2.3.1.3 Incident 3: The Stuxnet incident at Iranian Nuclear Plant
In June 2010, an Iranian nuclear control systems facility located at Natanz was
infected with a worm known as Stuxnet [12]. Stuxnet is a computer worm designed to
allow hackers to attack industrial plants by changing the code in the systems it
controls. Stuxnet exploited four ‘zero-day vulnerabilities’, that is, flaws that had not
been publicly reported or announced before they were exploited, leaving the software’s
author with zero
days in which to create patches or advise workarounds to mitigate its actions [15]. The
worm exploited Siemens’ default passwords to access Windows operating systems
that run WinCC and PCS7 application programs [11]. The worm is capable of
locating frequency-converter drives designed by Fararo Paya in Iran and Vacon in
Finland. These drives are known to be used to power the centrifuges popularly used in
the concentration of the uranium-235 isotope. According to [12], Stuxnet distorted the
frequency of the electrical current to the converter drives causing them to oscillate
between high and low speeds for which they were not designed. Consequently, this
switching caused the centrifuges to fail at a higher than normal rate [12].
2.3.1.4 Incident 4: Davis-Besse Nuclear power plant incident
On January 25, 2003, an engineer at the Davis-Besse plant in Ohio used
a virtual private network connection to access the plant from his home [16]. Although
the connection was encrypted, his home computer was infected with the Slammer
worm, which then infected the nuclear plant’s computers, causing a key safety control system
to fail for nearly five hours. According to [11], the worm crashed the Safety
Parameter Display System (SPDS) and the Plant Process Computer (PPC).
Approximately four hours and fifty minutes were required to restore the SPDS and six
hours and nine minutes to restore the PPC [11].
The Slammer worm was designed to settle in the system memory and search
for other hosts to infect. Although the Slammer worm carries no malicious payload, it
is still capable of causing extensive disruption. It searches for new hosts by scanning
random IP addresses, which generates a large volume of spurious traffic,
consuming bandwidth and congesting the networks [16]. Below is the summary of the
root cause findings:
• Interconnectivity between the PCN and the corporate business network;
• No firewall between the PCN and the corporate business network; and
• Lack of regular Windows patch updates on machines within the PCN (the patch
for the MS SQL vulnerability that the Slammer worm exploited had already been
available for six months before the incident) [11].
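The scanning behaviour described above leaves a recognizable footprint: one source host contacting many distinct destinations on UDP port 1434, the SQL Server Resolution Service port that Slammer targeted. The following is a minimal sketch of how such a footprint could be flagged from flow records. The record layout, the function name find_scanners and the threshold are assumptions made for illustration; they are not taken from [11] or [16].

```python
from collections import defaultdict

SLAMMER_PORT = 1434  # UDP port of the SQL Server Resolution Service


def find_scanners(flows, port=SLAMMER_PORT, threshold=50):
    """Return source hosts that contacted at least `threshold` distinct
    destinations on the given UDP port.

    `flows` is an iterable of (src_ip, dst_ip, protocol, dst_port) tuples.
    This record layout is hypothetical and chosen only for the example.
    """
    targets = defaultdict(set)
    for src, dst, proto, dport in flows:
        if proto == "udp" and dport == port:
            targets[src].add(dst)
    return sorted(src for src, dsts in targets.items() if len(dsts) >= threshold)


if __name__ == "__main__":
    # One host sprays 100 different addresses; another sends a single query.
    flows = [("10.0.0.5", f"198.51.100.{i}", "udp", 1434) for i in range(1, 101)]
    flows.append(("10.0.0.9", "10.0.0.20", "udp", 1434))
    print(find_scanners(flows))
```

A check of this kind only detects the symptom; the root causes listed above (network interconnection, missing firewall, missing patches) still have to be addressed separately.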
2.4 Growing trends of Cyber-attack
The number of attacks faced by the energy industry is on the rise, according to
a survey by Tripwire [9]. The research revealed that 77% of respondents had seen an
increase in successful cyber-attacks in the past 12 months. According to a similar
survey that was published in an article on the Security Week in September 2015, the
systems of the U.S. Department of Energy were breached more than 150 times
between October 2010 and October [17] [18]. Furthermore, in November, a report
revealed that high profile cyber-attacks targeting the oil and gas industry will result in
a growth in security spending from $26.3 billion in 2015 to $33.9 billion by 2020.
The report further highlighted that 82% of oil and gas industry respondents said their
organizations registered an increase in successful cyber-attacks over the past 12
months [18]. Moreover, 53% of the respondents said that the rate of cyber-attacks has
increased between 50% and 100% over the past month [18]. The report further reveals
that the increase in attacks is horizontal across industries, but the data shows that
energy organizations are experiencing a disproportionately large increase when compared
to other industries [18]. Given this staggering revelation, energy organizations face
unique challenges in protecting Process Control Systems and SCADA assets. Figure
2.1 below depicts the breakdown of cyber incidents by sector in 2012.
Figure 2.1 Cyber Incidents by Sector: Fiscal Year 2012 [19].
According to [9], not only has the number of cyber security attacks increased,
the sophistication has also risen over time. Substantial evidence has shown that cyber
security threats have rapidly escalated since 2008, when the industry saw the first
nation-state attacks [9]. Meanwhile, invaluable assets like bid-lease data, seismic
markups and intellectual property were stolen from very large Oil and Gas companies
in those attacks. Further reports have revealed that the attacks have intensified in the
years since, possibly due to the geopolitical ramifications of natural resources, the
propagation of information technology convergence in the field (IoT), and the
specialized intellectual property that has been created in the drilling and production sector
[9]. Figure 2.2 below depicts the trend of cyber-attacks as projected between 1980 and
2010 [11].
Figure 2.2 Intruder knowledge versus attack sophistication [11].
The Founder and Principal ICS Security Consultant at Applied Risk
re-emphasized that the growing incidents have demonstrated that critical-infrastructure
companies must shift cyber security higher up the agenda. For instance, in April
2015, the U.S. Department of Energy warned of the risk of terrorism on ageing energy
infrastructure [17]. A few months later, it was discovered that its computer systems had
been the subject of continuous infiltration since 2010 [17]. He further stressed that,
over the coming years, incursions by nation states or terrorist adversaries will grow
exponentially as they hit nuclear facilities, power grids and oil and gas pipelines [17].
Clearly, the driving force behind these attacks is economic and strategic gain [17].
An attack campaign against control systems known as Energetic Bear (also
called ‘Crouching Yeti’) is particularly relevant because it demonstrates how
commonly its attack mechanisms have been used [20]. According to [20], the Russian
security software vendor Kaspersky Lab published an in-depth report that claims that
Energetic Bear attacks have successfully exploited more than 2,800 victims including
some 100 organizations in the United States, Spain, Japan, Germany, France, Italy,
Turkey, Ireland, Poland, and China. While Energetic Bear is wide in scope,
researchers at security firm Symantec discovered that as early as March 2014, the
group shifted its focus onto energy firms, with half of the targets in energy and 30%
in energy control systems [20]. Meanwhile, Symantec released an even more staggering
report suggesting that Energetic Bear attacks against control systems were
successful to the extent that they “could have caused damage or disruption to energy
supplies in affected countries” and those targets included “energy grid operators,
major electricity generation firms, petroleum pipeline operators, and energy industry
industrial control system equipment manufacturers” [20].
2.4.1 Cyber-attack Methodology
There have been several investigations on how cyber-attackers such as those
associated with the Energetic Bear campaign managed to successfully compromise
control systems of so many companies. According to [20], evidence has shown that
the Energetic Bear attacks were conducted using commonly known and easily
executable attack methods against system vulnerabilities that were common
knowledge. In many cases, the attackers used variants of a well-known piece of
malicious software known as the Havex Trojan. Metasploit, a free tool that requires
almost no programming skills to operate, was in frequent use as well [20]. For
instance, Figure 2.3 shows the attack tree for exploiting PCN or SCADA MODBUS
devices.
Figure 2.3 Attack tree for a MODBUS-based SCADA system [21].
According to [21], attack trees are used to assess vulnerabilities in SCADA
and PCN systems based on MODBUS and MODBUS/TCP communication protocols.
An attack tree provides a structured view of events leading to an attack and,
ultimately, helps with the identification of appropriate security countermeasures.
Risk, according to [21], depends on the following:
1. System architecture and conditions;
2. Countermeasures in place;
3. Attack difficulty;
4. Detection probability; and
5. Attack cost.
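To make the attack-tree idea concrete, the sketch below models a tree with AND and OR gates and computes the cheapest path to the root goal, which relates to the "attack cost" factor listed above. The node names and cost figures are hypothetical and are not taken from Figure 2.3 or from [21]; the rule used (an OR gate costs as much as its cheapest child, an AND gate costs the sum of its children) is a common simplification, stated here as an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    gate: str = "LEAF"   # "LEAF", "AND" or "OR"
    cost: float = 0.0    # attacker effort for a leaf, in arbitrary units
    children: List["Node"] = field(default_factory=list)


def min_attack_cost(node: Node) -> float:
    """Cheapest way for an attacker to achieve the goal at `node`."""
    if node.gate == "LEAF":
        return node.cost
    child_costs = [min_attack_cost(child) for child in node.children]
    if node.gate == "AND":   # every sub-goal must be achieved
        return sum(child_costs)
    if node.gate == "OR":    # any one sub-goal is sufficient
        return min(child_costs)
    raise ValueError(f"unknown gate type: {node.gate}")


# Hypothetical tree for illustration only.
tree = Node("Write to MODBUS device", "OR", children=[
    Node("Remote attack", "AND", children=[
        Node("Reach PCN from business network", cost=5.0),
        Node("Send unauthenticated write command", cost=1.0),
    ]),
    Node("Physical access to device", cost=20.0),
])

print(min_attack_cost(tree))  # 6.0: the remote path is cheaper than physical access
```

A countermeasure such as a firewall between the PCN and the business network would be represented by raising the cost of the corresponding leaf, which shows directly whether it changes the cheapest attack path.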
Figure 2.4 below shows the pictorial steps involved in a cyber-attack.
Figure 2.4 An attack step [21].
According to [20], malicious code associated with the Energetic Bear attack
campaign was distributed using several primary methodologies including
“spear-phishing” and “waterholing” attacks as well as compromised SCADA software
updates. Spear-phishing, as defined by [22], is an exploratory attack carried out by
sending an email with a malicious link or attachment to a targeted list of users. At first
glance, this may sound similar to the spam that arrives in an inbox every day.
The important difference to note is that these emails are sent to a very specific
set of individuals that the attackers typically know a good deal about. Consequently,
they can be constructed in a manner that makes them seem much more legitimate than
random spam. By mining social networks for personal information about targets, an
attacker can write emails that are extremely accurate and compelling. Once the target
clicks on a link or opens an attachment, the attacker establishes a foothold in the
network, enabling them to complete their illicit mission. Spear-phishing is the most
common delivery method for advanced persistent threat (APT) attacks [22].
Another technique commonly used by hackers is known as waterholing.
Waterholing refers to when threat actors compromise a carefully-selected website by
inserting an exploit resulting in a malware infection. In the case of Energetic Bear,
attackers simply exploited the websites of control system manufacturers where system
updates are maintained. By replacing legitimate updates on these sites with copies that
contained malicious software code, hackers could ensure that their targets would
infect their own systems. It is, however, important to note that this technique can work
even if the target control system is standalone, that is, not connected
to any external network.
In fiscal year 2014, ICS-CERT observed greater variety in the characteristics
of the reported incidents as depicted in Figure 2.5 below [23]. Whereas
spear-phishing was still a popular infection vector because of its effectiveness, a wider variety
of techniques was reported that year.
Figure 2.5 Fiscal Year 2014 incidents reported by access vector (245 total) [23].
While ICS-CERT has previously observed strategic watering hole attacks, a
new technique uses trojanized software installers at various vendors’ sites to install
malware on the unsuspecting user’s network along with the software update. Many of
the victims were unaware they were compromised. As expected, social engineering
continued to be a popular attack method, enhanced by using social media. In some
cases, this yielded greater success for attackers.
These attack methodologies are consistent with the recently published Cisco
Annual Security Report that reveals that hackers have increasingly shifted their focus
from seeking to compromise servers and operating systems to seeking to exploit
computer users at the browser and email levels [20].
2.4.1.1 Advanced Persistent Threats
According to [24], operational technology relies on obsolete security models
based on unfounded assumptions. Although there have been catastrophic
cyber-attacks on ICSs, a bigger and perhaps even more prominent challenge for owners and
operators of control systems is the viruses, spyware, and malware that migrate from IT
systems to control systems [24]. Viruses are accidentally introduced to control
systems every day through engineers’ laptops, websites, emails, removable drives,
and external computers that for some reason are interconnected with the control
system. These infections are more of an aggravation than a real danger to the system,
but they cause delays, shutdowns, and other problems every day. The general scare stems
from catastrophic attacks that may or may not happen. However, the daily struggle is
with viruses and malware, as these often look like software errors, and dealing with them
is costly and causes unplanned downtime [24]. It is often difficult for an engineer to
see the difference between a virus and a software error when the equipment is
malfunctioning. Cyber-attacks lead to increased processor and memory usage on the
attacked host and may cause heat generation, which also can lead to software errors or
equipment hardware failure. However, it is often difficult to diagnose the real
problem whenever there is an attack of this nature. For instance, it has only recently
been disclosed that hackers blew up a segment of a Turkish oil pipeline in 2008 [24].
In the control room, the operator’s console showed that everything was nominal
before a phone call from the field triggered the alarm. Furthermore, the dangerous
dimension about this attack was that the attackers also manipulated the CCTV feed to
the control room, covering up what was happening at the site [24]. According to [24],
there are some similarities in the attack methodology when compared with the
Stuxnet incident in 2010, where operator consoles showed normal operations when
the centrifuges of the Iranian Natanz nuclear facility were running at such high speeds
that they were destroyed. It was further alleged that Stuxnet was already resident in
the attacked control system for a few years before the attack took effect [24]. The
attack set the Iranian nuclear facility back several years. Several questions remain
as to why Stuxnet was successful and why an attack of that magnitude was not
detected. According to [24], the Stuxnet incident has since been
categorized as an APT. These attacks have a specific target in mind and are advanced
as they have a high level of coordinated human involvement to monitor and control
the attack using one or more control centers. The persistent part of the attack refers to
the capability of the attack to remain invisible to the target for as long as possible with
priority to complete its mission and get out of the attacked system undetected. APTs
use deep system and attack knowledge to ensure a covert operation. APTs have three
things that the system owner does not have: people, money, and time. The attack
program could be removed if it is discovered or it might also be programmed to
destroy itself. This means that the attack leaves very few traces on the attacked
system.
2.4.2 Methods of Operation
According to [12], there are several methods by which a perpetrator could exploit a
PCN to carry out an attack. These are summarized below:
• Misuse of Resources: Unauthorized use of IT resources. Examples include storing
unauthorized files on a server and using a site as a springboard for further
unauthorized activity.
• User Compromise: Perpetrator gains unauthorized use of user privileges on a
host.
• Root Compromise: Perpetrator gains unauthorized administrator privileges
on a host.
• Social Engineering: Gaining unauthorized access to privileged information
through human interaction and targeting people’s minds rather than their
computers.
• Virus: A virus is a piece of code that, when run, will attach itself to other
programs, which will again run when those programs are run.
• Web Compromise: Using vulnerabilities in a website to perform an attack.
• Trojan: A Trojan is a program that adds subversive functionality to an
existing program.
• Worm: A program that propagates itself by attacking other machines and
copying itself to them.
• Recon: Scanning/probing site to see what services are available. Determining
what vulnerabilities exist that may be exploited.
• Denial of Service: An exploit whose purpose is to deny somebody the use of
the service, namely to crash or hang a program or the entire system.
Michael Bell, President, CEO and Member of the Board of Directors, Silver
Spring Networks, asserted in [20] that in dealing with cyber threats to energy systems,
companies not only struggle to assess the risk but also often fail to develop the
in-house tools to understand their own response. He further stressed that “Everyone is
rushing to adopt technologies but standards need to be used and best practices need to
be implemented” [20]. Similarly, O.H. Dean Oskvig, Vice Chair for North America,
World Energy Council and President and CEO, B&V Energy, noted in [20]: “There
are two types of companies: ones that have been hacked and the other ones that don’t
know they’ve been hacked.” He noted that most energy infrastructure was designed
before modern IT tools and systems. Security to protect this infrastructure tends to
focus on physical defenses at the expense of addressing cyber threats [20].
2.5 Cyber Security Risks Frameworks
It may appear that the likelihood of catastrophic cyber-attacks on SCADA and
PCN systems is comparatively low. This may lead to a false sense of security if we
overlook two key points. Foremost, considering the total number of attacks, it is
worth mentioning that only a small number of cyber security incidents are reported.
According to [21], only a small fraction of actual cyber events occurring are reported
and documented into the traditional business crime reporting database.
Therefore, developing a cyber security risk assessment methodology involves
providing a platform for enterprise-wide cyber security awareness and risk analysis.
According to the National Institute of Standards and Technology (NIST) framework
methodology, risk assessment has two parts, namely conformance assessment and risk
analysis; both must be in place to ensure a preventative approach to mitigating cyber
threats.
The process involved in risk management, as highlighted in [21], is depicted in
Figure 2.6. The following are the steps involved in the risk management framework:
1. Risk management – This involves the coordination of activities to direct and
control an organization with regard to risk [21].
2. Risk assessment – This step involves the overall process of risk identification,
risk analysis and risk evaluation [21].
3. Risk identification – This is the process of finding,
recognizing and describing risks [21].
4. Risk analysis – This process entails comprehending the nature of risk and
determining the level of risk [21].
5. Risk evaluation – This is the final stage, in which the
results of risk analysis are compared with risk criteria to determine whether the risk and its
magnitude are acceptable or tolerable [21].
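The identification, analysis and evaluation steps above can be illustrated with a small sketch. It assumes a simple qualitative scheme in which the level of risk is the product of likelihood and consequence on 1 to 5 scales, and a single tolerability threshold acts as the risk criterion. The scales, the threshold and the register entries are assumptions for illustration only; neither [21] nor ISO 31000 prescribes them.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int   # 1 (rare) to 5 (almost certain); assumed scale
    consequence: int  # 1 (negligible) to 5 (catastrophic); assumed scale


def analyze(risk: Risk) -> int:
    """Risk analysis: derive a level of risk from likelihood and consequence."""
    return risk.likelihood * risk.consequence


def evaluate(risk: Risk, tolerable_level: int = 9) -> str:
    """Risk evaluation: compare the analyzed level with the risk criterion."""
    return "tolerable" if analyze(risk) <= tolerable_level else "needs treatment"


# Risk identification: a hypothetical register for a PCN.
register = [
    Risk("Unpatched HMI workstation reachable from business network", 4, 5),
    Risk("Shared password on a read-only historian view", 2, 3),
]

# Print the register with the highest risk level first.
for r in sorted(register, key=analyze, reverse=True):
    print(f"{analyze(r):>2}  {evaluate(r):<15}  {r.description}")
```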
Figure 2.6 The ISO 31000:2009 risk management process [25].
According to [21], a review of the state of the art in risk assessment of
SCADA or PCN systems is urgently required to form a new categorization scheme for
risk assessment methods.
There are several risk models and frameworks that directly attempt to
address the cyber security challenges in SCADA and PCN environments, each
with a varying degree of effectiveness. In 2004, NIST released a publication
pertaining to the risks and objectives of SCADA and PCN systems [21]. Similarly, in 2005 the
National Infrastructure Security Coordination Center (NISCC), a predecessor
of the Centre for the Protection of National Infrastructure (CPNI) in the United
Kingdom, published a best practice guide for firewall deployment in SCADA
networks [21]. In 2007, the U.S. President’s Critical Infrastructure Protection Board
and the Department of Energy published steps an organization must put in place to
improve the security of its SCADA networks. Subsequently, in 2008, NIST released a
comprehensive guidance on a wide range of security issues, and technical, operational
and management security controls [21]. This guide was later updated in 2011.
Figure 2.7 below shows the generic SCADA hardware architecture.
Figure 2.7 Generic SCADA hardware architecture [21].
The NIST Cyber Security Framework for the United States and the
international cyber security framework standard ISO/IEC 21827 IT-ST - Systems
Security Engineering Capability Maturity Model define core principles for securing
ICSs. The following five components of cyber-security philosophy are also the basis for a
defense in depth strategy:
1. Identify: continuous identifying, evaluating and managing of cyber risks
using best practice risk assessment and management methods.
2. Protect: structured and robust built-in security architecture, network perimeter
protection, host protection, network protection, interface protection, and
secure remote connection.
3. Detect: capabilities to detect viruses and other cyber annoyances, as well as
sophisticated cyber-attacks such as APTs, both on the network and inside of
the system and each host.
4. Respond: well-established and efficient processes to handle cyber-attacks.
5. Recover: ability to quickly return to normal or degraded operation after an
attack – the after-the-fact part of defense in depth. There are some cyber-
attacks that are not possible to prevent or respond to. Most often, such cyber-
attacks are APTs and other catastrophic attacks that have a very low
probability of happening and a high impact, should they occur.
CHAPTER 3. MACHINE LEARNING: A SOLUTION TO
MITIGATE CYBER THREATS
This chapter examines the use of machine learning (ML) in cyber security. It
further provides the reader with an overview of the vast range of applications where
ML has been adopted. Finally, the report outlines a set of basic yet effective
algorithms that could be used to counter the growing cyber threat.
3.1 Introduction to Machine Learning
According to [26], ML could be defined as a method of data analysis that
automates analytical model building, using algorithms that iteratively learn from data,
which allows computers to find hidden insights without being explicitly programmed
where to look. ML is a form of artificial intelligence (AI) that provides a computer
with the ability to learn by itself without being explicitly programmed [26]. The
process involved in ML is similar to that of data mining. Both search
through data to look for patterns; however, instead of extracting data for human
comprehension, ML uses the detected patterns to adjust program
actions accordingly. Furthermore, ML focuses on the development of computer
programs that can change when exposed to new data [27].
According to SAS findings, ML evolved from pattern recognition
and the theory that computers can learn without being programmed to perform
specific tasks; researchers interested in artificial intelligence (AI) wanted to see if
computers could learn from data [26]. ML is a continuous process of self-training,
because as models are exposed to new data, they can independently adapt [26]. In
short, ML learns from previous computations to produce reliable, repeatable decisions
and results. Some widely publicized examples of ML applications are as
follows:
• The self-driving Google car, which embodies the essence of ML.
• Fraud detection, one of the more obvious and important uses in our world
today.
The resultant importance and benefit of ML is that it has made it possible to
quickly and automatically produce models that can analyze bigger, more
complex data and deliver faster, more accurate results, even on a very large scale [26].
Consequently, by building accurate models, an organization has a better prospect of
identifying profitable opportunities or avoiding unknown risks.
According to [26], the following are required to create a good ML
system:
• Data preparation capabilities;
• Algorithms – basic and advanced;
• Automation and iterative processes;
• Scalability; and
• Ensemble modeling.
3.2 Basic Machine Learning Methods
The two most widely adopted ML methods are
supervised learning and unsupervised learning. Other popular methods
besides these two are semi-supervised learning and reinforcement learning.
a. Supervised learning algorithms are trained using labeled examples such as an
input where the desired output is known. For example, a piece of equipment
could have data points labeled either “F” (failed) or “R” (runs). The learning
algorithm receives a set of inputs along with the corresponding correct
outputs, and the algorithm learns by comparing its actual output with correct
outputs to find errors. It then modifies the model accordingly. Through
methods like classification, regression, prediction and gradient boosting,
supervised learning uses patterns to predict the values of the label on
additional unlabeled data. Supervised learning is commonly used in
applications where historical data predict likely future events. For example, it
can anticipate when credit card transactions are likely to be fraudulent or
which insurance customer is likely to file a claim.
b. Unsupervised learning is used against data that has no historical labels. The
system is not told the “right answer.” The algorithm must figure out what is
being shown. The goal is to explore the data and find some structure within.
Unsupervised learning works well on transactional data. For example, it can
identify segments of customers with similar attributes who can then be treated
similarly in marketing campaigns, or it can find the main attributes that
separate customer segments from each other. Popular techniques include self-
organizing maps, nearest-neighbor mapping, k-means clustering and singular
value decomposition. These algorithms are also used to segment text topics,
recommend items and identify data outliers.
c. Semi-supervised learning is used for the same applications as supervised
learning, but it uses both labeled and unlabeled data for training – typically, a
small amount of labeled data with a large amount of unlabeled data (because
unlabeled data is less expensive and takes less effort to acquire). This type of
learning can be used with methods such as classification, regression, and
prediction. Semi-supervised learning is useful when the cost associated with
labeling is too high to allow for a fully labeled training process. Early
examples of this include identifying a person’s face on a webcam.
d. Reinforcement learning is often used for robotics, gaming and navigation.
With reinforcement learning, the algorithm discovers through trial and error
which actions yield the greatest rewards. This type of learning has three
primary components: the agent (the learner or decision maker), the
environment (everything the agent interacts with) and actions (what the agent
can do). The objective is for the agent to choose actions that maximize the
expected reward over a given amount of time. The agent will reach the goal
much faster by following a good policy. So, the goal in reinforcement learning
is to learn the best policy.
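As an illustration of supervised learning, the sketch below revisits the equipment example from item (a), where data points are labeled “F” (failed) or “R” (runs). It uses a nearest-centroid classifier, chosen here only because it fits in a few lines; it is not a method named in the text. The two features (temperature and vibration) and all data values are invented for the example. The function math.dist requires Python 3.8 or later.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def train(samples):
    """Learn one centroid (mean feature vector) per label.

    `samples` is a list of (features, label) pairs, where `features`
    is a tuple of numbers.
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid lies closest to `features`."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled history: (temperature in °C, vibration in mm/s).
history = [
    ((60.0, 2.0), "R"), ((62.0, 2.2), "R"), ((58.0, 1.8), "R"),
    ((90.0, 7.5), "F"), ((95.0, 8.0), "F"), ((88.0, 7.0), "F"),
]

model = train(history)
print(predict(model, (92.0, 7.8)))  # F
print(predict(model, (61.0, 2.1)))  # R
```

The labeled history plays the role of the "correct outputs" described above; the learned centroids are the model that is then applied to new, unlabeled readings.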
3.3 Machine Learning in Cyber security
ML and its parent technology, AI, offer a huge
opportunity: their analytic capabilities could help mitigate many of the
challenges currently witnessed in digital systems. In recent times, ML has
been hailed as the brand new weapon emerging from the multilayered discipline of
data science to penetrate the sphere of cyber security [28]. The application of ML to
address the growing trend of cyber threats is gaining popularity within the research
community. According to ABI Research [29], cyber threats are an ever-present danger to
global economies and are projected to surpass the trillion dollar mark in damages
within the next year. In view of this, the cyber security industry is investing greatly in
ML to provide a more dynamic prevention approach [29]. Furthermore, ABI Research
forecasts that ML in cyber security will boost big data, intelligence, and analytics
spending to $96 billion by 2021 [29].
Meanwhile, Dimitrios Pavlakis, an Industry Analyst at ABI Research, predicted
that the AI security revolution will drive ML solutions [29]. ML is poised
to emerge as the new norm beyond Security Information and Event Management
(SIEM), and ultimately displace a large portion of traditional antivirus (AV),
heuristics, and signature-based systems within the next five years [29]. ABI
Research further reveals that the government and defense, banking, and technology
market sectors are the primary drivers and adopters of ML technologies [29]. The
energy sector is typically slow to adopt any sudden change in technology due to
initial skepticism; however, the application of ML in this industry is growing
gradually. Documented applications of ML within the energy industry range from
finding new energy sources and predicting refinery sensor failure to streamlining oil
distribution to make it more efficient and cost-effective. Moreover, User and Entity Behavioral
Analytics (UEBA) along with Deep Learning algorithm designs are emerging as the
two most prominent technologies in cyber security offerings, especially in innovative
technology startups [29]. Meanwhile, established AV companies in the industry, such
as Symantec, continue to innovate some of their solutions from highly trained
supervised models to unsupervised and semi-supervised ones in preparation of the
constantly shifting threat variables [29]. According to ABI findings [29], SIEM
techniques are expected to be broken up and integrated within different
functions of UEBA, unsupervised, and deep learning solutions. Consequently,
signature-based AV systems will cease to be standalone products and will comprise only a
subsection of supervised ML models [29]. Meanwhile, enterprise-focused giants such
as IBM are transforming the way enterprises employ ML in every market sector, from
healthcare to enterprise analytics to cyber security [29]. On the other hand,
corporations like Gurucul, Niara, Splunk, StatusToday, Trudera, and Vectra Networks
are attempting to take the lead in innovative applications of UEBA [29]. Given the
rising trend of ML application in cyber security, Pavlakis further concludes that, “the
radical transformation is already underway and is occurring as a response to the
increasingly menacing nature of unknown threats and multiplicity of threat agents”
[29]. He further stressed that, “the proliferation of ML is also causing an explosion of
agile startups, such as JASK, focusing more on SIEM complementary network traffic
analysis and even pioneering application protection such as Sqreen” [29].
While AI technology has certainly been around for some time, data science
aided by a sharp increase in computing power has made astonishing progress
over the past few years, enabling ML to be used in almost every aspect of IT
security [28]. According to [28], ML has found numerous uses in contemporary
cyber-security applications including:
1. The ability to introduce new capability into enterprise security by
incorporating sophisticated versions of anomaly and fraud detection using
UEBA.
2. Enabling corporations to customize their own data and deliver
innovative monitoring applications (e.g., predicting behaviour for
hard-to-detect vectors such as multi-layered attacks or insider threats).
3. Transforming the data mining methods of deriving actionable insights by
considering a larger percentage of already available variables (e.g., data
harvested from network traffic, endpoints, web crawlers, etc.).
4. Leveraging the power of vast repositories such as malware and virus databases
to support existing security domains.
5. Providing a quicker and more accurate platform to assist stretched IT and
security resources that may be time-pressured to combat the rising tide of
cyber-threats.
6. Adding to existing data loss prevention (DLP) strategy as reliable security
protocols capable of self-learning, recognizing patterns and behaviours.
7. Assisting IT security personnel in their daily activities, thereby streamlining
the monitoring and decision-making process.
8. Helping design more accurate predictive models for threats, both inside and
outside the company, and capable of managing more critical attacks.
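The behavioural anomaly detection mentioned in item 1 can be sketched in its simplest form as a per-user statistical baseline: an observation is flagged when it deviates strongly from that user's own history. The example below uses a z-score test, which is far simpler than a commercial UEBA product and is shown only to convey the idea. The metric (megabytes transferred per day), the threshold of three standard deviations and the data are assumptions for illustration and do not come from [28] or [29].

```python
from statistics import mean, stdev


def is_anomalous(history, observed, z_threshold=3.0):
    """Flag `observed` if it lies more than `z_threshold` sample standard
    deviations away from the mean of the user's own `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu  # any change from a constant baseline is unusual
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical daily data transfer (MB) for one engineering account.
baseline = [20, 22, 19, 21, 23, 20, 22]

print(is_anomalous(baseline, 22))   # False: within the normal range
print(is_anomalous(baseline, 400))  # True: far outside the user's baseline
```

Because the baseline is learned from the data rather than written as a fixed signature, the same code adapts to each user or device, which is the property that distinguishes this approach from signature-based detection.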
3.4 Understanding Machine Learning Models
It is useful to arrange the data mining domain into four layers. Figure 3.1
shows the four layers of data mining and ML.
Figure 3.1 The four layers of Data mining and Machine learning [30].
The first layer represents the target application. ML can benefit many
applications such as cyber intrusion detection, credit rating, etc. The second layer
represents the ML tasks such as Classification, Regression, Clustering, etc. Each ML
task can be attained using various ML models as depicted in the third layer. Similarly,
each model can be induced from the sample data using various learning algorithms.
There are numerous selections of ML algorithms that can provide deep analysis of
sample data within the ML domain; however, it should be noted that they provide
varying results of accuracy based on their individual capabilities. Some known ML
models commonly used are: Neural networks, Decision trees, Random forests,
Associations and sequence discovery, gradient boosting and bagging, Support vector
machines, Nearest-neighbor mapping, k-means clustering, Self-organizing maps, local
search optimization techniques (e.g., genetic algorithms), Expectation maximization,
Multivariate adaptive regression splines, Naïve Bayes, Kernel density estimation,
Principal component analysis, Singular value decomposition, Gaussian mixture
models and Sequential covering rule building [26]. Meanwhile, for the benefit of this
report, focus will be restricted to only three of these models, namely Neural networks,
Naïve Bayes and Decision trees models.
3.4.1 Overview of Decision Tree
Generally, the decision tree is a popular predictive modeling approach used in
statistics, data mining and ML [30]. Decision trees can represent both classifiers
and regression models. When the target variable takes a finite set of values, the
model is referred to as a classification tree. When the target variable takes
continuous values (typically real numbers), it is referred to as a regression tree
[30]. According to [30], classification trees are regularly used in fields such as
finance, marketing, engineering and medicine. The classification tree is most
valuable as an exploratory technique. It does not attempt to replace existing
traditional statistical methods, and many other techniques, such as artificial
neural networks, can also be used to classify or predict the membership of
instances in a predefined set of classes [30].
The decision tree is a very popular technique in data mining. Many
researchers attribute the popularity of decision trees to their simplicity and
transparency [30]. Decision trees are self-explanatory; there is no need to be a data
mining expert to follow a given decision tree. Classification trees are typically
represented graphically as hierarchical structures, which makes them easier to
interpret than other techniques. However, according to [30], when a classification
tree becomes large and complicated, its graphical representation becomes
ineffective.
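The way a tree chooses a split on a numeric attribute can be illustrated with a short sketch. This is a generic Gini-impurity search for one threshold, not the exact procedure of any WEKA classifier, and the packet lengths and labels below are hypothetical values chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value < threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value < t]
        right = [label for value, label in pairs if value >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical packet lengths (bytes) and labels, for illustration only.
lengths = [60, 66, 74, 74, 900, 1000, 1200, 1400]
labels = ["SAFE"] * 4 + ["NOT_SAFE"] * 4
threshold, impurity = best_threshold(lengths, labels)
print(threshold, impurity)
```

A full tree repeats this search recursively on each resulting subset, which is what produces the nested "Length < ..." tests visible in tree output such as Appendix E.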
3.4.2 Overview of Neural Networks
Modern neural networks are non-linear statistical data modeling tools that are
modeled on biological neural networks [31]. Structurally, a neural network is built
from layers of artificial neurons, or computational units able to receive input and
apply an activation function along with a threshold to determine if messages are
passed along [31]. They are often used to model complex relationships between inputs
and outputs, to find patterns in data, or to capture the statistical structure in an
unknown joint probability distribution between observed variables [31]. According to
[31], neural networks are characterized by containing adaptive weights along paths
between neurons that can be tuned by a learning algorithm that learns from observed
data in order to improve the model. The related algorithms form an integral part of
ML and can be used in many applications. The technique is generally accurate and
performs well. Neural networks use cost functions to learn the optimal solution to
the problem being solved [31]. This is achieved by determining the best values for
all the tunable parameters in the model, where the adaptive neuron path weights are
the primary target, along with algorithm tuning parameters such as the learning
rate [31]. This is usually carried out through optimization techniques such as
gradient descent or stochastic gradient descent. The model architecture and tuning
are major components of neural networks that give this technique a significant
performance advantage over other ML models [31]. However, the model becomes
increasingly complicated as its problem-solving capability is increased by adding
hidden layers, neurons in any given layer, and/or connections between neurons [31].
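The role of the cost function, the weights and the learning rate can be shown with a minimal sketch of a single sigmoid neuron trained by batch gradient descent. It is not the multilayer perceptron used later in this thesis; the inputs are hypothetical scaled packet lengths, and the learning rate and epoch count are simply illustrative defaults.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(xs, ys, w, b):
    """Mean cross-entropy cost of the neuron on the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train_neuron(xs, ys, learning_rate=0.3, epochs=500):
    """Fit one weight and one bias by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the cost w.r.t. the pre-activation
            grad_w += error * x / n
            grad_b += error / n
        w -= learning_rate * grad_w  # step against the gradient
        b -= learning_rate * grad_b
    return w, b


# Hypothetical inputs: packet length divided by 1000; 1 = NOT_SAFE, 0 = SAFE.
xs = [0.05, 0.06, 0.07, 0.9, 1.0, 1.2]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_neuron(xs, ys)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(predictions, cross_entropy(xs, ys, w, b))
```

Adding hidden layers replaces the single weight with weight matrices, but the update rule keeps the same shape: compute the cost gradient, then move each parameter a small step against it.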
3.4.3 Overview of Naïve Bayes
Naïve Bayes is a classification method based on Bayes’ theorem that relies on
a simple probabilistic assumption of independence among predictors [32]. The
classifier assumes that the presence of a feature in a class is unrelated to the
presence of any other feature. The Naïve Bayes model is particularly useful for
very large data sets. Besides its simplicity, Naïve Bayes has in specific cases
been known to outperform far more sophisticated classification techniques [32],
especially when the input has high dimensionality. Naïve Bayes has been researched
extensively since the 1950s. It was introduced under a different name into the text
retrieval community in the early 1960s, and it remains a popular technique for text
categorization [32]. In the computer science and ML literature, Naïve Bayes models
are known under an array of names, including simple Bayes and independence Bayes.
All these names reference the use of Bayes’ theorem in the classifier’s decision
rule, but Naïve Bayes is not necessarily a Bayesian method in itself [32]. Naïve
Bayes classifiers are highly scalable, requiring a number of parameters linear in
the number of variables or predictors in a learning problem [32].
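The independence assumption means that the score of each class is simply the class prior multiplied by one likelihood term per feature. The sketch below illustrates this using the class priors, protocol counts and Length statistics printed in the Naïve Bayes model of Appendix A. It is an approximation for illustration only: WEKA's own handling of numeric attributes may differ in detail, so the probabilities are not expected to match WEKA exactly.

```python
import math

# Values copied from the Naive Bayes model listing in Appendix A.
PRIORS = {"SAFE": 0.83, "NOT_SAFE": 0.17}
PROTOCOL_COUNTS = {
    "SAFE": {"TCP": 378, "UDP": 41, "NBNS": 23, "HTTP": 64,
             "DB-LSP-DISC": 33, "ICMP": 98},
    "NOT_SAFE": {"TCP": 15, "UDP": 6, "NBNS": 1, "HTTP": 1,
                 "DB-LSP-DISC": 1, "ICMP": 114},
}
LENGTH_STATS = {  # (mean, standard deviation) of packet length per class
    "SAFE": (275.6529, 372.9078),
    "NOT_SAFE": (876.0842, 367.2059),
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as the likelihood of Length."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior(protocol, length):
    """Class probabilities: prior x P(protocol | class) x P(length | class)."""
    scores = {}
    for cls, prior in PRIORS.items():
        counts = PROTOCOL_COUNTS[cls]
        p_protocol = counts[protocol] / sum(counts.values())
        mean, std = LENGTH_STATS[cls]
        scores[cls] = prior * p_protocol * gaussian_pdf(length, mean, std)
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


def classify(protocol, length):
    """Return the class with the highest posterior probability."""
    probabilities = posterior(protocol, length)
    return max(probabilities, key=probabilities.get)


print(classify("ICMP", 1000), classify("TCP", 60))
```

Because each feature contributes one independent factor, adding a feature adds only one more table or one more mean and standard deviation per class, which is why the number of parameters grows linearly with the number of predictors.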
Applications of Naive Bayes’ Algorithms
• Real-Time Prediction: Naive Bayes is a very fast and effective learning
classifier. This characteristic is suitable for making accurate predictions in
real time [32].
• Multiclass Prediction: This algorithm is known for its multi-class prediction
attribute. Thus, the classifier is capable of predicting the probability of multiple
classes of the target variable.
• Text classification: Naïve Bayes classifiers are commonly used in text
classification and have a superior accuracy and detection rate when compared
with other classifiers. For this reason, they are commonly applied in spam
recognition and filtering. They are also applied in social media analysis, as
sentiment analysis, to identify positive and negative customer sentiments [32].
• Recommendation System: A Naïve Bayes classifier can be combined with
collaborative filtering to build a recommendation system, which leverages ML
and data mining methods to filter hidden information and predict whether a
user would prefer a given resource or not [32].
CHAPTER 4. EMPIRICAL INVESTIGATION
In the previous chapters, trends involving cyber incidents against SCADA and
PCN were reviewed, and the possibility of using ML as a mitigation was explored.
This chapter introduces the research concepts and methodology. The research design
and approaches used are also addressed.
4.1 Introduction
A comprehensive review has been conducted on the impact and trend of cyber
incidents involving SCADA and PCN within the energy sector. Thereafter, several
ML frameworks were discussed and conceptualized. The focus now turns to a practical
case in which ML is applied. This chapter focuses on the research concept and
methodology of the empirical investigation adopted in achieving the objectives of
this thesis.
The methodology of this thesis is explained in the following steps:
1. Create data sets (using the Wireshark network packet analyzer),
2. Feature extraction of data (CSV format),
3. Combine and convert data (Attribute-Relation File Format [ARFF]), and
4. Interpretation of results.
a. Verification and Validation
The methodology that is explained in this chapter is depicted in Figure 4.1
below.
[Figure 4.1 flowchart: Create data (normal data and malicious data) -> Feature
extraction of data -> Combine and convert data -> Naïve Bayes, Decision Tree,
Neural Networks -> Interpret results]
Figure 4.1 Based on the generated data, ML models are created that can classify
packets.
4.2 Creating the data sets
The data required for this research were generated using the Wireshark and
WEKA software. Wireshark is a network analysis tool, formerly known as Ethereal,
which captures network packets in real time and displays them in a human-readable
format [33]. WEKA is an application that contains a collection of ML algorithms,
which are typically used to solve real-world data mining problems [6]. It is
developed in Java and is compatible with almost any platform [6]. The algorithms
can either be applied to a data set directly or called from the user’s own Java
code [6]. Figure 4.2 below shows a screenshot of Wireshark during data packet
capture.
Figure 4.2 Wireshark screenshot during packet capture.
Two sets of unique data were required: normal data and malicious data.
Normal data was generated by launching the Wireshark application and capturing
packets on the network interface card that connected the target machine to the
Local Area Network (LAN). While the capture was in progress, the target machine was
used normally to perform activities such as web browsing, downloading files and
emailing for a minimum of 10 minutes [34]. Specific filters such as HTTP, ICMP,
Telnet, etc. were applied to capture different types of traffic, so that the same
training data set could be used in WEKA. Similarly, malicious data was generated by
launching different vulnerability exploits from a remote machine against the target
machine for a minimum duration of 10 minutes. Metasploit, a penetration testing
tool, was used to simulate the malicious exploits [35] (see footnote 1). The
captured packets (normal data and malicious data) were exported from Wireshark as a
comma-separated values (CSV) file for data mining and analysis in the WEKA ML
software.
4.3 Feature extraction of data (CSV format)
It is first necessary to gather all the data together into a set of
instances. Preparing the input data for a data mining analysis usually consumes the
bulk of the effort invested in the entire data mining process. The data packets
captured using Wireshark were exported as CSV because the WEKA ML application does
not support pcap files. Pcap, which stands for packet capture, is an application
programming interface (API) for capturing network traffic, and the name is also
used for the capture files it produces. Different class categorizations were defined
based on the application that generated the traffic, as shown in Table 4.1 below.
1 Metasploit is an exploit development framework initiated by H. D. Moore in 2003, which was later
acquired by Rapid7. It is a tool used for the development of exploits and the testing of these exploits on
live targets.

Table 4.1 Class Categorization.

Traffic Type            Traffic Category
TCP PORT 2869           LAN_COMMS
NBNS                    NetBios_TCP/IP
HTTP                    Browser
DB-LSP-DISC             Dropbox_Cloud
ICMP <= 74 BYTES        Normal_Ping
UDP PORT 5938           Teamviewer
ICMP > 74 BYTES         Abnormal_Ping (HIGH_PAYLOAD)
TCP/UDP PORT 1434       SQL_PAYLOAD_EXEC
UDP PORT 3478           Skype
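The rules in Table 4.1 can be sketched as a small lookup function. This is only an illustration of the categorization logic: the function name, its parameters and the "Uncategorized" fallback are hypothetical, and the thesis does not state how port numbers were extracted from the capture.

```python
# Port-based rules from Table 4.1: (protocol, port) -> traffic category.
PORT_RULES = {
    ("TCP", 2869): "LAN_COMMS",
    ("UDP", 5938): "Teamviewer",
    ("TCP", 1434): "SQL_PAYLOAD_EXEC",
    ("UDP", 1434): "SQL_PAYLOAD_EXEC",
    ("UDP", 3478): "Skype",
}

# Protocol-only rules from Table 4.1.
PROTOCOL_RULES = {
    "NBNS": "NetBios_TCP/IP",
    "HTTP": "Browser",
    "DB-LSP-DISC": "Dropbox_Cloud",
}


def categorize(protocol, length, port=None):
    """Map one packet to a traffic category following Table 4.1."""
    protocol = protocol.upper()
    if protocol == "ICMP":
        # Pings larger than 74 bytes are treated as carrying a high payload.
        return "Normal_Ping" if length <= 74 else "Abnormal_Ping (HIGH_PAYLOAD)"
    if (protocol, port) in PORT_RULES:
        return PORT_RULES[(protocol, port)]
    if protocol in PROTOCOL_RULES:
        return PROTOCOL_RULES[protocol]
    return "Uncategorized"  # hypothetical fallback, not part of Table 4.1


print(categorize("ICMP", 74), categorize("ICMP", 1042), categorize("UDP", 120, port=3478))
```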
4.4 Combine and convert data (ARFF format)
WEKA expects the data file to be in ARFF. Before an ML algorithm can be
applied to the captured data, the data must be converted into ARFF (a file with the
.arff extension). Consequently, the exported CSV file was imported and converted
into ARFF using the WEKA ARFF viewer, so that it could be read by the WEKA
application. Figure 4.3 below shows an ARFF file for the data packet information
exported from the Wireshark application.
Figure 4.3 Combine and convert data (ARFF format).
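For readers who prefer a scripted conversion, the following sketch shows what the CSV-to-ARFF step amounts to. It is not the WEKA ARFF viewer used in this research. The column names follow Wireshark's default CSV export, the "Result" label column is assumed to have been added when the normal and malicious captures were combined, and the sample rows are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation="packets"):
    """Build ARFF text with the Protocol, Length and Result attributes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    protocols = sorted({row["Protocol"] for row in rows})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute Protocol {" + ",".join(protocols) + "}",  # nominal attribute
        "@attribute Length numeric",
        "@attribute Result {SAFE,NOT_SAFE}",  # class attribute
        "",
        "@data",
    ]
    for row in rows:
        lines.append(f'{row["Protocol"]},{int(row["Length"])},{row["Result"]}')
    return "\n".join(lines) + "\n"


# Hypothetical two-packet export, for illustration only.
sample_csv = '''"No.","Time","Source","Destination","Protocol","Length","Info","Result"
"1","0.000000","192.168.1.10","192.168.1.1","TCP","66","hypothetical row","SAFE"
"2","0.104512","192.168.1.20","192.168.1.10","ICMP","1042","hypothetical row","NOT_SAFE"
'''
arff_text = csv_to_arff(sample_csv)
print(arff_text)
```

The header declares each attribute and its type once, and every line after "@data" is one packet, which is the structure shown in Figure 4.3.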
4.5 Interpretation of results and findings
Although the WEKA ML software offers various techniques and classifiers
for data analysis, this research focuses on only three classifiers, selected based
on their unique features and capabilities. This is considered necessary to provide
a comparative and comprehensive assessment of the captured data. Classifiers in
WEKA are models for predicting nominal or numeric quantities. The learning
algorithms used in this research to analyze the captured data are the Naïve Bayes,
Multilayer Perceptron and Random Tree classifiers. The two sets of data captured
using the Wireshark application are used to train the classifier models within the
WEKA application. Table 4.2 shows the WEKA output results comparing the classifier
models (Naïve Bayes, Multilayer Perceptron and Random Tree) used in this research.
Table 4.2 WEKA output results – Classifiers comparison analysis.

                                 Cross Validation                  % Split
Output Results                   Naïve    Multilayer  Random   Naïve    Multilayer  Random
                                 Bayes    Perceptron  Tree     Bayes    Perceptron  Tree
Classification Accuracy (%)      93.58    97.51       98.17    94.62    97.69       98.08
Error Rates (%)                  6.42     2.49        1.84     5.39     2.31        1.92
Kappa statistic                  0.7827   0.9077      0.9335   0.8012   0.9076      0.9269
Mean absolute error              0.1014   0.0485      0.0315   0.0922   0.0464      0.0328
Root mean squared error          0.2239   0.156       0.135    0.2032   0.1501      0.1363
Relative absolute error (%)      35.37    16.92       10.99    32.65    16.41       11.60
Root relative squared error (%)  59.19    41.24       35.68    55.63    41.08       37.32

The results in Table 4.2 reveal some notable findings. When only the
“protocol and length” features were used, the Random Tree model achieved a
classification accuracy of approximately 98% under both 10-fold cross-validation
(98.17%) and percentage split analysis (98.08%). The Multilayer Perceptron and
Naïve Bayes models obtained classification accuracies of approximately 97% and
94%, respectively, when the analyses were conducted on the same set of captured
data.
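The accuracy and kappa statistic reported by WEKA follow directly from the confusion-matrix counts. As a minimal sketch, the function below recomputes both for the Naïve Bayes cross-validation matrix listed in Appendix A; the function name is illustrative, and the kappa formula is the standard Cohen's kappa.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Naive Bayes, 10-fold cross-validation (Appendix A): rows SAFE, NOT_SAFE.
naive_bayes_cv = [[601, 30], [19, 113]]
accuracy, kappa = accuracy_and_kappa(naive_bayes_cv)
print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
```

The kappa statistic is lower than the raw accuracy because it discounts the agreement that a classifier would reach by chance on a data set dominated by SAFE packets.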
The algorithms were evaluated using the 10-fold cross-validation
technique. This ensures that the predictive ML model is given an opportunity to
make a prediction for every instance of the data set (with different training
folds), and the presented result is a summary of those predictions. The data set is
divided into 10 parts (folds): nine parts are used to train the algorithm and the
remaining part is used to evaluate it. The process is repeated 10 times so that
each part is used exactly once for evaluation. Finally, the results are
averaged (or otherwise combined) to produce a single estimate.
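The fold bookkeeping described above can be sketched as follows. This is a plain (non-stratified) split for illustration, whereas the WEKA output in the appendices reports a stratified cross-validation; the function name and seed are hypothetical.

```python
import random


def k_fold_splits(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 763 is the number of labelled instances evaluated in Appendix A.
splits = list(k_fold_splits(763, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each instance lands in exactly one test fold, so every packet receives one prediction from a model that was not trained on it.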
4.5.1 Verification and Validation
Verification and validation of the results were carried out by re-running
the model analysis using the percentage split technique. The percentage split
technique trains the classifier on a fixed percentage of the data and evaluates how
accurately it predicts the remainder, which is held out for testing. In this
research, 66% of the data was used for training and the remaining 34% for testing
(see the test mode reported in Appendices B and D).
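The size of the held-out set can be checked with simple arithmetic. The sketch below assumes the training share is rounded to the nearest whole instance, which is an assumption about the tool's behaviour rather than something stated in this thesis.

```python
def percentage_split_sizes(n_instances, train_percent=66.0):
    """Return (train_size, test_size) for a percentage split."""
    train_size = round(n_instances * train_percent / 100)
    return train_size, n_instances - train_size


# 767 instances are listed in the run information of the appendices.
train_size, test_size = percentage_split_sizes(767)
print(train_size, test_size)
```

Under that assumption the test set holds 261 instances, which is consistent with Appendix B reporting 260 evaluated instances plus 1 ignored instance with an unknown class.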
Furthermore, the confusion matrix as shown in Table 4.3 offers further
evidence to suggest that the classification accuracy of the Random Tree model
provides the most balanced result.
Table 4.3 WEKA output results – Confusion matrix analysis.

Classifier (test mode)                     Actual class   Classified as   Classified as
                                                          a (SAFE)        b (NOT_SAFE)
Naïve Bayes (Cross Validation)             SAFE           601             30
                                           NOT_SAFE       19              113
Naïve Bayes (% Split)                      SAFE           211             8
                                           NOT_SAFE       6               35
Multilayer Perceptron (Cross Validation)   SAFE           631             0
                                           NOT_SAFE       19              113
Multilayer Perceptron (% Split)            SAFE           219             0
                                           NOT_SAFE       6               35
Random Tree (Cross Validation)             SAFE           630             1
                                           NOT_SAFE       13              119
Random Tree (% Split)                      SAFE           217             2
                                           NOT_SAFE       3               38

Table 4.3 compares the actual classes with the predicted classes. For the
Random Tree model under the percentage split, there were two instances where SAFE
traffic was classified as NOT_SAFE and only three instances where NOT_SAFE traffic
was classified as SAFE. The detailed WEKA output is given in Appendices A, B, C,
D, E and F.
4.6 Using active learning method of ML for Cyber security intrusion detection for PCN application
Presenting the WEKA ML outcome for each classifier model demonstrates the
potential suitability of adapting the ML framework to mitigate the growing trend of
cyber-attacks on PCNs, particularly in the oil and gas industry.
In some cases it was impractical to simulate malicious data effectively
because no actual PCN infrastructure was available, and this may have affected the
detection accuracy of the ML algorithms. Although legal constraints prevented
actual PCN data from being obtained for this experiment, the Wireshark data capture
provided an alternative way to demonstrate the potential use of ML, as it generates
rules automatically by analyzing Ethernet traffic similar to that transmitted
within the PCN environment. Furthermore, it is suggested that these rules could be
implemented by PCN administrators or analysts on PCN firewalls, or simply used as
triggers to generate alarms notifying of potential cyber security gaps in the PCN
environment. Nevertheless, it must be noted that the use of substitute data may
have affected the accuracy of the classifiers in this experiment.
This research work has provided a foundation on which future
experimental work can be built. A typical application to which the major outcomes
of this research can be adapted is the use of the active learning method of ML to
further enhance data gathering in the detection of cyber intrusion. According to
[36], the concept of active learning has been used with statistically-based
learning architectures. Furthermore, active learning has also been used in
conjunction with Support Vector Machines (SVM) to improve detection accuracy with a
smaller training data set [37]. The idea of active learning in this research is to
create a collaboration between the human and the machine in which the machine
tags data points and asks for confirmation from the annotator.
Moreover, [38] adopted a similar framework to identify documents in a large
pool that had qualities similar to those provided as samples. According to [38],
active learning involves an oracle (such as a PCN administrator) and builds a loop
to iteratively improve the accuracy and performance of the classifier. The oracle
(PCN administrator) provides a precise label for some of the instances
(data packets) in order to provide more information to the classifier, and the
classifier then updates its database with this information [38].
First, some known instances and labels (data packets) are introduced into
the classifier or ML model, as depicted in Figure 4.4. The classifier then goes
through all the data packets without knowing their labels or traffic rules, and
determines the traffic rules for which it is difficult to establish whether the
traffic is “SAFE” or “NOT SAFE.” Then, the oracle (PCN administrator) reads these
traffic rules and decides to accept or reject each of them (in this case acceptance
means that the traffic is “SAFE,” whereas rejection means that the traffic is
“NOT SAFE”).
Figure 4.4 Active Learning Process [38].
Given all this information, the classifier re-trains and adjusts itself to be
better suited to the problem. The loop continues until the PCN administrator is
satisfied with the performance of the classifier. Figure 4.5 below depicts the
process flow of how this model functions.
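The loop described above can be sketched with a deliberately simple classifier. Everything in this example is hypothetical: the classifier is a single length threshold, the pool of packet lengths is invented, and the oracle function stands in for the PCN administrator's judgement. The query strategy shown (ask about the packet closest to the current decision boundary) is one common form of uncertainty sampling and is not necessarily the strategy used in [38].

```python
def fit_threshold(labelled):
    """Place the boundary midway between the largest SAFE and smallest NOT_SAFE length."""
    safe = [length for length, label in labelled.items() if label == "SAFE"]
    not_safe = [length for length, label in labelled.items() if label == "NOT_SAFE"]
    return (max(safe) + min(not_safe)) / 2


def predict(threshold, length):
    """Classify one packet by its length."""
    return "NOT_SAFE" if length > threshold else "SAFE"


def active_learning(pool, labelled, oracle, budget):
    """Query the oracle about the most uncertain packets, retraining after each answer."""
    labelled = dict(labelled)
    threshold = fit_threshold(labelled)
    for _ in range(budget):
        candidates = [length for length in pool if length not in labelled]
        if not candidates:
            break
        # The packet nearest the boundary is the one the classifier is least sure about.
        query = min(candidates, key=lambda length: abs(length - threshold))
        labelled[query] = oracle(query)  # the administrator confirms or rejects
        threshold = fit_threshold(labelled)  # re-train with the new label
    return threshold, labelled


def oracle(length):
    """Hypothetical stand-in for the PCN administrator's decision."""
    return "NOT_SAFE" if length >= 900 else "SAFE"


def accuracy(threshold, pool):
    """Share of the pool on which the classifier agrees with the oracle."""
    return sum(predict(threshold, x) == oracle(x) for x in pool) / len(pool)


pool = [54, 60, 66, 74, 98, 150, 700, 820, 900, 1000, 1200, 1500]
seed_labels = {54: "SAFE", 1500: "NOT_SAFE"}
initial_threshold = fit_threshold(seed_labels)
final_threshold, final_labels = active_learning(pool, seed_labels, oracle, budget=4)
print(accuracy(initial_threshold, pool), accuracy(final_threshold, pool))
```

The point of the design is that the administrator labels only the handful of packets the classifier finds ambiguous, rather than the whole capture.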
CHAPTER 5. CONCLUSION
Chapter 4 presented the findings of the ML assessment of Ethernet data
comprising normal and malicious traffic. This chapter concludes the overall research
investigation. The contributions of the thesis are indicated; recommendations and
future research opportunities are identified.
5.1 Conclusion and Recommendations
This research had two main objectives:
1. To assess the rising incidents of cyber security exploits on PCNs in the
Petrochemical industry.
2. To develop and formulate a framework that will mitigate these rising
occurrences using a hybrid approach.
Although the data set used in this research is small, the experimentation
produced results from which valuable deductions could be made. The ML analysis
was conducted on a set of data packets containing a mix of normal and malicious
data. The analysis showed that the “Traffic_Category” feature is the key attribute
of reference used in the analysis. Using the “protocol and length” features only,
the Random Tree model achieved a classification accuracy of approximately 98%
under both 10-fold cross-validation and percentage split analysis. The Multilayer
Perceptron and Naïve Bayes models obtained classification accuracies of
approximately 97% and 94%, respectively, when the analyses were conducted on the
same set of captured data. The outcome is that this research has presented an
opportunity for ML to be implemented alongside the traditional PCN risk standards
and conventional firewalls as triggers that dynamically generate alarms notifying
of potential cyber security gaps within the PCN environment. Nevertheless, the 2%
error margin of the best ML classifier is substantial enough to allow a significant
cyber exploit with severe impact. It is therefore recommended that this
experimental work be further advanced on an actual PCN environment using the
concept of active learning. With this concept, ML is used to create a collaboration
between the human and the machine in which the machine tags data points and asks
for confirmation from the PCN administrator to further enhance data gathering in
the detection of cyber intrusion.
APPENDIX. A
NAÏVE BAYES (CROSS VALIDATION)
=== Run information ===
Scheme: WEKA.classifiers.Bayes.NaiveBayes
Relation: Normal Data_Malicious Data(Comma)-
WEKA.filters.unsupervised.attribute.Remove-R1-
WEKA.filters.AllFilter-WEKA.filters.MultiFilter-
FWEKA.filters.AllFilter-
WEKA.filters.unsupervised.attribute.Remove-R7-
WEKA.filters.unsupervised.attribute.Remove-R5-
WEKA.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute SAFE NOT_SAFE
(0.83) (0.17)
==================================
Protocol
TCP 378.0 15.0
UDP 41.0 6.0
NBNS 23.0 1.0
HTTP 64.0 1.0
DB-LSP-DISC 33.0 1.0
ICMP 98.0 114.0
[total] 637.0 138.0
Length
mean 275.6529 876.0842
std. dev. 372.9078 367.2059
weight sum 631 132
precision 8.2139 8.2139
Time taken to build model: 0.05 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 714 93.578 %
Incorrectly Classified Instances 49 6.422 %
Kappa statistic 0.7827
Mean absolute error 0.1014
Root mean squared error 0.2239
Relative absolute error 35.3463 %
Root relative squared error 59.1885 %
Total Number of Instances 763
Ignored Class Unknown Instances 4
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.952    0.144    0.969      0.952   0.961      0.784  0.878     0.934     SAFE
               0.856    0.048    0.790      0.856   0.822      0.784  0.888     0.884     NOT_SAFE
Weighted Avg.  0.936    0.127    0.938      0.936   0.937      0.784  0.879     0.925
=== Confusion Matrix ===
a b <-- classified as
601 30 | a = SAFE
19 113 | b = NOT_SAFE
APPENDIX. B
NAÏVE BAYES (PERCENTAGE SPLIT)
=== Run information ===
Scheme: WEKA.classifiers.Bayes.NaiveBayes
Relation: Normal Data_Malicious Data(Comma)-
WEKA.filters.unsupervised.attribute.Remove-R1-WEKA.filters.AllFilter-
WEKA.filters.MultiFilter-FWEKA.filters.AllFilter-
WEKA.filters.unsupervised.attribute.Remove-R7-
WEKA.filters.unsupervised.attribute.Remove-R5-
WEKA.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute SAFE NOT_SAFE
(0.83) (0.17)
==================================
Protocol
TCP 378.0 15.0
UDP 41.0 6.0
NBNS 23.0 1.0
HTTP 64.0 1.0
DB-LSP-DISC 33.0 1.0
ICMP 98.0 114.0
[total] 637.0 138.0
Length
mean 275.6529 876.0842
std. dev. 372.9078 367.2059
weight sum 631 132
precision 8.2139 8.2139
Time taken to build model: 0 seconds
=== Evaluation on test split ===
Time taken to test model on test split: 0 seconds
=== Summary ===
Correctly Classified Instances 246 94.6154 %
Incorrectly Classified Instances 14 5.3846 %
Kappa statistic 0.8012
Mean absolute error 0.0922
Root mean squared error 0.2032
Relative absolute error 32.6475 %
Root relative squared error 55.6315 %
Total Number of Instances 260
Ignored Class Unknown Instances 1
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.963    0.146    0.972      0.963   0.968      0.802  0.904     0.960     SAFE
               0.854    0.037    0.814      0.854   0.833      0.802  0.901     0.876     NOT_SAFE
Weighted Avg.  0.946    0.129    0.947      0.946   0.947      0.802  0.903     0.947
=== Confusion Matrix ===
a b <-- classified as
211 8 | a = SAFE
6 35 | b = NOT_SAFE
APPENDIX. C
MULTILAYER PERCEPTRON (CROSS VALIDATION)
=== Run information ===
Scheme: WEKA.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N
500 -V 0 -S 0 -E 20 -H a
Relation: Normal Data_Malicious Data(Comma)-
WEKA.filters.unsupervised.attribute.Remove-R1-WEKA.filters.AllFilter-
WEKA.filters.MultiFilter-FWEKA.filters.AllFilter-
WEKA.filters.unsupervised.attribute.Remove-R7-
WEKA.filters.unsupervised.attribute.Remove-R5-
WEKA.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Sigmoid Node 0
Inputs Weights
Threshold -6.701609106073109
Node 2 2.022658388075181
Node 3 3.4115657853982126
Node 4 2.3661383238993885
Node 5 2.4535891235346226
Sigmoid Node 1
Inputs Weights
Threshold 6.706022284568894
Node 2 -2.0664535731177893
Node 3 -3.380110415895844
Node 4 -2.3906272856469992
Node 5 -2.4212852245189156
Sigmoid Node 2
Inputs Weights
Threshold 1.2389105849750572
Attrib Protocol=TCP 1.5562765371242953
Attrib Protocol=UDP -1.5475321168186755
Attrib Protocol=NBNS -1.130914466657216
Attrib Protocol=HTTP 0.6447635193787378
Attrib Protocol=DB-LSP-DISC -1.1292570967597721
Attrib Protocol=ICMP -3.4330801543448985
Attrib Length -4.958940391477846
Sigmoid Node 3
Inputs Weights
Threshold 1.4384081759499916
Attrib Protocol=TCP 1.7757717668743795
Attrib Protocol=UDP -1.5234010753801517
Attrib Protocol=NBNS -1.3467652742850982
Attrib Protocol=HTTP 0.7816564846484115
Attrib Protocol=DB-LSP-DISC -1.265188338070933
Attrib Protocol=ICMP -4.150582386758663
Attrib Length -5.911288725560784
Sigmoid Node 4
Inputs Weights
Threshold 1.3317486924991702
Attrib Protocol=TCP 1.6211890650197724
Attrib Protocol=UDP -1.4922782883420325
Attrib Protocol=NBNS -1.177109463050558
Attrib Protocol=HTTP 0.6804947543129244
Attrib Protocol=DB-LSP-DISC -1.1857817570997022
Attrib Protocol=ICMP -3.630809553738711
Attrib Length -5.222928884763834
Sigmoid Node 5
Inputs Weights
Threshold 1.2946085326798238
Attrib Protocol=TCP 1.6498330921541253
Attrib Protocol=UDP -1.5552136751666403
Attrib Protocol=NBNS -1.2520708647725358
Attrib Protocol=HTTP 0.7043422292652405
Attrib Protocol=DB-LSP-DISC -1.1237095075576766
Attrib Protocol=ICMP -3.6403819042448466
Attrib Length -5.261345235018875
Class SAFE
Input
Node 0
Class NOT_SAFE
Input
Node 1
Time taken to build model: 0.55 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 744 97.5098 %
Incorrectly Classified Instances 19 2.4902 %
Kappa statistic 0.9077
Mean absolute error 0.0485
Root mean squared error 0.156
Relative absolute error 16.9239 %
Root relative squared error 41.2421 %
Total Number of Instances 763
Ignored Class Unknown Instances 4
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               1.000    0.144    0.971      1.000   0.985      0.912  0.879     0.938     SAFE
               0.856    0.000    1.000      0.856   0.922      0.912  0.888     0.883     NOT_SAFE
Weighted Avg.  0.975    0.119    0.976      0.975   0.974      0.912  0.881     0.929
APPENDIX. D
MULTILAYER PERCEPTRON (PERCENTAGE SPLIT)
=== Run information ===
Scheme: WEKA.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N
500 -V 0 -S 0 -E 20 -H a
Relation: Normal Data_Malicious Data(Comma)-
WEKA.filters.unsupervised.attribute.Remove-R1-WEKA.filters.AllFilter-
WEKA.filters.MultiFilter-FWEKA.filters.AllFilter-
WEKA.filters.unsupervised.attribute.Remove-R7-
WEKA.filters.unsupervised.attribute.Remove-R5-
WEKA.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
Sigmoid Node 0
Inputs Weights
Threshold -6.701609106073109
Node 2 2.022658388075181
Node 3 3.4115657853982126
Node 4 2.3661383238993885
Node 5 2.4535891235346226
Sigmoid Node 1
Inputs Weights
Threshold 6.706022284568894
Node 2 -2.0664535731177893
Node 3 -3.380110415895844
Node 4 -2.3906272856469992
Node 5 -2.4212852245189156
Sigmoid Node 2
Inputs Weights
Threshold 1.2389105849750572
Attrib Protocol=TCP 1.5562765371242953
Attrib Protocol=UDP -1.5475321168186755
Attrib Protocol=NBNS -1.130914466657216
Attrib Protocol=HTTP 0.6447635193787378
Attrib Protocol=DB-LSP-DISC -1.1292570967597721
Attrib Protocol=ICMP -3.4330801543448985
Attrib Length -4.958940391477846
Sigmoid Node 3
Inputs Weights
Threshold 1.4384081759499916
Attrib Protocol=TCP 1.7757717668743795
Attrib Protocol=UDP -1.5234010753801517
Attrib Protocol=NBNS -1.3467652742850982
Attrib Protocol=HTTP 0.7816564846484115
Attrib Protocol=DB-LSP-DISC -1.265188338070933
Attrib Protocol=ICMP -4.150582386758663
Attrib Length -5.911288725560784
Sigmoid Node 4
Inputs Weights
Threshold 1.3317486924991702
Attrib Protocol=TCP 1.6211890650197724
Attrib Protocol=UDP -1.4922782883420325
Attrib Protocol=NBNS -1.177109463050558
Attrib Protocol=HTTP 0.6804947543129244
Attrib Protocol=DB-LSP-DISC -1.1857817570997022
Attrib Protocol=ICMP -3.630809553738711
Attrib Length -5.222928884763834
Sigmoid Node 5
Inputs Weights
Threshold 1.2946085326798238
Attrib Protocol=TCP 1.6498330921541253
Attrib Protocol=UDP -1.5552136751666403
Attrib Protocol=NBNS -1.2520708647725358
Attrib Protocol=HTTP 0.7043422292652405
Attrib Protocol=DB-LSP-DISC -1.1237095075576766
Attrib Protocol=ICMP -3.6403819042448466
Attrib Length -5.261345235018875
Class SAFE
Input
Node 0
Class NOT_SAFE
Input
Node 1
Time taken to build model: 0.55 seconds
=== Evaluation on test split ===
Time taken to test model on test split: 0 seconds
=== Summary ===
Correctly Classified Instances 254 97.6923 %
Incorrectly Classified Instances 6 2.3077 %
Kappa statistic 0.9076
Mean absolute error 0.0464
Root mean squared error 0.1501
Relative absolute error 16.4131 %
Root relative squared error 41.0833 %
Total Number of Instances 260
Ignored Class Unknown Instances 1
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               1.000    0.146    0.973      1.000   0.986      0.912  0.904     0.960     SAFE
               0.854    0.000    1.000      0.854   0.921      0.912  0.901     0.877     NOT_SAFE
Weighted Avg.  0.977    0.123    0.978      0.977   0.976      0.912  0.903     0.947
APPENDIX. E
RANDOM TREE (CROSS VALIDATION)
=== Run information ===
Scheme: weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1
Relation: Normal Data_Malicious Data(Comma)-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.AllFilter-weka.filters.MultiFilter-Fweka.filters.AllFilter-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
RandomTree
==========
Length < 797
| Length < 67
| | Length < 48.5 : NOT_SAFE (2/0)
| | Length >= 48.5
| | | Length < 56.5 : SAFE (22/0)
| | | Length >= 56.5
| | | | Length < 59
| | | | | Protocol = TCP : NOT_SAFE (1/0)
| | | | | Protocol = UDP : SAFE (1/0)
| | | | | Protocol = NBNS : SAFE (0/0)
| | | | | Protocol = HTTP : SAFE (0/0)
| | | | | Protocol = DB-LSP-DISC : SAFE (0/0)
| | | | | Protocol = ICMP : SAFE (0/0)
| | | | Length >= 59
| | | | | Protocol = TCP
| | | | | | Length < 61 : SAFE (92/7)
| | | | | | Length >= 61
| | | | | | | Length < 64 : NOT_SAFE (2/0)
| | | | | | | Length >= 64 : SAFE (48/4)
| | | | | Protocol = UDP : SAFE (2/1)
| | | | | Protocol = NBNS : SAFE (0/0)
| | | | | Protocol = HTTP : SAFE (0/0)
| | | | | Protocol = DB-LSP-DISC : SAFE (0/0)
| | | | | Protocol = ICMP : SAFE (0/0)
| Length >= 67
| | Protocol = TCP : SAFE (181/0)
| | Protocol = UDP
| | | Length < 206
| | | | Length < 164 : SAFE (13/0)
| | | | Length >= 164 : NOT_SAFE (2/0)
| | | Length >= 206 : SAFE (25/0)
| | Protocol = NBNS : SAFE (22/0)
| | Protocol = HTTP : SAFE (41/0)
| | Protocol = DB-LSP-DISC : SAFE (32/0)
| | Protocol = ICMP : SAFE (97/0)
Length >= 797
| Protocol = TCP : SAFE (46/0)
| Protocol = UDP : SAFE (0/0)
| Protocol = NBNS : SAFE (0/0)
| Protocol = HTTP : SAFE (21/0)
| Protocol = DB-LSP-DISC : SAFE (0/0)
| Protocol = ICMP : NOT_SAFE (113/0)
Size of the tree : 43
Time taken to build model: 0 seconds
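The tree printed above can be read as a set of nested threshold rules on Length and Protocol. A minimal sketch of the same rules as a function is given below. The function name and the example packets are illustrative and are not part of the WEKA output; a protocol that does not appear in the tree falls through to SAFE here, which is an assumption of this sketch.

```python
def classify(protocol: str, length: float) -> str:
    """Walk the printed RandomTree and return 'SAFE' or 'NOT_SAFE'."""
    if length < 797:
        if length < 67:
            if length < 48.5:
                return "NOT_SAFE"
            if length < 56.5:
                return "SAFE"
            if length < 59:
                # 56.5 <= length < 59: only TCP is flagged
                return "NOT_SAFE" if protocol == "TCP" else "SAFE"
            # 59 <= length < 67: TCP is flagged only in the 61..64 band
            if protocol == "TCP":
                return "NOT_SAFE" if 61 <= length < 64 else "SAFE"
            return "SAFE"
        # 67 <= length < 797: UDP is flagged only in the 164..206 band
        if protocol == "UDP":
            return "NOT_SAFE" if 164 <= length < 206 else "SAFE"
        return "SAFE"
    # length >= 797: only ICMP is flagged
    return "NOT_SAFE" if protocol == "ICMP" else "SAFE"


# Hypothetical packets, for illustration only.
for proto, size in [("ICMP", 1000), ("ICMP", 98), ("TCP", 62), ("UDP", 180)]:
    print(proto, size, classify(proto, size))
```

Written this way, the dominant rule is easy to see: the largest NOT_SAFE leaf (113 instances) is ICMP traffic with a length of 797 or more.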
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 749 98.1651 %
Incorrectly Classified Instances 14 1.8349 %
Kappa statistic 0.9335
Mean absolute error 0.0315
Root mean squared error 0.135
Relative absolute error 10.986 %
Root relative squared error 35.6796 %
Total Number of Instances 763
Ignored Class Unknown Instances 4
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 0.998    0.098    0.980      0.998   0.989      0.935  0.963     0.987     SAFE
                 0.902    0.002    0.992      0.902   0.944      0.935  0.975     0.933     NOT_SAFE
Weighted Avg.    0.982    0.082    0.982      0.982   0.981      0.935  0.965     0.977
=== Confusion Matrix ===
a b <-- classified as
630 1 | a = SAFE
13 119 | b = NOT_SAFE
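The accuracy, precision and recall figures reported above follow directly from this confusion matrix. The short sketch below recomputes them as a cross-check. The function name is illustrative, and the matrix is assumed to be laid out as printed, with rows as actual classes and columns as predicted classes.

```python
def per_class_metrics(matrix, labels):
    """Return accuracy plus precision/recall per class.

    matrix[i][j] is the number of instances of actual class i
    that were classified as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                  # row total
        predicted = sum(row[i] for row in matrix)  # column total
        report[label] = {
            "precision": true_pos / predicted,
            "recall": true_pos / actual,
        }
    return report


cv = per_class_metrics([[630, 1], [13, 119]], ["SAFE", "NOT_SAFE"])
print(cv)
```

Rounded to three decimal places, the results agree with the Detailed Accuracy By Class table: 13 of the 132 NOT_SAFE instances were classified as SAFE, which gives the NOT_SAFE recall of 0.902.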
APPENDIX. F
RANDOM TREE (PERCENTAGE SPLIT)
=== Run information ===
Scheme: weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1
Relation: Normal Data_Malicious Data(Comma)-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.AllFilter-weka.filters.MultiFilter-Fweka.filters.AllFilter-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R5-weka.filters.unsupervised.attribute.Remove-R1-2
Instances: 767
Attributes: 3
Protocol
Length
Result
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
RandomTree
==========
Length < 797
| Length < 67
| | Length < 48.5 : NOT_SAFE (2/0)
| | Length >= 48.5
| | | Length < 56.5 : SAFE (22/0)
| | | Length >= 56.5
| | | | Length < 59
| | | | | Protocol = TCP : NOT_SAFE (1/0)
| | | | | Protocol = UDP : SAFE (1/0)
| | | | | Protocol = NBNS : SAFE (0/0)
| | | | | Protocol = HTTP : SAFE (0/0)
| | | | | Protocol = DB-LSP-DISC : SAFE (0/0)
| | | | | Protocol = ICMP : SAFE (0/0)
| | | | Length >= 59
| | | | | Protocol = TCP
| | | | | | Length < 61 : SAFE (92/7)
| | | | | | Length >= 61
| | | | | | | Length < 64 : NOT_SAFE (2/0)
| | | | | | | Length >= 64 : SAFE (48/4)
| | | | | Protocol = UDP : SAFE (2/1)
| | | | | Protocol = NBNS : SAFE (0/0)
| | | | | Protocol = HTTP : SAFE (0/0)
| | | | | Protocol = DB-LSP-DISC : SAFE (0/0)
| | | | | Protocol = ICMP : SAFE (0/0)
| Length >= 67
| | Protocol = TCP : SAFE (181/0)
| | Protocol = UDP
| | | Length < 206
| | | | Length < 164 : SAFE (13/0)
| | | | Length >= 164 : NOT_SAFE (2/0)
| | | Length >= 206 : SAFE (25/0)
| | Protocol = NBNS : SAFE (22/0)
| | Protocol = HTTP : SAFE (41/0)
| | Protocol = DB-LSP-DISC : SAFE (32/0)
| | Protocol = ICMP : SAFE (97/0)
Length >= 797
| Protocol = TCP : SAFE (46/0)
| Protocol = UDP : SAFE (0/0)
| Protocol = NBNS : SAFE (0/0)
| Protocol = HTTP : SAFE (21/0)
| Protocol = DB-LSP-DISC : SAFE (0/0)
| Protocol = ICMP : NOT_SAFE (113/0)
Size of the tree : 43
Time taken to build model: 0 seconds
=== Evaluation on test split ===
Time taken to test model on test split: 0 seconds
=== Summary ===
Correctly Classified Instances 255 98.0769 %
Incorrectly Classified Instances 5 1.9231 %
Kappa statistic 0.9269
Mean absolute error 0.0328
Root mean squared error 0.1363
Relative absolute error 11.5984 %
Root relative squared error 37.3158 %
Total Number of Instances 260
Ignored Class Unknown Instances 1
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 0.991    0.073    0.986      0.991   0.989      0.927  0.988     0.996     SAFE
                 0.927    0.009    0.950      0.927   0.938      0.927  0.986     0.899     NOT_SAFE
Weighted Avg.    0.981    0.063    0.981      0.981   0.981      0.927  0.988     0.981
=== Confusion Matrix ===
a b <-- classified as
217 2 | a = SAFE
3 38 | b = NOT_SAFE
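The Kappa statistic in the summary can also be recovered from this confusion matrix. The sketch below uses the standard Cohen's kappa definition, (observed agreement - chance agreement) / (1 - chance agreement), and assumes this is the quantity WEKA reports. The function name is illustrative.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual class)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Fraction of instances on the diagonal (observed agreement).
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


split_kappa = cohen_kappa([[217, 2], [3, 38]])
print(round(split_kappa, 4))
```

The value agrees with the 0.9269 reported above. Because SAFE instances make up most of the test split, a classifier that always predicted SAFE would already be right about 84% of the time, so kappa is a more conservative measure than raw accuracy for this data.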
REFERENCES
[1] (DHS) Department of Homeland Security, February 2016. [Online]. Available: https://www.dhs.gov/cybersecurity-overview.
[2] C. Wueest, “Security Response: Targeted Attacks against the Energy Sector,” CA. USA, 2014.
[3] (ABI) Allied Business Intelligence Research, January 2013. [Online]. Available: https://www.abiresearch.com/press/cyber-attacks-agains.
[4] E. Knapp, “Cyber security in process plants: Recognizing risks, addressing current threats,” 2015.
[5] Parsons, “Cybersecurity threats to the Oil & Gas Industry: Are you at Risk?,” 2015.
[6] Machine Learning Group at the University of Waikato, February 2017. [Online]. Available: http://www.cs.waikato.ac.nz/~ml/weka.
[7] M. Michela and C. Stuart, “Critical Infrastructure Security – Oil and Gas,” 2013.
[8] (ABI) Allied Business Intelligence Research, “PetroSecurity in the Digital Era: Legacy Systems vs. Cyber Threats,” 2013.
[9] W. Peter. [Online]. Available: https://www.eniday.com/en/sparks_en/cyber-threat-oil-and-gas-industry/.
[10] H. Abhiram, “Cyber Risk for Energy/Power Industry,” AON Risk Solution, India, 2016.
[11] S. Peerlkamp and M. Nieuwenhuis, “Process Control Network Security: Comparing frameworks to mitigate the specific threats to Process Control Network,” Amsterdam, 2010.
[12] M. Bill and R. Dale, “A Survey of SCADA and Critical Infrastructure Incidents,” in Proceedings of the 1st Annual conference on Research in information technology, Utah, 2012.
[13] B. Christopher and T.-R. Eneken, “The Cyber Attack on Saudi Aramco,” Survival: Global Politics and Strategy April–May 2013, vol. 55, pp. 81-96, April 2013.
[14] R. Costin, A. H. Mohamad, B. Sergey and M. Sergey, “From Shamoon to StoneDrill: Wipers attacking Saudi organizations and beyond,” 2017.
[15] A. Kiyuna and L. Conyers, CYBERWARFARE SOURCEBOOK, 1st ed., Lulu, 2015.
[16] K. Brent, “The Vulnerability of Nuclear Facilities to Cyber Attack,” Strategic Insights, vol. 10, no. 1, p. 25, 2011.
[17] K. Eduard, “Industry Reactions to U.S. Department of Energy Cyberattacks: Feedback Friday,” 2015.
[18] SecurityWeek, “Oil and Gas Industry Increasingly Hit by Cyber-Attacks: Report,” 2016.
[19] The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT), 2012. [Online]. Available: https://ics-cert.us-cert.gov.
[20] S. Chris, “Hacking oil and gas control systems: Understanding the cyber risk,” 2015.
[21] C. Yulia, B. Pete, B. Andrew, E. Peter, J. Kevin, S. Hugh and S. Kristan, “A review of cyber security risk assessment methods for SCADA systems,” Computers & Security, vol. 56, pp. 1–27, February 2016.
[22] FireEye, “SPEAR-PHISHING ATTACKS: WHY THEY ARE SUCCESSFUL AND HOW TO STOP THEM,” California, 2016.
[23] The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT), 2014. [Online]. Available: https://ics-cert.us-cert.gov.
[24] S. H. Houmb, “Protecting industrial control systems,” 2015.
[25] L. John, “ISO 31000: Risk Management - A practical guide for SMEs,” International Organization for Standardization, Geneva, 2015.
[26] SAS Institute Inc, April 2017. [Online]. Available: https://www.sas.com/en_us/insights/analytics/machine-learning.html.
[27] R. Margaret, February 2017. [Online]. Available: http://whatis.techtarget.com/definition/machine-learning.
[28] P. Dimitrios and M. Michela, “MACHINE LEARNING IN CYBERSECURITY TECHNOLOGIES,” United Kingdom, 2017.
[29] (ABI) Allied Business Intelligence Research, “Machine Learning in Cybersecurity to Boost Big Data, Intelligence, and Analytics Spending to $96 Billion by 2021,” (ABI) Allied Business Intelligence Research, January 2017.
[30] R. Lior and M. Oded, Data Mining with Decision Trees: Theory and Applications, 2nd ed., A. Yun, Ed., Toh Tuck Link: World Scientific Publishing, 2015.
[31] C. Alex, “Artificial Intelligence, Deep Learning, and Neural Networks Explained,” 2016.
[32] Wikipedia, May 2017. [Online]. Available: https://en.wikipedia.org/wiki/Naive_Bayes_classifier.
[33] H. Chris, 2014. [Online]. Available: www.howtogeek.com.
[34] V. Roland, “Creating firewall rules with machine learning techniques,” Nijmegen.
[35] R. Karthik, “Instant Metasploit Starter,” in The art of ethical hacking made easy with Metasploit, India, Packt Publishing Limited, 2013, p. 52.
[36] C. David, G. Zoubin and J. Michael, “Active Learning with Statistical Models,” Journal of Artificial Intelligence Research, vol. 4, no. 1996, pp. 129-145, 1996.
[37] S. Greg and C. David, “Less is More: Active Learning with Support Vector Machines,” Proceedings of the Seventeenth International Conference on Machine Learning, no. 17, pp. 839-846, 2000.