Computer Networks 157 (2019) 99–111
Contents lists available at ScienceDirect: Computer Networks
Journal homepage: www.elsevier.com/locate/comnet
https://doi.org/10.1016/j.comnet.2019.04.007

ASSCA: API sequence and statistics features combined architecture for malware detection

Lu Xiaofeng a,*, Jiang Fangshuo a, Zhou Xiao a, Yi Shengwei b, Sha Jing c, Pietro Lio d

a School of Cyberspace Security, Beijing University of Posts and Telecommunications, China
b China Information Technology Security Evaluation Center, China
c The Third Research Institute of Ministry of Public Security, China
d Computer Laboratory, University of Cambridge, United Kingdom

* Corresponding author. E-mail address: [email protected] (L. Xiaofeng).

This research work was supported by the National Natural Science Foundation of China (Grant No. 61472046), the Opening Project of the Key Lab of Information Network Security of the Ministry of Public Security (The Third Research Institute of the Ministry of Public Security), the Seed Funds of the Beijing Association for Science and Technology, and the NSFOCUS Kunpeng Funds.

Article history: Received 13 August 2018; Revised 29 March 2019; Accepted 15 April 2019; Available online 18 April 2019.

Keywords: Computer virus and prevention; Malware classification; Machine learning; Deep learning; Call sequence

Abstract

In this paper, a combined deep learning and machine learning model is proposed for malware behavior analysis. One part analyzes the dependency relations in the API (Application Programming Interface) call sequence at the functional level and extracts features for a random forest to learn and classify. The other part employs a bidirectional residual neural network, with redundant-information preprocessing, to learn the API sequence and detect malware; in an API call sequence, future information is particularly important for inferring the semantics of the current API call. We conducted experiments on a malware dataset. The results show that both methods detect malware effectively, but the combined framework has better classification performance: its classification accuracy is 0.967.

© 2019 Published by Elsevier B.V.

1. Introduction

With the rapid development of the Internet, malware has grown greatly in both variety and quantity, and its propagation modes are constantly being updated. More and more targeted attacks now focus on Internet of Things (IoT) devices. The Internet of Things encompasses hundreds or thousands of types of devices in every industry. According to [1], there will be more than 7 trillion IoT devices by 2025, an estimated 1000 devices per person. Malware such as spam, privacy leaks, botnets, distributed denial-of-service and advanced persistent threats is still rampant in the IoT paradigm. Weak security and rookie mistakes by IoT device manufacturers compound the problem: McAfee Labs indicates that 80% of IoT apps are not tested for vulnerabilities. Attacks on IoT devices will increase rapidly because of the hyper-growth of IoT devices, their weak security and the high value of the data on these devices [2]. Ransomware is the most likely near-term threat, as it has proven to be a relatively easy way for criminals to make money. For example, ransomware on a smart television can encrypt the system files in the television and demand money for the key to decrypt them. We already see IoT devices being held for ransom in the power distribution and health care verticals.
To defend against the wide variety of malware, antivirus programs and firewalls still play an important and widely deployed role. Such real-time defenses detect different types of malware, ranging from viruses to worms. Commercial anti-malware solutions rely on a signature database to detect malware because of its efficiency and simplicity [3]: a file is scanned and checked for content that matches known signatures. Several commonly used and effective signature matching algorithms exist, such as Aho-Corasick [4] and Wu-Manber [5].

However, malware authors are aware of the rules behind antivirus programs, so they can change the digital signature of a malware sample to make it look harmless. Antivirus programs therefore need to be updated regularly to cover the latest malware types, and these systems have become increasingly bloated as malware attacks evolve [6,7]. The heavy resource consumption (memory, CPU and database updates) caused by the growing number of signatures leads to low detection performance, so host-based antivirus programs are not suitable for resource-constrained IoT devices [1].



Traditionally, there are two major approaches to malware detection, distinguished by how malware is analyzed: static analysis and dynamic analysis [8]. In static analysis, malware files are analyzed in binary form, or otherwise decompressed and/or decompiled into assembly representations. In dynamic analysis, binary files are executed and the behavior of the program, such as its system calls, is recorded by hooking or by some internal access to the virtualized environment. In principle, dynamic detection observes the actual operation of malicious software and is not easily confused [9]. Static analysis, on the other hand, is vulnerable to obfuscation: malware developers can increase the difficulty of static analysis with code obfuscation techniques [10,11]. Obfuscation algorithms include garbage insertion, which adds sequences that do not modify the behavior of the program (e.g., nop instructions); code reordering, which changes the order of program instructions; and variable renaming, which replaces a variable identifier with another one [12].

In recent years, there have been more and more APT (Advanced Persistent Threat) attacks and advanced viruses targeting industrial control systems and the IoT. The great number of variants of these viruses makes detection based on fixed signatures more difficult, so the identification of unknown malware has become a new challenge. Besides signature-based antivirus approaches, using machine learning techniques to categorize malware can be considered an alternative solution.

With the development of artificial intelligence technology, researchers have proposed machine learning techniques to detect and classify malicious samples, with some success [13,14]. However, deep learning is oversensitive to data it has never seen, and recent work gives many examples of the oversensitivity of deep neural networks to context [15]. A deep learning model needs to be combined with a traditional machine learning model to achieve more robust, human-level intelligence [16]. In this paper, we propose a dynamic-analysis-based malware detection architecture, the API Sequence and Statistics features Combined Architecture (ASSCA), which combines machine learning and deep learning. We study the API call sequence with deep learning and the API statistical features with machine learning. The final classification of a suspicious piece of software is determined from the two classifiers.

The contributions of this paper are as follows:

• This paper proposes a system architecture that combines a deep learning model based on sequence data with a machine learning model based on API statistical features. The traditional machine learning model and the recurrent neural network model are integrated to obtain a better classifier.
• This paper proposes a bidirectional residual LSTM (Long Short-Term Memory) model to study the API call sequence. A bidirectional LSTM can use both the previous context and the later context. We add a residual connection to the bidirectional LSTM model, which greatly improves the training of deeper neural networks. To the best of our knowledge, this is the first paper to propose using a bidirectional residual neural network to address the malware problem.
• This paper proposes a new API association analysis algorithm based on argument hashing, the AMHARA algorithm, and then studies how to optimize this approach.
• This paper studies the problem of long sequences in recurrent neural networks, proposing a new method of removing redundant API call modes from the API call sequence.

2. Related work

Traditionally, there are two major approaches to malware detection: static analysis and dynamic analysis.

Static features are the main study objective of machine learning methods. The researcher decompiles the target software and then obtains its static features. Santos et al. [17] mainly used static features of PE files in the Windows environment to analyze malware. Liu et al. [18] decompiled an APK to get the sensitive APIs, strings, application permissions and certificates requested by the APK to detect malware. Yang et al. [19] used APK permissions, application actions and application categories as input features and proposed an improved random forest algorithm.

Some researchers extract features from different levels or aspects, and it has been noted that classification based on combined features improves recognition accuracy. MADAM [20] monitors Android systems at four levels (kernel, application, user, and software package) and retrieves five sets of functions. RanDroid [21] extracts requested permissions and vulnerable API calls, along with the presence of key information such as dynamic code, reflection code, native code, cryptographic code and databases, from applications and uses them as features in various machine learning classifiers.

These static-feature-based approaches perform well, but they can hardly capture dynamic behavior, and some malware only performs malicious behavior while running. Static analysis, moreover, is susceptible to obfuscation: malware developers can increase the difficulty of static analysis with code obfuscation techniques [10,11].

In recent years, many researchers have used neural network methods to classify malicious samples on large-scale data sets [22,23]. Alsulami et al. [22] use Microsoft Windows prefetch files to detect malware. Saxe et al. [23] used a feedforward neural network to classify static analysis results. After feedforward neural networks became popular, the CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) deep learning models and their improved versions became the major focus of malicious sample detection.

Some researchers have improved detection by adding other features to the system call sequence [24–26]. Bojan et al. [24] proposed using N-grams to extract dynamic system call sequence features; the extracted time-series features were then fed into an RNN model for classification. Shun et al. [25] proposed extracting the API sequence and API return values as a feature sequence for an RNN model; the RNN extracts high-level features, and the resulting feature vector is fed into a CNN model for classification. Microsoft researcher Jack Stokes et al. [26] fed API call sequences into RNN and ESN (Echo State Network) models. They used a recurrent model trained to predict the next API call and used the hidden state of the model (which encodes the history of past events) as the fixed-length feature vector given to a separate classifier. Their raw data stream consists of 114 high-level events generated by the anti-malware engine, which encode all of the low-level APIs.

The above deep learning methods are effective for malicious sample detection. Using deep learning to classify malware does not require much prior knowledge and the procedure is simple, but the extracted features do not have a clear meaning. Moreover, the network architectures in the above papers are usually simple, not large enough to detect most unknown malware. The deep neural network in our study is a bidirectional LSTM with two flows, unlike a normal LSTM: one flow runs from the beginning of the input to the end, and the other from the end of the input to the beginning. Both flows have their own memory and can improve detection performance. The structure of our deep neural network is different from theirs.

Some researchers have discovered the value of system call pairs and used them for malware detection.

MAMADROID [27] relies on the sequence of abstracted API calls performed by an app. It builds a statistical model to represent the transitions between the API calls performed by an app; specifically, it models these transitions as Markov chains to perform classification. VAED [28] keeps track of system calls to calculate the transition probability distribution of each pair of system calls. However, as their study uses fewer than 480 samples in total, their method has yet to be validated on a bigger dataset.

In our study, we believe that the calling relationship between system calls can reflect the true behavior of a function during program execution. However, we propose another method to extract the system call pairs, and we use the system call pair frequency instead of the system call transition probability as the feature.

Numerous researchers have used N-grams to process opcode sequences or system calls [29–31]. Zhang et al. [29] applied the N-gram model to process opcodes and classified malware by features such as permissions. Ravi et al. [18] used N-grams to process opcode sequences and used decision trees, KNN (K-Nearest Neighbor), Naive Bayes and SVM (Support Vector Machine) to classify malware. Canfora et al. [31] used 3-grams over sequences of system calls to detect malware.

In the literature on dynamic analysis, the N-gram method is used to process long sequences and obtain feature vectors. The N-gram method has certain limitations: inserting unrelated APIs into malicious code can reduce its effectiveness. In our study, we use a variety of methods to shorten the length of the system call sequence, and we use the N-gram method to remove API subsequences that have little effect on classification from the system call sequence.

Beginning in 2018, researchers attempted to combine feature learning with deep learning. Alan Yuille pointed out that combining models during training is the key to overcoming the defects of deep learning [16]. Guen [32] studied combining multiple models for Android malware detection, building several separate deep learning models for different features and then combining them. In contrast, we build machine learning models for features and deep learning models for sequences, and then combine the machine learning models with the deep learning models. This combined model is more robust because of the structural knowledge in the machine learning model [16].

3. System architecture

3.1. System framework

There are two kinds of signature-based antivirus approaches according to their infrastructure. One is host-based antivirus programs, which install a detection agent on the user's device and update the signature database to ensure timely and complete security protection. The other is cloud-based antivirus systems [15], which place different types of detection agents on cloud servers: a user can upload any type of file and receive a report about the malware that might be contained in the file [16–18]. This newly developed framework is cost-saving for resource-constrained devices.

The high volume, and sometimes the resource constraints, of IoT devices make it impossible to manage and secure them the way we secure traditional IT systems. For this reason, centralized control systems will be developed to manage and secure IoT devices automatically and in aggregate [2]. In our study, we build a server-based centralized antivirus system. The antivirus server captures files in IoT networks and detects whether the files contain malware. A user can also upload any type of file to the antivirus server and receive a report about the malware that might be contained in the file.

Fig. 1. Malware classification system.

3.2. Malware detection based on combined learning model

The structure of the combined learning model ASSCA is shown in Fig. 1, which also shows the malicious sample classification process. The first step is to get samples; the samples are then executed in the Cuckoo sandbox, which can extract all the system calls. The antivirus server combines a deep learning model based on the API sequence data with a machine learning model based on API statistical features to categorize malware. The deep learning model is the bidirectional residual LSTM model, and the machine learning model is a random forest.

The lengths of the API call sequences extracted by the sandbox differ widely, and most API sequences are very long. Overly long sequences cannot be fed into the deep learning model, so in this paper several algorithms for reducing API call sequence length are studied as well. The preprocessed API sequences are then fed into the deep learning model. At the same time, we propose a new API call association analysis method to extract features from the API call sequences. The TensorFlow [33] framework is used to train and test the LSTM model.

4. Deep learning method

We use the Bidirectional Residual LSTM model to classify malware.

4.1. Sequence data preprocessing

The data extracted from the sandbox is the API call sequence. Each API function is represented by a specific integer for the subsequent data processing and model classification, and different API functions map to different integers.

4.1.1. Redundant subsequence removal method

(1) N-Gram sample subsequence extraction

The length of the API call sequence varies greatly between samples, ranging from 1 to 1,000,000. There are a large number of common API subsequences in the API sequences of both plain software and malware [17,18]. These subsequences have little effect on identifying malware, and removing the subsequences with a small effect on classification not only accelerates the procedure but also increases accuracy. In this paper, the text-based N-gram method is used to extract the subsequences. The basic idea of the N-gram algorithm in malware detection is to slide a window of size N over the API stream. After the N-gram process, a sample forms a list of subsequences of length N, which form the vector space of the text content. For example, for the API sequence {24, 26, 18, 18, 6, 13}, with N being 4 and the sliding window moving with step 1, the 4-gram subsequence list is {(24, 26, 18, 18), (26, 18, 18, 6), (18, 18, 6, 13)}.
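As an illustration of this step, the following is a minimal Python sketch of the sliding-window extraction described above. The example sequence is taken from the text; the function name and typing are our own illustrative choices, not the authors' code.

```python
from typing import List, Tuple

def extract_ngrams(api_sequence: List[int], n: int = 4, step: int = 1) -> List[Tuple[int, ...]]:
    """Slide a window of size n over the integer-encoded API stream."""
    return [tuple(api_sequence[i:i + n])
            for i in range(0, len(api_sequence) - n + 1, step)]

# Example from the paper: {24, 26, 18, 18, 6, 13} with N = 4 and step 1
print(extract_ngrams([24, 26, 18, 18, 6, 13], n=4))
# -> [(24, 26, 18, 18), (26, 18, 18, 6), (18, 18, 6, 13)]
```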

Table 1
4-Gram subsequence connection.

4-Gram subsequence 1    4-Gram subsequence 2    Connection
S1 S2 S3 S4             S2 S3 S4 S5             S1 S2 S3 S4 S5
S1 S2 S2 S2             S2 S2 S2 S2             S1 S2 S2 S2 S2

Table 2
Continuous same API mode removal.

Continuous same mode API       After removal
S1 S1 S1 S1 S1 S1              S1 S1
S1 S2 S4 S3 S1 S2 S4 S3        S1 S2 S4 S3

(2) Removing redundant subsequences by information gain

The dimension of the subsequence lists extracted by the N-gram method is very large, so it must be reduced: only those subsequences that have a large effect on the classification of malware should be kept. In feature engineering, information gain is a simple and efficient method, and it has proven very effective for handling N-grams [34]. In this paper, information gain is used to select the key subsequences.

For a single feature, the information entropy of the system changes depending on whether the system keeps or abandons the feature, and the difference between the two entropies is the information gain brought by this feature to the system. The amount of information is measured by entropy. Let C be a category; H(C) is the information entropy of category C and can be calculated by Eq. (1):

H(C) = -Σ_{i=1}^{n} p(c_i) log p(c_i)    (1)

where p(c_i) is the proportion of samples of category c_i.

The information gain of subsequence T with respect to class C is:

IG(T) = H(C) - H(C|T) = Σ_{V_T ∈ {0,1}} Σ_{C ∈ {C_i}} P(V_T, C) log( P(V_T, C) / (P(V_T) P(C)) )    (2)

In Eq. (2), V_T is the indicator of subsequence T: V_T = 1 when T appears in a sample, otherwise V_T = 0. P(V_T, C) is the ratio of subsequence T in category C, P(V_T) is the ratio of subsequence T in all the samples, and P(C) is the ratio of category C in all the samples.

A subsequence with higher information gain indicates that its frequencies in the two sample sets are quite different; hence it reflects the differences between normal samples and malicious samples and provides better discrimination.
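To make the selection step concrete, here is a small Python sketch of Eq. (2) for a single candidate subsequence, computed from per-sample presence flags. It is a minimal illustration under our own naming, not the authors' implementation.

```python
import math
from typing import List

def information_gain(presence: List[int], labels: List[int]) -> float:
    """IG(T) per Eq. (2): presence[i] is V_T (0/1) for sample i, labels[i] its class."""
    n = len(labels)
    ig = 0.0
    for v in (0, 1):
        for c in set(labels):
            joint = sum(1 for p, y in zip(presence, labels) if p == v and y == c) / n
            p_v = presence.count(v) / n
            p_c = labels.count(c) / n
            if joint > 0:
                ig += joint * math.log2(joint / (p_v * p_c))
    return ig

# Toy usage: a subsequence that appears mostly in malicious samples (label 1)
print(information_gain([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 1]))
```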

(3) Sample valid subsequence connection

The list of subsequences for each sample obtained by the N-gram step is taken as a data set. The feature subsequences selected by the information gain method are called valid subsequences, and the set of valid subsequences is called a dictionary. The 4-gram subsequences included in the dictionary are extracted from all the 4-gram subsequences of a sample and arranged in chronological order. These subsequences are then connected to get the sample's valid call sequence; the method of connecting the subsequences is shown in Table 1. The connected call sequence omits the subsequences with little effect on classification, greatly reducing the length of the sample's API call sequence. Because of this, model training and detection are faster and more accurate than before.
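The following Python sketch illustrates one way to implement the dictionary filtering and the overlap-aware connection of Table 1. The merging rule for overlapping neighbours is our reading of the table, so treat it as an assumption rather than the authors' exact procedure.

```python
from typing import List, Set, Tuple

def connect_valid_subsequences(ngrams: List[Tuple[int, ...]],
                               dictionary: Set[Tuple[int, ...]]) -> List[int]:
    """Keep only dictionary n-grams (in order) and merge overlapping neighbours."""
    valid = [g for g in ngrams if g in dictionary]
    if not valid:
        return []
    connected = list(valid[0])
    for prev, cur in zip(valid, valid[1:]):
        if prev[1:] == cur[:-1]:      # overlap as in Table 1, row 1
            connected.append(cur[-1])
        else:                         # no overlap: append the whole n-gram
            connected.extend(cur)
    return connected

# Toy usage with a dictionary containing two of the example 4-grams
dictionary = {(24, 26, 18, 18), (26, 18, 18, 6)}
grams = [(24, 26, 18, 18), (26, 18, 18, 6), (18, 18, 6, 13)]
print(connect_valid_subsequences(grams, dictionary))   # -> [24, 26, 18, 18, 6]
```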

4.1.2. Continuous same API mode removal

While analyzing API call sequences, we found that the same API is often called many times in a row, which indicates information redundancy. In order to reduce the length of the numerical sequence fed into the model while keeping the information entropy of the system calls, Tobiyama [25] proposed removing consecutive calls to the same API function. This paper improves that approach: we propose removing continuous repetitions of the same API mode in the preprocessing, as shown in Table 2, where an API mode is an API call subsequence. For a single API that appears more than twice in a row, we keep only two occurrences to indicate that the call appears multiple times. For an API mode that appears more than once in a row, we keep only one occurrence, because one instance of the pattern is enough to represent that the corresponding behavior exists.
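A minimal Python sketch of the two rules (keep at most two consecutive copies of a single API, keep one copy of a consecutively repeated mode) might look as follows; the fixed range of candidate mode lengths is our own simplification, not something the paper specifies.

```python
from typing import List

def collapse_repeats(seq: List[int], max_mode_len: int = 4) -> List[int]:
    """Apply the rules of Table 2 to an integer-encoded API call sequence."""
    # Rule 1: a single API repeated more than twice is kept only twice.
    out: List[int] = []
    for x in seq:
        if len(out) >= 2 and out[-1] == x and out[-2] == x:
            continue
        out.append(x)
    # Rule 2: a consecutively repeated multi-call mode is kept only once.
    for m in range(2, max_mode_len + 1):
        collapsed: List[int] = []
        for x in out:
            collapsed.append(x)
            if len(collapsed) >= 2 * m and collapsed[-m:] == collapsed[-2 * m:-m]:
                del collapsed[-m:]
        out = collapsed
    return out

print(collapse_repeats([1, 1, 1, 1, 1]))            # -> [1, 1]
print(collapse_repeats([1, 2, 4, 3, 1, 2, 4, 3]))   # -> [1, 2, 4, 3]
```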

4.2. Deep learning algorithm model

4.2.1. Bidirectional LSTM model

In a recurrent neural network (RNN), the output of each hidden layer is retained in the network and, together with the next input, determines the output at the next moment. This recycling is what has made RNNs so successful on time-series tasks such as machine translation. The LSTM unit greatly improves on the basic RNN unit with respect to long-term dependencies: it overcomes gradient vanishing and gradient explosion during back-propagation, so it performs better on long sequence data.

The LSTM model is used in the experiment. As the number of LSTM layers increases, the extracted features become more abstract and the expressive power grows stronger, but more time is consumed. Taking these factors into account, this paper studies a two-hidden-layer LSTM structure.

The reason an RNN performs well on sequence learning is that it makes use of the previous context in the sequence. With a bidirectional RNN, we can use not only the previous context but also the future context. Especially in an API call sequence, future information is much more important for inferring the semantics of the current API call. For example, an encryption call is normal behavior on its own, but if it is followed by an API call that deletes a file, the behavior of the program is malicious; with many such behaviors, we can be confident that the program is ransomware.

A bidirectional LSTM has two flows, unlike a normal LSTM. One flow is forward, from the beginning of the input to the end; the other is backward, from the end of the input to the beginning. Both flows have their own memory, as shown in Fig. 2.

The input to the model is the time sequence X = {x_0, x_1, ..., x_n}. x_t is the input at moment t, a system call function of a sample obtained by preprocessing, represented as a one-hot vector. W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o are the weight matrices, and b_i, b_f, b_c, b_o are the bias vectors.

First, calculate the input gate, the candidate memory state and the forget gate at moment t:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
C'_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (3)

Second, calculate the memory cell state, the output gate and the hidden layer output:

C_t = i_t * C'_t + f_t * C_{t-1}
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t * tanh(C_t)    (4)

The hidden layer output h_t contains the feature information about the call sequence, and this feature information is fed into the classification model for malicious sample classification.

Fig. 2. Bidirectional Residual LSTM classification model structure.

After both directional passes, we calculate the element-wise sum of the outputs of the two directions. Eq. (5) shows how this works; the arrow attached to a symbol indicates whether it belongs to the forward (→) or the backward (←) flow.

y→_n = lstm→_n(y→_{n-1})
y←_n = lstm←_n(y←_{n-1})
y_n = y→_n + y←_n    (5)

4.2.2. Residual connection

Besides the bidirectional technique, we also use residual connections in our network model. After improving the basic LSTM into a bidirectional LSTM, we need to build more layers for better sequence modeling. Residual connections greatly improve the training of deeper neural networks: they overcome the problem that a deeper network can perform worse than a simpler one, so we can build a much deeper network to represent a more complex model.

We build the network as shown in Fig. 2. The basic structure we build is called the SA (Sequence Analysis) Block. The SA Block contains a bidirectional LSTM layer with an extra residual connection that forwards the input to the output of the LSTM layer; a bidirectional LSTM with such a residual connection is defined as a Bi-Residual LSTM. We therefore sum the forward flow's output, the backward flow's output and the residual connection's output, as shown in Eq. (6). The last layer of the block is a tanh activation function, which improves the network's nonlinearity. With SA Blocks we can build a much deeper network without worrying about gradient vanishing or under-fitting.

y_n = tanh(y→_n + y←_n + y_{n-1})    (6)

4.2.3. Max-Pooling

It is important to process the hidden layer feature information and obtain abstract features that represent the sequence information. As a non-linear subsampling layer, Max-Pooling has achieved good results in the convolutional neural network models used in computer vision, where the subsampling layer helps prevent over-fitting and reduces the dimension of the feature vector. Similarly to its goal in computer vision, Max-Pooling is used here to obtain fixed-length eigenvectors for the classifier and to reduce the dimension without losing the hidden layer information. Max-Pooling obtains the eigenvector as follows:

h_max(i) = max(h_0(i), h_1(i), ..., h_n(i))    (7)

where i ∈ (0, 1, 2, ..., N-1) and N is the number of neurons per hidden layer. Finally, the eigenvector fed into the classification model is the concatenation of the Max-Pooling output and the last state vector of the hidden layer, [h_max, h_n].

4.2.4. Classification

The feature vector obtained from Max-Pooling is finally fed into the classifier. Because malware classification is a binary classification problem, the logistic function is chosen as the classifier. The classifier therefore consists of a fully connected layer and a logistic layer.
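As a compact illustration of Sections 4.2.1–4.2.4, the following Keras sketch builds an embedding layer, a first bidirectional LSTM, two SA Blocks implementing Eq. (6) (forward + backward + residual, then tanh), max-pooling over time per Eq. (7) concatenated with the last timestep, and a logistic output. The layer sizes follow Table 6 in Section 6.3.1; the exact wiring and names are our approximation of the architecture, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, VOCAB, HIDDEN = 1000, 400, 128   # values reported in Table 6

def sa_block(x):
    """SA Block: Bi-LSTM whose two directions and the residual input are summed (Eq. 6)."""
    fwd = layers.LSTM(HIDDEN, return_sequences=True, dropout=0.5)(x)
    bwd = layers.LSTM(HIDDEN, return_sequences=True, dropout=0.5, go_backwards=True)(x)
    bwd = layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(bwd)   # re-align time order
    summed = layers.Add()([fwd, bwd, x])                          # forward + backward + residual
    summed = layers.BatchNormalization(momentum=0.99, epsilon=0.001)(summed)
    return layers.Activation("tanh")(summed)

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB, HIDDEN)(inputs)
# First bidirectional LSTM layer, the two directions summed element-wise as in Eq. (5)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True, dropout=0.5),
                         merge_mode="sum")(x)
x = sa_block(x)
x = sa_block(x)
pooled = layers.GlobalMaxPooling1D()(x)           # Eq. (7): max over the time axis
last = layers.Lambda(lambda t: t[:, -1, :])(x)    # last hidden state h_n
features = layers.Concatenate()([pooled, last])   # [h_max, h_n], size 256
outputs = layers.Dense(1, activation="sigmoid")(features)   # logistic classifier

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
```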

5. Machine learning method

Deep learning has shown better accuracy in some specific areas in recent years, but traditional machine learning is still more explainable and has lower hardware requirements. This paper combines the deep learning approach with a traditional machine learning approach to get a better result in Section 6.5.

5.1. API call association analysis

Given discrete API call sequences, the relationship between API calls is an important potential feature for statistics-based traditional machine learning. Unlike recurrent neural networks, algorithms such as SVM or decision trees cannot exploit the call relationships in a sequence as features. An abstract association graph, which transforms discrete API calls into a DAG (Directed Acyclic Graph) according to their call relationships, is a good way to turn call relationships into a feature for a machine learning algorithm.

However, this method leads to a large number of different association graphs because of the complexity of computer programs. One solution is to divide the association graph into sub-graphs of fewer kinds, but in practice there is often no clear boundary between the graphs, and the time complexity of searching for subgraphs is high.

So, for the above problems, we found a way to simplify the association call graph. For a complex call graph, a pair (s, t) is used to represent two associated calls s and t, which simplifies an association graph into a set of pairs. The basic idea is to construct the association graph and then find the associated call pairs. However, this method takes a lot of time.

This paper therefore proposes a fast API association analysis algorithm based on argument hashing (ArguMent-Hashing based API coRrelation fast Analysis, the AMHARA algorithm). The algorithm traverses the whole API call sequence. For the current API call s_i, it traverses the previous T API calls in reverse, and the arguments of s_i and s_j are compared for each call s_j (i - T < j < i), as Fig. 3 shows. Both s_i and s_j have argument lists, [s_i^0, s_i^1, s_i^2, ..., s_i^m] and [s_j^0, s_j^1, s_j^2, ..., s_j^n]. The algorithm enumerates the arguments in each argument list and compares them one by one. In the following formula, the Compare function returns 1 if the two input arguments are exactly the same and 0 otherwise. Thus, if S_{i,j} > 0, then s_i and s_j are associated APIs, and the associated call pairs can be used as features of the sample:

S_{i,j} = Σ_{k=0}^{m} Σ_{l=0}^{n} Compare(s_i^k, s_j^l)    (8)

Fig. 3. AMHARA algorithm sketch.

In this algorithm, choosing T is a hard problem. If T is too small, much association information is lost; if T is too large, the algorithm costs a lot of time. We can use the locality principle of program execution to determine T: to implement a function, related APIs tend to appear close together, and even considering malware obfuscation techniques, associated APIs will not be far apart in the API sequence. We chose T = 100 based on our experiments and hardware.

When arguments are compared, some arguments are long, such as a buffer. If the average length of an argument is k, the time complexity of this algorithm is O(mnk). This paper optimizes the argument comparison as follows: first, each argument is hashed item by item, and then the hash values are compared directly. The upper-bound time complexity becomes O(mn), as in the following formula:

S_{i,j} = Σ_{k=0}^{m} Σ_{l=0}^{n} Compare[hash(s_i^k), hash(s_j^l)]    (9)

After the algorithm executes, we get a collection of associated call pairs that records not only which associated call pairs exist in the sequence, but also how many times each pair occurs.
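Since the pseudocode itself is only available as Fig. 4 in the original PDF, the sketch below is our reading of the AMHARA idea in Python: hash each call's arguments once, then, inside a window of T previous calls, count pairs that share at least one argument hash (Eq. 9). Function and variable names are ours, not the authors'.

```python
from collections import Counter
from typing import List, Tuple

def amhara_pairs(calls: List[Tuple[str, List[str]]], T: int = 100) -> Counter:
    """calls: list of (api_name, argument_list). Returns counts of associated API pairs."""
    hashed = [(name, {hash(a) for a in args}) for name, args in calls]  # hash arguments once
    pairs: Counter = Counter()
    for i, (name_i, args_i) in enumerate(hashed):
        for j in range(max(0, i - T), i):              # reverse window: i - T < j < i
            name_j, args_j = hashed[j]
            if name_i != name_j and args_i & args_j:   # S_ij > 0: a shared argument hash
                pairs[(name_j, name_i)] += 1           # self-call pairs filtered (Section 5.2)
    return pairs

# Toy usage: NtCreateFile and NtWriteFile sharing a handle-like argument
trace = [("NtCreateFile", ["C:\\a.txt", "0x20"]),
         ("NtWriteFile", ["0x20", "buf"]),
         ("NtClose", ["0x20"])]
print(amhara_pairs(trace, T=100))
```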

Fig. 4. Pseudocode of the AMHARA algorithm.

The algorithm shown in Fig. 4 effectively reduces the number of features, shrinking the feature space to an acceptable range with a large speed improvement over the original method. For example, for a sequence drawn from 400 API species, the number of call graphs, i.e., features, exceeds one million with the association graph method. By simplifying the features to call pairs, this paper reduces the number of extracted features to fewer than 20,000 (400 * 400). At the same time, the algorithm improves processing efficiency and avoids the complicated processing of the association graph.

This algorithm also avoids the failure of the N-gram method to find relationships between API calls that are far apart in the sequence but essentially related. It makes the extracted features more reflective of the sample's true intent and prevents cheating by malicious samples.

5.2. API call frequency analysis

In the classification process, the statistics of individual API calls are added as additional features, which improves the classification accuracy with only a small number of extra features. In some malicious samples, the count of a single API call can by itself indicate whether the sample is malicious. For instance, in May 2017 the well-known "WannaCry" ransomware, shown in Fig. 5, encrypted files on the disk and deleted the original files, causing an estimated $1 billion in damage in just its first four days around the world. From the "NtCreateFile" and "DeleteFile" API calls alone, we can determine whether a sample shows similar ransomware behavior. Table 3 lists the 150,079 API calls made while "WannaCry" executed in the sandbox, where "NtCreateFile" accounts for 7.0% and "DeleteFile" accounts for 4.8%. When a piece of software writes a big file there are many NtWriteFile system calls, so NtWriteFile cannot be used as a feature of malware.

Fig. 5. "WannaCry" run shot.

Table 3
'WannaCry' API calls (top 10).

API species             API number
NtWriteFile             19,622
NtClose                 16,819
NtCreateFile            10,486
DeleteFile              7226
NtQueryKey              7076
GetSystemMetrics        6861
NtReadFile              6670
GetFileAttributes       6486
MoveFileWithProgress    6394
RegQueryValueExW        6029
...                     ...
Total                   150,079
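A minimal sketch of the frequency features described above, using Python's Counter over the same kind of trace as before; the normalization into shares mirrors the WannaCry percentages quoted in the text, while the variable names are our own.

```python
from collections import Counter
from typing import Dict, List

def api_frequencies(api_calls: List[str]) -> Dict[str, float]:
    """Return the share of each API in the trace (e.g. NtCreateFile -> 0.07)."""
    counts = Counter(api_calls)
    total = sum(counts.values())
    return {api: n / total for api, n in counts.items()}

trace = ["NtCreateFile", "NtWriteFile", "DeleteFile", "NtCreateFile", "NtClose"]
print(api_frequencies(trace))   # individual-call statistics used as extra features
```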

Table 4
API pair matrix.

Table 5
Confusion matrix.

                            Actual label
                            Positive               Negative
Predicted   Positive'       True Positive (TP)     False Positive (FP)
label       Negative'       False Negative (FN)    True Negative (TN)

From another point of view, we can treat the API call pairs as a matrix, as shown in the left part of Table 4. In the API call pair analysis, we filtered out the self-call relationship, because a repeated API call should not be counted as a call relationship. To compensate for the features lost by filtering those call pairs, we can make the API pair matrix a full matrix, as shown in the right part of Table 4, by adding the statistics of individual API calls.

5.3. Random forest classifier

As a statistical classification model, the random forest performs well in all respects. Its random feature selection ensures a smaller generalization error, which is quite advantageous for small data sets. High-dimensional data can be processed without manual feature selection, and it is easy to carry out large-scale parallel training on a distributed computing platform. We therefore choose the random forest as the classifier, and it indeed performs better in our experiments.
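To tie Sections 5.1–5.3 together, the sketch below flattens the pair counts and the individual API frequencies into one feature vector per sample and trains a scikit-learn random forest. The vectorization strategy (DictVectorizer over merged dictionaries) and the parameter values are our illustrative choices, not necessarily the authors' pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def sample_features(pair_counts, api_freqs):
    """Merge associated-pair counts and single-API statistics into one feature dict."""
    feats = {f"pair::{a}->{b}": c for (a, b), c in pair_counts.items()}
    feats.update({f"freq::{api}": f for api, f in api_freqs.items()})
    return feats

def train_random_forest(samples, labels):
    """samples: list of (pair_counts, api_freqs) tuples built with the helpers above."""
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform([sample_features(p, f) for p, f in samples])
    clf = RandomForestClassifier(n_estimators=200, bootstrap=True, n_jobs=-1)
    return vec, clf.fit(X, labels)
```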

6. Experiments

6.1. Evaluation indicators

This paper uses Google's TensorFlow [33] framework to implement the algorithm. For the classification algorithm, the evaluation indicators are accuracy (ACC), the Precision–Recall Curve (PRC), the Receiver Operating Characteristic curve (ROC) and the Area Under the ROC Curve (AUC). The ROC curve has an advantage over the PRC in that it remains essentially unchanged when the distribution of positive and negative samples in the test set changes. The ROC curve carries a lot of information but does not directly show which classifier is better; the AUC summarizes it as a single number, and a higher AUC indicates a better classifier. Therefore, for the binary classification algorithm, ROC and AUC are selected as evaluation indexes.

Table 5 is a confusion matrix, which divides instances into a positive class and a negative class. Positive means the sample does have malicious behavior; negative means it does not. True means the predicted label and the actual label are the same, so false means the predicted label does not match the actual label. The definitions are:

• True positive (TP) = correctly identified
• False positive (FP) = incorrectly identified
• True negative (TN) = correctly rejected
• False negative (FN) = incorrectly rejected

The ROC is a curve in a two-dimensional coordinate system: the horizontal coordinate is the false positive rate (FPR) and the vertical coordinate is the true positive rate (TPR). TPR and FPR are defined from the confusion matrix, as in Eqs. (10) and (11):

TPR = TP / (TP + FN)    (10)

FPR = FP / (FP + TN)    (11)

A given classifier and test set yield only one (FPR, TPR) point, while a curve requires a series of them. Because the classifier outputs a probability value, different thresholds produce different numbers of predicted positive and negative samples, and hence different FPR and TPR values. Setting the threshold from high to low yields a series of FPR and TPR values that form the ROC curve.
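The threshold sweep described above is what sklearn.metrics.roc_curve performs; a brief sketch, assuming the true labels and the classifier's probability scores are available:

```python
from sklearn.metrics import auc, roc_curve

def roc_and_auc(y_true, y_score):
    """Sweep thresholds from high to low and compute FPR/TPR (Eqs. 10-11) and AUC."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)

fpr, tpr, area = roc_and_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(area)   # AUC for this toy example: 0.75
```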

6.2. Experiment data

The experiment samples come from four data sets: the normal samples are exe files from Windows 7 and Windows XP systems, and the malicious samples were randomly picked from the VirusShare [35] and VirusTotal [36] collections. The open-source Cuckoo sandbox is used as the dynamic analysis tool: it monitors the system's kernel APIs during sample execution, and each sample yields the corresponding API call sequence. In the experiment, the sandbox executes each sample for a fixed time of two minutes. The sample's label is derived from the detection results of the major security vendors obtained from VirusTotal.com; based on weighted probability statistics over the detection results, the threat level of the sample is calculated. Altogether there are 1430 malicious samples and 1352 normal samples.

6.3. Deep learning experiment

6.3.1. Experimental parameters

In the experiment, the samples were divided into a training set, a validation set and a test set, accounting for 80%, 10% and 10% of the total samples respectively. We use TensorFlow to train the Bi-Residual LSTM model. We only use two SA Blocks in our network because we are restricted by hardware conditions.

Table 6 shows the parameters of each layer of the model. The length of the API sequence is 1000. We ran the experiment several times to find the best sequence length, as Fig. 6(c) shows.

Table 6
Experiment parameters.

Indicator                  Input size               Layer parameters
Embedding layer            BatchSize x 1000 x 1     Vocab size: 400; embedding size: 128
First Bi-LSTM layer        BatchSize x 1000 x 128   Dropout rate: 0.5; hidden layer size: 128
SA Block layers - Block0   BatchSize x 1000 x 128   Dropout rate: 0.5; forward and backward LSTM size: 128; BN momentum: 0.99; BN epsilon: 0.001
SA Block layers - Block1   BatchSize x 1000 x 128   Dropout rate: 0.5; forward and backward LSTM size: 128; BN momentum: 0.99; BN epsilon: 0.001
Maxpooling                 BatchSize x 128 x 1000   Pooling size: 1000; pooling axis: 2; output size: BatchSize x 128
Fully connected layer      BatchSize x 256 (128 from maxpooling, 128 from the last output in the timestep direction)   Size: 256 x 1; bias: true
Optimizer: RMSProp         -                        Learning rate: 0.001; weight decay: 0.9

Fig. 6. Bi-LSTM experiment.

We did not test longer sequences because of hardware limitations. With our redundancy removal algorithm, a sequence length of 1000 is enough to show the behavior of a sample.

Another parameter, 128, is the size of the LSTM layer. Typically, in a deep learning model the layer size is a power of 2, such as 32, 64, 128 or 256. Our model is under-fitting when the size is 64 and well-fitting when the size is 128. Moreover, at the same size, an LSTM layer has many more parameters than a convolution layer. So we set the layer size to 128 and applied it to all LSTM layers. For other parameters, such as the dropout rate and the batch normalization layer's momentum, we chose the most commonly used values.

Table 7
The effect of sequence length on measuring time.

Sequence length    Average time per epoch    AUC
200                273 s                     0.9644
500                485 s                     0.9718
1000               852 s                     0.9839

Table 8
AUC of the different preprocessing methods.

Method                                     AUC
Redundant subsequence removal method       0.983
Continuous same mode API call removal      0.947
The original sequence                      0.930


6.3.2. Deep learning results

Different classification results were obtained with different preprocessing methods, as shown in Fig. 6.

• We study the influence of different N-gram sizes when using the redundant subsequence removal method described in Section 4.1.1. As Fig. 6a shows, the best result is obtained with 3-grams, with an AUC of 0.9539. Previous research also found that 3-grams give the best classification results [37].
• We study selecting different numbers of subsequences with the information gain method introduced in Section 4.1.1. As shown in Fig. 6b, keeping 1000 subsequences under information gain selection gives the best result, with an AUC of 0.964. A smaller number of subsequences means that those subsequences carry more of the information useful for classifying samples, which can increase classification accuracy when the length of the API call sequences fed into the model is fixed.
• We study the impact of the length of a sample's API call sequence. The sample API call sequences are preprocessed by fixing the N-gram size to 3 and the number of subsequences kept by information gain to 1000. The processed sample sequences are then cut to lengths 200, 500 and 1000 respectively to obtain the curves in Fig. 6c. The effect of the sequence length on measuring time is shown in Table 7. The experiment shows that longer sequences give better results but also longer measuring times. Taking both time consumption and accuracy into account, we choose 1000 as the sequence length.
• We compare the effect of different preprocessing methods. Fig. 6d shows the results of the redundant subsequence removal method and of the continuous same mode call removal method. The experiment shows that both preprocessing methods give better results than no preprocessing, and the best result is obtained with redundant subsequence removal.

Table 8 gives the AUC obtained with the different preprocessing methods; with the redundant subsequence removal method, the AUC is 98.3%. The AUC and ROC together show that the redundant subsequence removal method reduces the length of the call sequence and makes training faster and classification more accurate.

6.4. Machine learning experiment

6.4.1. Experimental parameters

The random forest implementation in Scikit-Learn [38] is used to classify the malicious samples, and 5-fold cross validation is used to test the performance of the classifier. The sample set of 2782 samples was randomly divided into 5 equal parts: one part as the test set, and the remaining 4 parts for 5-fold cross validation.
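A brief sketch of this protocol with scikit-learn, holding out one fifth as the test split and running 5-fold cross validation on the rest; the AUC scoring matches the evaluation indicators of Section 6.1, while the concrete parameter values are placeholders rather than the authors' settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate(X, y):
    # Hold out one fifth of the 2782 samples as the final test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                                 max_features="sqrt", n_jobs=-1)
    # 5-fold cross validation on the remaining samples, scored by AUC (Section 6.1)
    cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
    # Refit on the whole training part and score the held-out test split
    clf.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    return cv_auc.mean(), test_auc
```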

The experiment mainly studies the influence on classification of the number of decision trees in the random forest, the split evaluation method, the Bootstrap sampling option and the feature quantity.

The number of decision trees indicates how many trees are produced in the random forest. The best split method is the criterion used while constructing a decision tree to measure the relative merit of a split; the options are the Mean Squared Error and Mean Absolute Error methods. The Bootstrap sampling parameter indicates whether Bootstrap sampling is used during decision tree construction, and the feature quantity parameter specifies how many feature dimensions are considered when searching for the optimal split during decision tree construction.

6.4.2. Machine learning results

With the random forest algorithm, multiple parameters influence the result of the experiment. A series of comparative experiments was conducted to verify the influence of the different parameters on the AUC value. The time consumed by training the model is also considered as a performance index.

We study the effect of the number of decision trees on the overall performance of the random forest. In this experiment, we found that the AUC value increased as the number of decision trees increased, as shown in Fig. 7a, but the training time increased at the same time. The AUC gain gradually diminishes as the number of decision trees grows, so this parameter can be selected according to the experimental conditions and hardware scale.

The effect of the best split method on the AUC is studied with the mean squared error and the mean absolute error methods. As shown in Fig. 7b, the Mean Squared Error (MSE) performs better than the Mean Absolute Error (MAE) in our experiment. The formulas for the two methods are as follows, where f_i is the predicted result of the split, y_i is the actual result, and both f_i and y_i are less than 1. The reason is that the squaring in MSE reduces the influence of a feature's smaller errors on the result.

MSE = (1/n) Σ_{i=1}^{n} (f_i - y_i)^2    (12)

MAE = (1/n) Σ_{i=1}^{n} |f_i - y_i|    (13)

Fig. 7c shows the effect of whether Bootstrap sampling is used. In all cases, Bootstrap sampling has an impact on the AUC: with the Bootstrap method, the classifier is more efficient. The sampling is carried out while building the decision tree, and it can better estimate the overall hypothesis distribution and yield a better classifier.

The feature quantity for the best split is studied too. To find it, we conducted experiments comparing the effects of different numbers of features on the results, shown in Fig. 7d. As the total number of features is 22,043, the experimental results show that the training time is proportional to the number of features.

Finally, we study whether parameter optimization can improve the AUC and reduce the training time. The experimental results are shown in Table 9: parameter optimization can not only increase the overall AUC but also shorten the training time by 80% to 84%.


Fig. 7. Random forest experiment.

Table 9
Performance of parameter optimization.

Item                      | Optimization method                           | Before optimization            | After optimization
Number of decision trees  | Properly reduce the number of decision trees  | AUC: 0.9852; time: 3431 s      | AUC: 0.9840; time: 719 s
Feature number            | Properly reduce the number of features        | AUC: 0.9834; time: 716 s       | AUC: 0.9871; time: 115 s

Fig. 8. ROC curve of combined model.

6.5. Random forest & Bi-Residual LSTM combined model

After choosing the best parameters for each model, we achieved an AUC score of 0.987 and an accuracy score of 0.935 with the random forest model, and an AUC score of 0.993 and an accuracy score of 0.957 with the Bi-Residual LSTM model. We then proceed with the combination experiment. First, we calculate the probability that a sample is malware with each model. Second, we decide which model has higher confidence in its result by formula (14), where P_LSTM stands for the result probability of the LSTM model, P_RF stands for the result of the random forest model, P stands for the combined model's result, and C_RF and C_LSTM stand for each model's confidence.

C_{LSTM} = \min\left(\left|1 - P_{LSTM}\right|,\ P_{LSTM}\right)

C_{RF} = \min\left(\left|1 - P_{RF}\right|,\ P_{RF}\right)

P = \begin{cases} P_{LSTM}, & \text{if } C_{LSTM} \le C_{RF} \\ P_{RF}, & \text{if } C_{LSTM} > C_{RF} \end{cases} \qquad (14)
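A minimal sketch of this selection rule follows, assuming p_lstm and p_rf are the malware probabilities produced by the two models for one sample (the names are illustrative, not taken from the paper's code).

# Confidence-based combination of the two models, as in formula (14).
def combine(p_lstm: float, p_rf: float) -> float:
    # Distance of each probability from the nearest certain decision (0 or 1);
    # a smaller value means the model is more confident.
    c_lstm = min(abs(1.0 - p_lstm), p_lstm)
    c_rf = min(abs(1.0 - p_rf), p_rf)
    # Keep the prediction of the more confident model.
    return p_lstm if c_lstm <= c_rf else p_rf

# Example: the LSTM output 0.95 is closer to a certain decision, so it is kept.
print(combine(0.95, 0.60))  # -> 0.95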

The combined model's ROC curve is shown in Fig. 8. The combined model's AUC is 0.996, which is better than the results of the single-model experiments above.
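A sketch of how the combined AUC could be computed with scikit-learn is given below; the small arrays are illustrative stand-ins for the per-sample probabilities of the two models and the ground-truth labels, not the paper's data.

# Vectorized application of formula (14) followed by AUC scoring.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 0, 1])
p_lstm = np.array([0.95, 0.40, 0.70, 0.10, 0.55])
p_rf   = np.array([0.80, 0.05, 0.90, 0.30, 0.85])

c_lstm = np.minimum(np.abs(1 - p_lstm), p_lstm)   # per-sample LSTM confidence
c_rf   = np.minimum(np.abs(1 - p_rf), p_rf)       # per-sample RF confidence
p_comb = np.where(c_lstm <= c_rf, p_lstm, p_rf)   # pick the more confident model

print("combined AUC:", roc_auc_score(y_true, p_comb))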


Table 10
Accuracies on 5 test data set splits.

Test data split | Split 0 | Split 1 | Split 2 | Split 3 | Split 4
Accuracy        | 96.7%   | 96.7%   | 98.5%   | 96.7%   | 95.1%

Table 11
The accuracy of different algorithms.

Method                         | Year of the method | Platform | Accuracy
Confidence algorithm [30]      | 2012               | Windows  | 0.900
OPCODE                         | 2013               | Windows  | 0.929
KNN [39]                       | 2016               | Windows  | 0.90
Neural Networks [24]           | 2016               | Windows  | 0.856
CNN + LSTM [28]                | 2016               | Windows  | 0.894
MAMADROID [27]                 | 2016               | Android  | 0.87
Random Forest [29]             | 2017               | Android  | 0.953
MADAM [20]                     | 2018               | Android  | 0.96
RanDroid [21]                  | 2018               | Android  | 0.977
Combined model in this paper   | 2018               | Windows  | 0.967


We split the test data set into 5 splits, ran our algorithm on these splits several times, and recorded the accuracy of each split. The results, shown in Table 10, indicate that the results of our model are statistically stable. The average accuracy of the combined model of random forest and Bi-Residual LSTM is 96.7%.

Table 11 shows the accuracy (ACC) of different algorithms. [28] uses the same API function call merging preprocessing and a CNN + LSTM model; its ACC is 0.894. We compare our model with several malware detection models, and Table 11 shows the comparison result. The ACC of the methods that detect malware on the Android platform is mostly high, because their detection models can exploit many features besides API calls, such as APK permissions, user behavior, software packages, and native code. The experiment shows that the accuracy of the method in this paper is better than that of most other methods.

7. Conclusion

Based on dynamic analysis of malware, this paper proposes a combined learning framework based on machine learning and deep learning for malware detection. A fast analysis algorithm based on parameter hashing is introduced. On the deep learning side, redundant data preprocessing and a detection model based on recurrent neural networks are studied.

This is the first study to combine a machine learning model with a deep learning model to overcome the shortcomings of the deep learning model, so the combined model is more robust. Experiments show that the AUC of the combined learning model is 0.996 and its ACC is 0.967 on the test set, and that the method is able to detect unknown malicious samples.

In our study, the experimental data set is not large and still needs to be extended; in the future, we plan to test the model on a larger malware sample data set. Some model parameters remain adjustable and require more experiments to verify. We also intend to study different combination methods and compare their advantages and disadvantages.

At present, there is little research on combining deep learning models with machine learning models in the field of malware detection, and there is still much room for further work. We hope this paper draws more scholars' attention and participation to this field.

Acknowledgements

This research work was supported by the National Natural Science Foundation of China (Grant No. 61472046), the Opening Project of Key Lab of Information Network Security of Ministry of Public Security (The Third Research Institute of Ministry of Public Security) and Seed Funds of Beijing Association for Science and Technology, NSFOCUS Kunpeng Fund.

References

[1] H. Sun, X. Wang, R. Buyya, et al., CloudEyes: cloud-based malware detection with reversible sketch for resource-constrained internet of things (IoT) devices, Software Pract. Exp. 47 (3) (2017).
[2] McAfee, McAfee Labs 2017 Threats Predictions, https://www.mcafee.com/enterprise/zh-cn/security-awareness/threats-predictions-2017.html.
[3] P. Faruki, A. Bharmal, V. Laxmi, et al., Android security: a survey of issues, malware penetration, and defenses, IEEE Commun. Surv. Tut. 17 (2) (2017) 998–1022.
[4] Z. Chen, C. Ji, Efficient string matching: an aid to bibliographic search, Commun. ACM 18 (1975) 333–340.
[5] S. Wu, U. Manber, A Fast Algorithm for Multi-Pattern Searching, Technical Report TR-94-17, University of Arizona, 1994.
[6] AV-Comparatives, On-Demand Detection of Malicious Software, Technical Report, AV-Comparatives, Innsbruck, Austria, 2010. Available from www.av-comparatives.org.
[7] W. Yan, N. Ansari, Why anti-virus products slow down your machine?, in: Proceedings of IEEE International Conference on Computer Communications and Networks, San Francisco, CA, USA, 2009, pp. 1–6.
[8] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, 2015, pp. 11–20.
[9] I. Santos, F. Brezo, X. Ugarte-Pedrero, et al., Opcode sequences as representation of executables for data-mining-based unknown malware detection, Inf. Sci. 231 (9) (2013) 64–82.
[10] N. Kuzurin, A. Shokurov, N. Varnovsky, V. Zakharov, On the concept of software obfuscation in computer security, Lect. Notes Comput. Sci. 4779 (2007) 281.
[11] M. Karim, A. Walenstein, A. Lakhotia, L. Parida, Malware phylogeny generation using permutations of code, J. Comput. Virol. 1 (2005) 13–23.
[12] M. Christodorescu, S. Jha, Testing malware detectors, ACM SIGSOFT Software Engineering Notes 29 (2004) 34–44.
[13] X. Wang, J. Liu, X. Chen, Say No to Overfitting, University of Pittsburgh, 2015.
[14] Z.C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv preprint, 2015.
[15] A. Athalye, N. Carlini, D. Wagner, Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples, 2018.
[16] A.L. Yuille, C. Liu, Deep nets: what have they ever done for vision?, 2018.
[17] I. Santos, F. Brezo, X. Ugarte-Pedrero, et al., Opcode sequences as representation of executables for data-mining-based unknown malware detection, Inf. Sci. 231 (2013) 64–82.
[18] L. Yang, Employing the Algorithms of Random Forest and Neural Networks for the Detection and Analysis of Malicious Code of Android Applications, Beijing Jiaotong University, 2015.
[19] H. Yang, J. Xu, Android malware detection based on improved random forest, J. Commun. (04) (2017) 8–16.
[20] A. Saracino, D. Sgandurra, G. Dini, et al., MADAM: effective and efficient behavior-based android malware detection and prevention, IEEE Trans. Depend. Secure Comput. (99) (2018) 1–1.
[21] J.D. Koli, RanDroid: android malware detection using random machine learning classifiers, in: International Conference on Technologies for Smart-City Energy Security and Power (ICSESP), IEEE, 2018.
[22] B. Alsulami, A. Srinivasan, H. Dong, et al., Lightweight behavioral malware detection for windows platforms, in: International Conference on Malicious and Unwanted Software, IEEE, 2018, pp. 75–81.
[23] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, in: Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, IEEE, 2015, pp. 11–20.
[24] B. Kolosnjaji, A. Zarras, G. Webster, et al., Deep learning for classification of malware system call sequences, in: Australasian Joint Conference on Artificial Intelligence, Springer International Publishing, 2016, pp. 137–149.
[25] S. Tobiyama, Y. Yamaguchi, H. Shimada, et al., Malware detection with deep neural network using process behavior, in: Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, 2, IEEE, 2016, pp. 577–582.
[26] R. Pascanu, J.W. Stokes, H. Sanossian, et al., Malware classification with recurrent networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, 2015, pp. 1916–1920.
[27] L. Onwuzurike, E. Mariconti, P. Andriotis, et al., MaMaDroid: detecting android malware by building markov chains of behavioral models (extended version), 2016.


[28] P. Mishra, E.S. Pilli, V. Varadharajan, et al., VAED: VMI-assisted evasion detection approach for infrastructure as a service cloud, Concurrency Comput. Pract. Exp. 29 (1) (2017) 1–30.
[29] Z. Jiawang, L. Yanwei, Malware detection system implementation of Android application based on machine learning, Appl. Res. Comput. (06) (2017) 1–6.
[30] C. Ravi, R. Manoharan, Malware detection using windows API sequence and machine learning, Int. J. Comput. Appl. 43 (17) (2012) 12–16.
[31] G. Canfora, E. Medvet, F. Mercaldo, et al., Detecting android malware using sequences of system calls, in: International Workshop on Software Development Lifecycle for Mobile, ACM, 2015, pp. 13–20.
[32] T. Kim, B. Kang, M. Rho, et al., A multimodal deep learning method for android malware detection using various features, IEEE Trans. Inf. Forens. Secur. (2018) 1–1.
[33] TensorFlow, https://www.tensorflow.org/, 2018.
[34] J. Houvardas, E. Stamatatos, N-gram feature selection for authorship identification, in: International Conference on Artificial Intelligence: Methodology, Systems, and Applications, Springer, Berlin, Heidelberg, 2006, pp. 77–86.
[35] VirusShare, https://virusshare.com, 2018.
[36] VirusTotal, http://www.virustotal.com, 2018.
[37] Q. Huang, Malicious Executables Detection Based on N-Gram System Call Sequences, Harbin Institute of Technology, 2009.
[38] Scikit-Learn, http://scikit-learn.org/.
[39] G. Liao, J. Liu, A malicious code detection method based on data mining and machine learning, J. Inf. Secur. Res. (01) (2016) 74–79.

Xiaofeng Lu is an Associate Professor at the School of Cyberspace Security, Beijing University of Post and Telecommunications. He received his PhD degree from Beijing University of Aeronautics & Astronautics, Beijing, China, in 2010. During his PhD, he held visiting scholar positions at the Computer Laboratory, University of Cambridge, UK. His main research interests include cyberspace security, information security and cryptography.