Computer Networks 157 (2019) 99–111

Contents lists available at ScienceDirect

Computer Networks

journal homepage: www.elsevier.com/locate/comnet

ASSCA: API sequence and statistics features combined architecture for malware detection

Lu Xiaofeng a,∗, Jiang Fangshuo a, Zhou Xiao a, Yi Shengwei b, Sha Jing c, Pietro Lio d

a School of Cyberspace Security, Beijing University of Post and Telecommunications, China
b China Information Technology Security Evaluation Center, China
c The Third Research Institute of Ministry of Public Security, China
d Computer Laboratory, University of Cambridge, United Kingdom
Article info
Article history:
Received 13 August 2018
Revised 29 March 2019
Accepted 15 April 2019
Available online 18 April 2019
Keywords:
Computer virus and prevention
Malware classification
Machine learning
Deep learning
Call sequence
Abstract
In this paper, a new combined deep learning and machine learning model is proposed for malware behavior analysis. One part analyzes the dependency relations in the API (Application Programming Interface) call sequence at the functional level and extracts features for a random forest to learn and classify. The other part employs a bidirectional residual neural network, with redundant information removed in preprocessing, to study the API sequence and discover malware. In an API call sequence, future information is much more important for conjecturing the semantics of the current API call. We conducted experiments on a malware dataset. The results show that both methods can effectively detect malware, but the combined framework has better classification performance: the classification accuracy of the combined malware detection architecture is 0.967.
© 2019 Published by Elsevier B.V.
1. Introduction

With the rapid development of the Internet, malware has grown greatly in both categories and quantities, and its propagation modes have been constantly updated. Now, more and more targeted attacks are aimed at Internet of Things (IoT) devices. The Internet of Things encompasses hundreds or thousands of types of devices in every industry. According to [1], there will be more than 7 trillion IoT devices by 2025, with an estimate of about 1000 devices per person. Malware such as spam, privacy leaks, botnets, distributed denial-of-service and advanced persistent threats is still rampant in the IoT paradigm. Weak security and rookie mistakes by IoT device manufacturers will compound that problem. McAfee Labs indicates that 80% of IoT apps are not tested for vulnerabilities. Attacks on IoT devices will increase rapidly because of the hyper growth of IoT devices, weak security and the high value of the data on these devices [2]. Ransomware will be the most likely near-term threat, as it has proven to be a relatively easy way for criminals to make money. For example, ransomware on a smart television can encrypt the system files in the television and demand ransom money for the key to decode the encrypted files. We already see IoT devices being held for ransom in the power distribution and health care verticals.

To defend against various malware, antivirus programs and firewalls still play an important role and take up a large proportion of defenses. Such real-time defense approaches detect different types of malware ranging from viruses to worms. Commercial anti-malware solutions rely on a signature database to detect malware due to its efficiency and simplicity [3]. This approach is implemented by scanning and checking whether a file contains contents that match the known signatures. There are several commonly used and effective signature matching algorithms, such as Aho-Corasick [4] and Wu-Manber [5].

However, malware manufacturers are aware of the rules behind an antivirus program, so they can change the digital signature of a malware to make it look harmless. Therefore, antivirus programs need to be regularly updated by patching leaks associated with the latest malware types. As a result, these antivirus systems have become increasingly bloated with the development of malware attacks [6,7]. The heavy resource consumption, such as memory, CPU and database updating, caused by the growing number of signatures leads to low detection performance. So host-based antivirus programs do not suit resource-constrained IoT devices [1].

☆ This research work was supported by the National Natural Science Foundation of China (Grant No. 61472046), the Opening Project of Key Lab of Information Network Security of Ministry of Public Security (The Third Research Institute of Ministry of Public Security), Seed Funds of Beijing Association for Science and Technology, and NSFOCUS Kunpeng Funds. ∗ Corresponding author.
E-mail address: [email protected] (L. Xiaofeng).
https://doi.org/10.1016/j.comnet.2019.04.007
1389-1286/© 2019 Published by Elsevier B.V.
Traditionally, there are two major approaches to malware detection, depending on how the malware is analyzed: static analysis and dynamic analysis [8]. In static analysis, malware files are analyzed in binary form, or otherwise decompressed
and/or decompiled into assembly representations. In dynamic anal-
ysis, binary files are executed and the behavior of the program,
such as system calls, is recorded by hooking or making some in-
ternal access to the virtualized environment. In principle, dynamic
detection can achieve the faithful operation of malicious software
and is not easily confused [9] . Static analysis, on the other hand,
is vulnerable to obfuscation. Malware developers can increase the
difficulty of static analysis by code obfuscation techniques [10,11] .
The obfuscation algorithms include garbage insertion, which consists of adding sequences that do not modify the behavior of the program (e.g., nop instructions); code reordering, which changes the order of program instructions; and variable renaming, which replaces a variable identifier with another one [12].
In recent years, there have been more and more APT (Advanced Persistent Threat) attacks and advanced viruses targeting industrial control systems and the IoT. The great number of variants of these viruses has made detection based on fixed signatures more difficult; therefore, the identification of unknown malware has become a new challenge. Besides signature-based antivirus approaches, using machine learning techniques to categorize malware can be considered an alternative solution.
With the development of artificial intelligence technology, researchers have proposed using machine learning to detect and classify malicious samples, achieving a certain effect [13,14]. Deep learning, however, is oversensitive to data it has never seen. Recent work gives many examples of the oversensitivity of deep neural networks to context [15]. Deep learning models need to be combined with traditional machine learning models to achieve more robust, human-level intelligence [16]. In this
paper, we propose a dynamic analysis based malware detection
architecture, API Sequence and Statistics features Combined Archi-
tecture (ASSCA), which combines the machine learning technology
and deep learning technology. We study the API call sequence
feature by deep learning technology and study the API statistical
features by machine learning technology. The final classification
result of a suspicious software is determined according to the two
classifiers.
The contributions of this paper are as follows:
This paper proposes a system architecture that combines a deep learning model based on sequence data and a machine learning model based on API statistical features. The traditional machine learning model and the recurrent neural network model are integrated to obtain a better classifier.
This paper proposes a bidirectional residual LSTM (Long Short-Term Memory) model to study the API call sequence. Bidirectional
LSTM can use both the previous context and the later context. We add a residual connection into the bidirectional LSTM model, which greatly improves the training of deeper neural networks. To the best of our knowledge, this is the first paper to propose using a bidirectional residual neural network to address the malware problem.

This paper proposes a new API association analysis algorithm based on argument hashing, the AMHARA algorithm, and then studies how to optimize it.

This paper studies the problem of long sequences in recurrent neural networks, proposing a new method of removing redundant API call modes from the API call sequence.
2. Related work
Traditionally, there are two major approaches for malware detection: static analysis and dynamic analysis.
Static features are the main study objective of machine learning methods. The researcher decompiles the target software and then obtains its static features. Santos et al. [17] mainly used the static features of PE files in the Windows environment to analyze malware. Liu et al. [18] decompiled an APK to get the sensitive APIs, strings, application permissions and certificates requested by the APK to detect malware. Yang et al. [19] utilized APK permissions, application actions and application classification as input features and proposed an improved random forest algorithm.

Some researchers extract features from different levels or aspects. It is noted that classification methods based on combined features improve the recognition accuracy. MADAM [20] monitors Android systems at four levels (kernels, applications, users, and software packages) and retrieves five sets of functions. RanDroid [21] extracts requested permissions and vulnerable API calls, along with the existence of key information such as dynamic code, reflection code, native code, cryptographic code and database use, from applications and uses them as features in various machine learning classifiers.

These static-feature based approaches have good performance, but they can hardly capture dynamic behavior, and some malware only performs malicious behavior while running. Static analysis, moreover, is susceptible to obfuscation. Malware developers can increase the difficulty of static analysis by code obfuscation techniques [10,11].

In recent years, many researchers have utilized neural network methods to classify malicious samples on large-scale data sets [22,23]. Alsulami et al. [22] use Microsoft Windows prefetch files to detect malware. Saxe et al. [23] used a feed-forward neural network to classify static analysis results. After the popularization of the feed-forward neural network, deep learning models such as CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), with their improved versions, have become the major concern of malicious sample detection.

Some researchers have improved detection by adding other features to the system call sequence [24–26]. Bojan et al. [24] proposed using N-grams to extract dynamic system call sequence features; the extracted time series features were then input into an RNN model for classification. Shun et al. [25] proposed to extract the API sequence and API return values as the feature sequence input into the RNN model. The feature sequence is input into the RNN model to extract high-level features, and then the upper-level feature vector is input into a CNN model for classification. Microsoft researcher Jack Stokes et al. [26] input the API call sequences into RNN and ESN (Echo State Network) models. They used a recurrent model trained to predict the next API call, and used the hidden state of the model (which encodes the history of past events) as the fixed-length feature vector given to a separate classifier. Their raw data stream consists of 114 high-level events generated by the anti-malware engine, which encode all of the low-level APIs.

The above-mentioned deep learning methods are effective for malicious sample detection. Utilizing deep learning to classify malware does not require much prior knowledge and its procedure is simple, but the extracted features do not have a clear meaning. Moreover, the network architectures of the above papers are usually simple, not large enough to be capable of detecting most unknown malware. The deep neural network of our study is a bidirectional LSTM that has two flows, unlike a normal LSTM: one flow runs from the beginning of the input to the end, and the other from the end of the input to the beginning. Both flows have their own memory and can improve the detection performance. The structure of our deep neural networks is different from theirs.

Some researchers have discovered the value of system call pairs and used system call pairs for malware detection.
Fig. 1. Malware classification system.
MAMADROID [27] relies on the sequence of abstracted API calls performed by an app. It builds a statistical model to represent the transitions between the API calls performed by an app; specifically, these transitions are modeled as Markov chains to perform classification. VAED [28] keeps track of the system calls to calculate the transition probability distribution of each pair of system calls. However, as the total number of samples in their study is less than 480, their method has to be validated on a bigger dataset.

In our study, we believe that the calling relationship between system calls can reflect the true behavior of a function in program execution. However, we propose another method to extract the system call pairs, and we use the system call pair frequency instead of the system call transition probability as the features.

Numerous researchers have utilized N-grams to process opcode sequences or system calls [29–31]. Zhang et al. [29] applied the N-gram model to process opcodes and classified malware by features such as permissions. Ravi et al. [30] utilized N-grams to process the opcode sequences, and used decision trees, KNN (K-Nearest Neighbor), Naive Bayes and SVM (Support Vector Machine) to classify malware. Canfora et al. [31] used 3-grams to process the sequences of system calls to detect malware.

In the literature on dynamic analysis, the N-gram method is used to process long sequences and obtain feature vectors. The N-gram method has certain limitations: inserting some unrelated APIs into malicious code can reduce its effect. In our study, we utilize a variety of methods to shorten the length of the system call sequence. We use the N-gram method to remove API subsequences that have little effect on the classification from the system call sequence.

Beginning in 2018, researchers attempted to combine feature learning with deep learning. Alan Yuille pointed out that combining trained models is the key to cracking deep learning defects [16]. Guen [32] studied combining multiple models for Android malware detection. Guen built several separate deep learning models for features, and then combined the multiple deep learning models. In contrast, we build machine learning models for features and deep learning models for sequences, and then combine the machine learning models with the deep learning models. This combined model is more robust because of the structural knowledge in the machine learning model [16].

3. System architecture

3.1. System framework

There are two kinds of signature-based antivirus approaches according to their infrastructures. One is host-based antivirus programs, which install a detection agent in the users' devices and update the signature database to ensure timely and complete security protection. The other is cloud-based antivirus systems [15], which place different types of detection agents on cloud servers. A user can upload any type of file and receive a report about the malware that might be contained in the file [16–18]. This newly developed framework is cost-saving for resource-constrained devices.

The high volume and sometimes resource-constrained nature of IoT devices make it impossible to manage and secure them the way we secure traditional IT systems. For this reason, centralized control systems will be developed to manage and secure IoT devices automatically and in aggregate [2]. In our study, we build a server-based centralized antivirus system. The antivirus server captures files in IoT networks and detects whether the files contain malware. A user can also upload any type of file to the antivirus server and receive a report about the malware that might be contained in the file.
3.2. Malware detection based on the combined learning model

The structure of the combined learning model ASSCA is shown in Fig. 1, which also shows the malicious sample classification process. The first step is to get samples; the samples are then executed in the Cuckoo sandbox, which can extract all the system calls. The antivirus server combines the deep learning model based on API sequence data and the machine learning model based on API statistical features to categorize the malware. The deep learning model is the bidirectional residual LSTM model, and the machine learning model is a random forest model.

The lengths of the API call sequences extracted by the sandbox are quite different, and most of the API sequences are very long. Overly long sequences cannot be input into the deep learning model, so in this paper several algorithms for reducing the API call sequence length are studied as well. Then, the preprocessed API sequences are input into the deep learning model. At the same time, we propose a new API call association analysis method to extract features from the API call sequences. The TensorFlow [33] framework is used to build and test the LSTM model.

4. Deep learning method

We use the bidirectional residual LSTM model to classify malware.

4.1. Sequence data preprocessing

The data extracted from the sandbox is the API call sequence. Each API function is represented by a specific integer for the subsequent data processing and model classification, and different API functions map to different integers.

4.1.1. Redundant subsequence removal method

(1) N-Gram sample subsequence extraction

The length of the API call sequence in each sample varies greatly, ranging from 1 to 1,000,000. There are a large number of common API subsequences in the API sequences of both benign software and malware [17,18]. These subsequences have little effect on identifying malware. The removal of those subsequences with a small effect on the classification can not only accelerate the procedure but also increase the accuracy. In this paper, the text-based N-gram method is used to extract the subsequences. The basic idea of the N-Gram algorithm in malware detection is to carry out a sliding window operation with a window of size N over the API stream. After the N-gram process, a sample forms a list of subsequences of length N, which are the vector space of the text content. For example, an API sequence is {24, 26, 18, 18, 6, 13}. With N being 4 and
Table 1
4-Gram subsequence connection.

4-Gram subsequence 1    4-Gram subsequence 2    Connection
S1 S2 S3 S4             S2 S3 S4 S5             S1 S2 S3 S4 S5
S1 S2 S2 S2             S2 S2 S2 S2             S1 S2 S2 S2 S2

Table 2
Continuous same API mode removal.

Continuous same mode API       After removal
S1 S1 S1 S1 S1 S1 S1 S1 S1     S1 S1
S1 S2 S4 S3 S1 S2 S4 S3        S1 S2 S4 S3
the sliding window moves with step 1, the 4-gram subsequences
list is {(24, 26, 18, 18), (26,18,18,6), (18,18,6,13)}.
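As a minimal illustration of this sliding-window extraction (our own sketch; the paper gives no code for this step), the 4-gram list from the example above can be produced as follows:

```python
def ngram_subsequences(api_seq, n=4):
    """Slide a window of size n over the API call sequence with step 1."""
    return [tuple(api_seq[i:i + n]) for i in range(len(api_seq) - n + 1)]

# The example from the text: {24, 26, 18, 18, 6, 13} with N = 4
subseqs = ngram_subsequences([24, 26, 18, 18, 6, 13], n=4)
# → [(24, 26, 18, 18), (26, 18, 18, 6), (18, 18, 6, 13)]
```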
(2) Removing redundant subsequences by information gain

The dimension of the subsequence lists extracted by the N-Gram method is very large, so it requires reduction. Only those subsequences that have a large effect on the classification of malware should be retained. In feature engineering, information gain is a simple and efficient method, and it has proved to be very effective in handling N-grams [34]. In this paper, information gain is used to select the key subsequences.
For a single feature, the information entropy of the system changes depending on whether the system has it or abandons it, and the difference between the two information entropies is the information gain brought by this feature to the system. The amount of information is measured by entropy. Let C be a category. H(C) is the information entropy of category C and can be calculated by Eq. (1):

H(C) = -\sum_{i=1}^{n} p(c_i) \log p(c_i)    (1)

where p(c_i) is the proportion of samples of category c_i.

The information gain brought by the subsequence T to class C is:

IG(T) = H(C) - H(C|T) = \sum_{V_T \in \{0,1\}} \sum_{C \in \{C_i\}} P(V_T, C) \log \frac{P(V_T, C)}{P(V_T) P(C)}    (2)

In Eq. (2), V_T is the value of subsequence T: V_T = 1 when T appears in a sample, otherwise V_T = 0. P(V_T, C) is the ratio of subsequence T in category C, P(V_T) is the ratio of subsequence T in all the samples, and P(C) is the ratio of category C in all the samples.

A subsequence with higher information gain indicates that the frequencies of the subsequence in the two sample sets are quite different; hence it can reflect the differences between normal samples and malicious samples, providing better discrimination.
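A direct way to compute IG(T) is via the entropy difference H(C) − H(C|T). The sketch below is our own illustration (function names are assumptions, not the paper's code); it treats each sample as the set of subsequences it contains:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(samples, labels, subseq):
    """IG(T) = H(C) - H(C|T), where T is the presence of `subseq` in a sample.
    samples: one set of subsequences per sample; labels: class per sample."""
    n = len(samples)
    classes = set(labels)
    h_c = entropy([labels.count(c) / n for c in classes])
    h_c_given_t = 0.0
    for present in (True, False):
        group = [lab for s, lab in zip(samples, labels) if (subseq in s) == present]
        if group:
            h_c_given_t += (len(group) / n) * entropy(
                [group.count(c) / len(group) for c in classes])
    return h_c - h_c_given_t
```

For instance, a subsequence that appears in every malicious sample and in no benign sample yields the maximal gain of 1 bit, while a subsequence absent from all samples yields 0.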
(3) Sample valid subsequences connection
The list of subsequences for each sample obtained by N-Gram
is taken as a data set. The feature subsequences selected by the
information gain method are called valid subsequences. The set
of valid subsequences is called a dictionary. The 4-gram subse-
quences included in the dictionary are extracted from all 4-Gram
subsequences and are arranged in the chronological order. Then,
these subsequences are connected to get the sample’s valid call se-
quence. The method of connecting the subsequences is as shown
in Table 1 . The connected call sequence removes the subsequences
with little effect on the classification, greatly reducing the length
of the sample API call sequence. Because of this, the model training
and detection are faster and more accurate than before.
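The connection step of Table 1 can be sketched as a longest suffix–prefix merge of the chronologically ordered valid subsequences. This is our simplified illustration; the paper does not give pseudocode for this step:

```python
def connect_subsequences(valid_grams):
    """Connect chronologically ordered valid n-grams: when a suffix of the
    sequence built so far equals a prefix of the next gram, merge the
    overlap (cf. Table 1); otherwise simply append the gram."""
    if not valid_grams:
        return []
    seq = list(valid_grams[0])
    for gram in valid_grams[1:]:
        merged = False
        # look for the longest suffix of seq that is a prefix of gram
        for k in range(min(len(seq), len(gram)), 0, -1):
            if seq[-k:] == list(gram[:k]):
                seq.extend(gram[k:])
                merged = True
                break
        if not merged:
            seq.extend(gram)
    return seq
```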
4.1.2. Continuous same API mode removal

During the analysis of the API call sequence, we found that the same API is often called multiple times, which indicates information redundancy. In order to reduce the length of the numerical sequence entered into the model while keeping the information entropy of the system calls, Tobiyama [25] proposed removing consecutive identical API functions. This paper improves this approach. We propose to remove the continuous same API modes in preprocessing, as shown in Table 2. An API mode is an API call subsequence. For a single API that appears more than twice in a row, we keep only two occurrences to indicate that this call appears multiple times. For an API mode that appears more than once in a row, we keep only one occurrence, because one pattern is enough to represent that the corresponding behavior exists.
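A simplified sketch of both rules follows; the function names and the greedy bounded pattern search are our own choices, not the paper's implementation:

```python
def collapse_runs(seq, keep=2):
    """Keep at most `keep` consecutive copies of any single API call."""
    out = []
    for x in seq:
        if len(out) >= keep and all(y == x for y in out[-keep:]):
            continue  # already kept `keep` copies of this run
        out.append(x)
    return out

def remove_repeated_modes(seq, max_len=8):
    """Greedily drop an immediately repeated API mode (subsequence of
    length >= 2), keeping a single copy -- one occurrence is enough to
    represent that the behavior exists."""
    out = []
    for x in seq:
        out.append(x)
        for k in range(2, min(max_len, len(out) // 2) + 1):
            if out[-k:] == out[-2 * k:-k]:
                del out[-k:]  # the mode just repeated; keep one copy
                break
    return out
```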
4.2. Deep learning algorithm model

4.2.1. Bidirectional LSTM model

Each hidden layer output of the recurrent neural network (RNN) model is retained in the network and, together with the input of the system, determines the output at the next moment. This recycling underlies the RNN's great success in processing time series, as in machine translation. The LSTM unit made a great improvement on long-term dependency compared to the basic RNN unit. It overcomes gradient diffusion and gradient explosion during back propagation, and plays a better role in dealing with long sequence data.

The LSTM model is used in the experiment. As the number of LSTM layers increases, the extracted features become more abstract and the expressive ability grows stronger, but more time is consumed. Taking these factors into account, this paper studies a two-hidden-layer LSTM structure.

The reason why an RNN performs well on sequence learning is that it makes use of the previous context in the sequence. With a bidirectional RNN, we can use not only the previous context but also the future context. Especially in an API call sequence, future information is much more important for conjecturing the semantics of the current API call. For example, an encryption call is a normal behavior, but if it follows an API which deletes a file, then the behavior of this program is malicious. With many such behaviors, we can be sure that this program is ransomware.

A bidirectional LSTM has two flows, unlike a normal LSTM. One flow is forward, from the beginning of the input to the end. The other is backward, from the end of the input to the beginning. Both flows have their own memory, as shown in Fig. 2.

The input into the model is a time sequence X = {x_0, x_1, ..., x_n}. x_t is the input at a moment, which is a system call function of a sample obtained by preprocessing, represented by a one-hot vector. W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o are the weight matrices, and b_i, b_f, b_c, b_o are the bias vectors.
First, calculate the input gate, the candidate memory state and the forget gate at moment t:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
C'_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    (3)

Second, calculate the memory cell state, as well as the output gate and the hidden layer output:

C_t = i_t * C'_t + f_t * C_{t-1}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t * \tanh(C_t)    (4)

The hidden layer output h_t contains the feature information about the call sequence, and this feature information is input into the classification model for malicious sample classification.

After both directional processes, we calculate the elementwise sum of the outputs of each direction. The following Eq. (5) shows how
L. Xiaofeng, J. Fangshuo and Z. Xiao et al. / Computer Networks 157 (2019) 99–111 103
Fig. 2. Bidirectional Residual LSTM classification model structure.
it works. The arrow over a symbol indicates whether the corresponding part belongs to the forward flow or the backward flow:

\overrightarrow{y}_n = \overrightarrow{lstm}_n(\overrightarrow{y}_{n-1})
\overleftarrow{y}_n = \overleftarrow{lstm}_n(\overleftarrow{y}_{n-1})
y_n = \overrightarrow{y}_n + \overleftarrow{y}_n    (5)
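For concreteness, one step of the LSTM cell in Eqs. (3) and (4) can be sketched in NumPy as follows. The dictionary-of-weights layout is an illustrative assumption of ours, not the paper's code:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs. (3)-(4). W, U, b map each gate name
    ('i', 'f', 'c', 'o') to its weight matrix / bias vector."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    c_cand = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])   # candidate state C'_t
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    c_t = i_t * c_cand + f_t * c_prev                           # new cell state C_t
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # hidden output h_t
    return h_t, c_t
```

In the bidirectional model, one such cell is run over the sequence forward and a second cell backward, and the two hidden outputs are summed elementwise as in Eq. (5).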
4.2.2. Residual connection

Besides the bidirectional technique, we also use residual connections in our network model. After improving the basic LSTM to a bidirectional LSTM, we need to build more layers for better sequence modeling. Residual connections make a great improvement in training deeper neural networks. They overcome the problem that a
Fig. 3. AMHARA Algorithm sketch.
deeper network has worse performance than a simpler network, so we can build a much deeper network to represent a more complex model.
We build the network as shown in Fig. 2. The basic structure we build is called the SA (Sequence Analysis) Block. The SA Block contains a bidirectional LSTM layer with an extra residual connection which forwards the input to the output of the LSTM layer. The bidirectional LSTM with an extra residual connection is defined as Bi-Residual LSTM. So we sum the forward flow's output, the backward flow's output and the residual connection's output, as shown in Eq. (6). The last layer is a tanh activation function, which improves the network's nonlinearity. With SA Blocks, we can build a much deeper network without worrying about gradient vanishing or underfitting.

y_n = \tanh(\overrightarrow{y}_n + \overleftarrow{y}_n + y_{n-1})    (6)
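A minimal sketch of the SA Block combination in Eq. (6), assuming the forward output, backward output and residual input have already been computed as same-shape arrays (our own illustration):

```python
import numpy as np

def sa_block_output(forward_out, backward_out, block_input):
    """SA Block output per Eq. (6): elementwise sum of the forward flow,
    the backward flow and the residual connection, passed through tanh."""
    return np.tanh(forward_out + backward_out + block_input)
```

Because the residual term is a plain addition, gradients can flow through a stack of such blocks without vanishing, which is what allows the deeper network.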
4.2.3. Max-Pooling
It is important to process the hidden layer feature information and obtain abstract features that represent the sequence information. As a non-linear subsampling layer, Max-Pooling has achieved good results in convolutional neural network models in the computer vision field. The subsampling layer helps prevent overfitting and decreases the dimension of the feature vector in a CNN. Similar to its goal in computer vision, Max-Pooling is used here to obtain eigenvectors of fixed length for the classifier and to reduce the dimension without losing the hidden layer information.
Max-Pooling obtains the eigenvector as follows:

h_{max}(i) = \max(h_0(i), h_1(i), ..., h_n(i))    (7)

where i \in (0, 1, 2, ..., N-1) and N is the number of neurons in each hidden layer. Finally, the eigenvector input into the classification model is the concatenation of the Max-Pooling output and the last state vector of the hidden layer, [h_{max}, h_n].
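The fixed-length feature construction, Eq. (7) followed by concatenation with the last hidden state, can be sketched as follows (our own illustration):

```python
import numpy as np

def pooled_feature(hidden_states):
    """hidden_states: array of shape (T, N), one N-dimensional hidden
    vector per time step. Returns [h_max, h_n]: the elementwise max over
    time (Eq. (7)) concatenated with the last hidden state, giving a
    fixed-length feature vector regardless of sequence length T."""
    h_max = hidden_states.max(axis=0)
    h_last = hidden_states[-1]
    return np.concatenate([h_max, h_last])
```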
4.2.4. Classification
The feature vector obtained from Max-Pooling is finally entered into the classifier. Because malware classification is a binary classification problem, the logistic function is chosen as the classifier. Finally, the classifier includes a fully connected layer and a logistic layer.
5. Machine learning method
Deep learning methods have shown better accuracy in some specific areas in recent years, but traditional machine learning is still more explainable and has lower hardware requirements. This paper proposes a method to combine the deep learning approach and the traditional machine learning approach to get a better result in Section 6.5.
5.1. API call association analysis
With discrete API call sequences, the relationship between API calls is an important potential feature for statistics-based traditional machine learning. Unlike recurrent neural networks, algorithms such as SVM or decision trees cannot take advantage of the call relationships in a sequence as features. An abstract association graph, which transforms discrete API calls into a DAG (Directed Acyclic Graph) via their call relationships, is a good way to turn the call relationships into a feature for a machine learning algorithm.
However, this method leads to a large number of different association graphs because of the complexity of computer programs. One solution is to divide the association graphs into subgraphs of fewer kinds. But in practice there is often no clear boundary between the graphs, and the time complexity of searching subgraphs is high.
So, for the above problems, we found a way to simplify the association call graph. For a complex call graph, a pair (s, t) is used to represent two associated calls s and t, which simplifies an association graph into a set of pairs. The basic idea is to construct the associated graph and then find the associated call pairs. However, this method would take a lot of time.

This paper proposes a quick API association analysis algorithm based on argument hashing (ArguMent-Hashing based API coRrelation fast Analysis, the AMHARA algorithm). The algorithm traverses the whole API call sequence. For the current API call s_i, it reversely traverses the preceding T API calls. The arguments of s_i and s_j are compared in the reverse traversal to a call s_j (i - T < j < i), as Fig. 3 shows. API calls s_i and s_j have the argument lists [s_i^0, s_i^1, s_i^2, ..., s_i^m] and [s_j^0, s_j^1, s_j^2, ..., s_j^n]. The algorithm enumerates the arguments in each argument list and compares them one by one. In the following formula, the Compare function returns 1 when the two input arguments are exactly the same, and 0 when they differ. Thus, if S_{i,j} > 0, then s_i and s_j are associated APIs. The associated call pairs can be a feature of the sample:

S_{i,j} = \sum_{k=0}^{m} \sum_{l=0}^{n} Compare(s_i^k, s_j^l)    (8)
In this algorithm, how to choose T is a hard problem. If T is too small, much association information will be lost. If T is too large, the algorithm will cost a lot of time. We can use the locality principle of program execution to determine T: to implement a function, related APIs will usually appear close to one another. Even considering malware obfuscation techniques, the associated APIs will not be far apart in the API sequence. We chose 100 as the value of T based on our experiments and hardware.
When the arguments are compared, some arguments are large, such as a
buffer. If the average length of the arguments is k, then the time
complexity of this algorithm is O(mnk). This paper optimizes the
argument comparison as follows: first, it hashes the arguments item by
item, and then it compares the argument hash values directly. The
upper bound of the time complexity is then O(mn), as in the following
formula:
S_{i,j} = \sum_{k=0}^{m} \sum_{l=0}^{n} Compare(hash(s_i^k), hash(s_j^l))    (9)
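To illustrate Eq. (9), the following sketch extracts associated call pairs with pre-hashed arguments inside a reverse-traversal window T. The input format and the set-based comparison shortcut are our own assumptions, not the authors' exact implementation; self-call pairs are skipped, as the paper describes later.

```python
from collections import Counter

def amhara_pairs(calls, T=100):
    """Extract associated API call pairs via argument-hash comparison.

    calls: list of (api_name, [argument, ...]) tuples in execution order.
    T: reverse-traversal window (the paper chose 100).
    Returns a Counter mapping (earlier_api, later_api) -> occurrence count.
    """
    # Pre-hash every argument once so each later comparison is O(1).
    hashed = [(name, {hash(a) for a in args}) for name, args in calls]
    pairs = Counter()
    for i, (name_i, h_i) in enumerate(hashed):
        for j in range(max(0, i - T), i):
            name_j, h_j = hashed[j]
            if name_i == name_j:
                continue  # self-call relationships are filtered out
            if h_i & h_j:  # S_{i,j} > 0: at least one shared argument
                pairs[(name_j, name_i)] += 1
    return pairs
```

A usage example: a create/write/close sequence sharing a handle argument yields pairs between all three calls.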
After the algorithm executes, we get a collection of associated
call pairs that not only records the associated call pairs existing in
Fig. 4. Pseudocode of AMHARA Algorithm.
Fig. 5. “WannaCry” run shot.
Table 3
‘WannaCry’ API calls lists (Top 10).
API species API number
NtWriteFile 19,622
NtClose 16,819
NtCreateFile 10,486
DeleteFile 7226
NtQueryKey 7076
GetSystemMetrics 6861
NtReadFile 6670
GetFileAttributes 6486
MoveFileWithProgress 6394
RegQueryValueExW 6029
…… ……
Total 150,079
the sequence, but also records the number of each associated call
pair.
The algorithm shown in Fig. 4 effectively reduces the number of
features, contracting the feature space to an acceptable range, with a
large speed improvement over the original method. For example, for a
sequence drawn from 400 species of API calls, the number of call
graphs used as features exceeds one million with the association graph
method. By simplifying the features to call pairs, this paper reduces
the number of extracted features to less than 20,000 (400 * 400). At
the same time, the algorithm improves processing efficiency and avoids
the complicated processing of the association graph.
This algorithm also avoids a failure mode of N-gram sequence
processing: missing the relationship between API calls that are far
apart in the sequence but essentially related. It makes the extracted
features more representative of the sample's true intent and prevents
malicious samples from cheating the detector.
5.2. API call frequency analysis
In the classification process, the statistics of individual API
calls are added as additional features, which improves the
classification accuracy with only a small number of extra features. In
some malicious samples, the count of a single API call can by itself
determine whether the sample is malicious. For instance, in May 2017,
the popular "WannaCry" ransomware shown in Fig. 5 encrypted files on
the disk and deleted the original files, and it caused an estimated $1
billion in damage in just its first four days around the world. From
the "NtCreateFile" and "DeleteFile" API calls alone, we can determine
whether a sample shows similar ransomware behavior. Table 3 lists the
150,079 API calls made during "WannaCry" execution in the sandbox,
where "NtCreateFile" accounts for 7.0% and "DeleteFile" accounts for
4.8%. In contrast, when a software writes a big file there will be
many NtWriteFile system calls, so NtWriteFile cannot be used as a
feature of malware.
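As a small illustration (our own sketch, with a hypothetical call sequence), the per-API statistics used as extra features can be computed with a counter:

```python
from collections import Counter

# Hypothetical per-sample API call sequence (names only).
sequence = ["NtCreateFile", "NtWriteFile", "NtWriteFile",
            "DeleteFile", "NtCreateFile"]

counts = Counter(sequence)          # absolute count of each API call
total = len(sequence)
# Relative frequency of each API call within the sample.
freq = {api: n / total for api, n in counts.items()}
```

For "WannaCry", the analogous frequencies of NtCreateFile and DeleteFile (7.0% and 4.8% of 150,079 calls) are the kind of values these features would carry.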
Table 4
API pair matrix.
Table 5
Confusion matrix.
Actual Label
Positive Negative
Predicted Label Positive’ True Positive (TP) False Positive (FP)
Negative’ False Negative (FN) True Negative (TN)
From another point of view, we can treat the API call pairs as a
matrix, as shown in the left part of Table 4. In the API call pair
analysis, we filtered out the self-call relationship, because a
repeating API call should not be counted as a call relationship. To
compensate for the features lost by filtering out these call pairs, we
can make the API pair matrix a full matrix, as shown in the right part
of Table 4, by adding the statistics of individual API calls.
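A minimal sketch of building such a full matrix, assuming the pair counts and individual counts have already been extracted (names and layout here are illustrative, not the authors' code): off-diagonal cells hold associated-pair counts, and the diagonal cells, which would otherwise stay empty after self-call filtering, hold individual call counts.

```python
import numpy as np

def pair_matrix(apis, pair_counts, single_counts):
    """Build a full API-pair feature matrix in the spirit of Table 4.

    apis: ordered list of API names defining the row/column indices.
    pair_counts: dict (api_a, api_b) -> associated-pair count.
    single_counts: dict api -> individual call count (fills the
    diagonal left empty by self-call filtering).
    """
    idx = {a: k for k, a in enumerate(apis)}
    m = np.zeros((len(apis), len(apis)), dtype=np.int64)
    for (a, b), n in pair_counts.items():
        m[idx[a], idx[b]] = n          # off-diagonal: pair counts
    for a, n in single_counts.items():
        m[idx[a], idx[a]] = n          # diagonal: individual counts
    return m
```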
5.3. Random forest classifier
As a kind of statistical classification model, random forest performs
well in all aspects. Random feature selection ensures a smaller
generalization error, which is quite advantageous for small data sets.
High-dimensional data can be processed without manual feature
selection, and it is easy to carry out large-scale parallel training
on a distributed computing platform. So, we chose the random forest as
the classifier, and it indeed performed better in our experiments.
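A hedged sketch of such a classifier with scikit-learn, using toy data in place of the API-pair features. Note that in scikit-learn the MSE/MAE split criteria discussed later apply to regression trees; for classification a standard criterion is shown here, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the API-pair feature vectors; label 1 = malicious.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees
    criterion="gini",     # split-quality measure for classification
    bootstrap=True,       # Bootstrap sampling per tree
    max_features="sqrt",  # feature quantity examined per split
    random_state=0,
)
# 5-fold cross validation, as used in the paper's experiments.
scores = cross_val_score(clf, X, y, cv=5)
```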
6. Experiments
6.1. Evaluation indicators
This paper uses Google's TensorFlow [33] framework to implement
the algorithm. For the classification algorithm, the evaluation
indicators include accuracy (ACC), the Precision-Recall Curve (PRC),
the Receiver Operating Characteristic curve (ROC) and the Area Under
the ROC Curve (AUC). The ROC curve has an advantage over the PRC: it
is essentially unchanged when the distribution of positive and
negative samples in the test set changes. The ROC curve carries much
information, but it does not directly show which classifier performs
better; the AUC summarizes it numerically, and a higher AUC value
indicates a better classifier. Therefore, for the binary
classification algorithm, ROC and AUC are selected as evaluation
indexes.
Table 5 shows the confusion matrix, which divides instances into a
positive class and a negative class. Positive means the sample does
have malicious behavior; negative means it does not. True means the
predicted label matches the actual label, and false means it does not.
The definitions are:
• True positive (TP) = correctly identified
• False positive (FP) = incorrectly identified
• True negative (TN) = correctly rejected
• False negative (FN) = incorrectly rejected
ROC is a curve in a two-dimensional coordinate system; the horizontal
coordinate is the false positive rate (FPR) and the vertical
coordinate is the true positive rate (TPR). TPR and FPR are defined
from the confusion matrix, as in Eqs. (10) and (11):

TPR = TP / (TP + FN)    (10)

FPR = FP / (FP + TN)    (11)
A given classifier and test set yield only one (FPR, TPR) point, while
a curve requires a series of them. Because the output of the
classifier is a probability value, setting different thresholds gives
different numbers of predicted positive and negative samples, and
hence different FPR and TPR values. Sweeping the threshold from high
to low, a series of FPR and TPR values are obtained to form the ROC
curve.
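The threshold sweep can be sketched as follows (our own illustration, not the paper's code): for each distinct score used as a threshold, count TP and FP and emit one (FPR, TPR) point per Eqs. (10) and (11).

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold from high to low, collecting (FPR, TPR)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    P, N = labels.sum(), (1 - labels).sum()   # positives, negatives
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = scores >= t                    # predicted positive at threshold t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        points.append((fp / N, tp / P))       # (FPR, TPR)
    return points
```

Plotting these points (and interpolating to (0, 0)) gives the ROC curve; the AUC is the area under it.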
6.2. Experiment data
The experiment samples come from four data sets: the normal
samples are exe files from Windows 7 and Windows XP systems, and the
malicious samples were randomly picked from the VirusShare [35] and
VirusTotal [36] collections. Using the open-source Cuckoo sandbox as a
dynamic analysis tool, we monitor the system's kernel API during
sample execution, and each sample yields a corresponding API call
sequence. In the experiment, the sandbox executes each sample for a
fixed time of two minutes. Each sample's label is derived from the
detection results of the major security vendors obtained from
VirusTotal.com; based on weighted probability statistics over the
detection results, the threat level of the sample is calculated.
Altogether there are 1430 malicious samples and 1352 normal samples.
6.3. Deep learning experiment
6.3.1. Experimental parameters
In the experiment, the samples were divided into the training set,
validation set and test set, accounting for 80%, 10% and 10% of the
total samples, respectively. We use TensorFlow to train the
Bi-Residual LSTM model. We only use two SA blocks in our network
because we are restricted by hardware conditions.
Table 6 shows the parameters in each layer of the model. The
length of the API sequence is 1000. We ran the experiment several
times to find the best sequence length, as Fig. 6c shows.
Table 6
Experiment parameters.
Indicator | Input size | Layer parameters
Embedding layer | BatchSize × 1000 × 1 | Vocab size: 400; embedding size: 128
First Bi-LSTM layer | BatchSize × 1000 × 128 | Dropout rate: 0.5; hidden layer size: 128
SA Block Layers (Block0) | BatchSize × 1000 × 128 | Dropout rate: 0.5; forward and backward LSTM size: 128; BN momentum: 0.99; BN epsilon: 0.001
SA Block Layers (Block1) | BatchSize × 1000 × 128 | Dropout rate: 0.5; forward and backward LSTM size: 128; BN momentum: 0.99; BN epsilon: 0.001
Maxpooling | BatchSize × 128 × 1000 | Pooling size: 1000; pooling axis: 2; output size: BatchSize × 128
Fully connected layer | BatchSize × 256 (128 from maxpooling, 128 from the last output in the timestep direction) | Size: 256 × 1; bias: true
Optimizer: RMSProp | – | Learning rate: 0.001; weight decay: 0.9
Fig. 6. Bi-LSTM experiment.
We did not test longer sequences because of hardware limitations.
With our redundancy removal algorithm, sequences with a length of 1000
are enough to show the behavior of a sample.
Another parameter, 128, is the size of the LSTM layer. Typically,
in a deep learning model, the size of a layer is a power of 2, such as
32, 64, 128 or 256. Our model is under-fitting when the size is 64 and
well-fitted when the size is 128. At the same size, the LSTM layer has
many more parameters than the convolution layer. So, we set the layer
size to 128 and applied it to all LSTM layers. For other parameters,
such as the dropout rate and the batch normalization layer's momentum,
we chose the most commonly used values.
Table 7
The effect of sequence length on measuring time.
Sequence length | Average time per epoch | AUC
200 | 273 s | 0.9644
500 | 485 s | 0.9718
1000 | 852 s | 0.9839
Table 8
AUC of different preprocessing methods.
Method | AUC
Redundant subsequence removal method | 0.983
Continuous same-mode API call removal | 0.947
The original sequence | 0.930
6.3.2. Deep learning results
Different classification results were obtained with different
preprocessing methods in the experiments, as shown in Fig. 6.
• We study the influence of different N-gram sizes when using the
redundant subsequence removal method described in preprocessing
Section 3.2.1. As Fig. 6a shows, 3-gram gives the best result, with an
AUC of 0.9539. Previous research also found that classification with
3-gram performs best [37].
• The selection of different numbers of subsequences by information
gain, introduced in preprocessing Section 3.2.1, is studied. As shown
in Fig. 6b, selecting 1000 subsequences by information gain gives the
best result, with an AUC of 0.964. A smaller subsequence number
indicates that the selected subsequences carry more useful information
about the sample class, which can increase the classification accuracy
when the length of the API call sequences fed into the model is fixed.
• The impact of the length of a sample's API call sequence is studied.
The sample API call sequence is preprocessed by fixing the N-gram
method to 3-gram and the number of subsequences selected by
information gain to 1000. The processed sample sequences are then cut
to lengths of 200, 500 and 1000, respectively, to obtain the curves in
Fig. 6c. The effect of the sequence length on measuring time is shown
in Table 7. The experiment shows that longer sequences give better
results but also longer measuring time. Taking both time consumption
and accuracy into account, we chose 1000 as the sequence length.
• We compare the effect of different preprocessing methods. Fig. 6d
shows the results of the redundant subsequence removal method and the
same-mode call elimination method. The experiment shows that both
preprocessing methods perform better than no preprocessing, and the
redundant subsequence removal method performs best.
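The 3-gram extraction and information-gain subsequence selection used above can be sketched as follows. This is a simplified illustration with binary presence/absence features; the authors' exact selection code is not given in the paper.

```python
import math
from collections import Counter

def ngrams(seq, n=3):
    """All length-n subsequences (as tuples) of a call sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def info_gain(presence, labels):
    """Information gain of one binary feature (subsequence present or not)."""
    def entropy(ys):
        if not ys:
            return 0.0
        c = Counter(ys)
        return -sum((k / len(ys)) * math.log2(k / len(ys)) for k in c.values())
    n = len(labels)
    with_f = [y for p, y in zip(presence, labels) if p]
    without = [y for p, y in zip(presence, labels) if not p]
    cond = (len(with_f) / n) * entropy(with_f) + (len(without) / n) * entropy(without)
    return entropy(labels) - cond

def top_subsequences(samples, labels, k=1000):
    """Rank 3-gram subsequences by information gain and keep the top k."""
    grams = {g for s in samples for g in ngrams(s)}
    scored = []
    for g in grams:
        presence = [g in set(ngrams(s)) for s in samples]
        scored.append((info_gain(presence, labels), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:k]]
```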
Table 8 shows the AUC obtained with the different preprocessing
methods; with the redundant subsequence removal method, the AUC is
0.983. The AUC and ROC together show that the redundant subsequence
removal method can reduce the length of the call sequence, making
training faster and classification more accurate.
6.4. Machine learning experiment
6.4.1. Experimental parameters
The random forest algorithm in Scikit-Learn [38] is used to
classify the malicious samples. 5-fold cross validation is used to
test the performance of the classifier. The sample set contains 2782
samples and was randomly divided into 5 equal parts: one part as the
test set, and the remaining 4 parts for cross validation.
The experiment mainly studies the influence on classification of
the number of decision trees in the random forest, the decision split
evaluation method, the Bootstrap sampling method and the feature
quantity.
The number of decision trees in the random forest indicates how
many decision trees are produced in the random forest. The decision
tree best-split methods are used during decision tree construction as
a standard to measure the relative merits of a split; the options are
the Mean Squared Error and Mean Absolute Error methods. The Bootstrap
sampling parameter indicates whether to use Bootstrap sampling during
decision tree construction, and the feature quantity parameter
specifies how many feature dimensions are considered while searching
for the optimal split in decision tree construction.
6.4.2. Machine learning results
In the random forest algorithm, multiple parameters influence the
result of the experiment. A series of comparative experiments are
conducted to verify the influence of different parameters on the AUC
value. The time consumed in training the model is also considered as a
performance index.
We study the effect of the number of decision trees on the overall
performance of the random forest. In this experiment, we found that
the AUC value increased as the number of decision trees increased. As
shown in Fig. 7a, this also increased the time consumed in the
training process. The AUC gain gradually diminishes as the number of
decision trees grows, so the parameter can be selected according to
the experimental conditions and hardware scale.
The effect of the best-split method on AUC is studied by comparing
the mean squared error and mean absolute error methods. As shown in
Fig. 7b, the Mean Squared Error (MSE) performs better than the Mean
Absolute Error (MAE) in the experiment. The formulas for the two
methods are as follows, where f_i is the predicted result of the
split, y_i is the actual result, and both f_i and y_i are less than 1.
The reason is that the squaring operation of MSE reduces the influence
of a feature's smaller errors on the result.
MSE = \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2    (12)

MAE = \frac{1}{n} \sum_{i=1}^{n} |f_i - y_i|    (13)
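A tiny numeric illustration of this point (with hypothetical values): since all errors are below 1, squaring shrinks each term, so MSE weights small residuals less than MAE does.

```python
# Hypothetical predicted and actual results of a split; all errors < 1.
f = [0.9, 0.6, 0.2]
y = [1.0, 0.5, 0.2]
n = len(f)

# Eq. (12): mean squared error; Eq. (13): mean absolute error.
mse = sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / n
mae = sum(abs(fi - yi) for fi, yi in zip(f, y)) / n
```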
Fig. 7c shows the effect of Bootstrap sampling. In all cases,
Bootstrap sampling has an impact on AUC, and with the Bootstrap method
the classifier is more efficient. The sampling is carried out while
constructing the decision tree, and it better estimates the overall
hypothesis distribution and yields a better classifier.
The feature quantity of the best split is studied, too. To find
it, we conducted experiments comparing the effects of different
numbers of features on the results, shown in Fig. 7d. The total number
of features is 22,043, and according to the experimental results, the
training time is proportional to the number of features.
Finally, we study whether parameter optimization can improve the
AUC and reduce the training time. The experimental results are shown
in Table 9. Parameter optimization not only increases the overall AUC,
but also shortens the training time by 80% to 84%.
Fig. 7. Random forest experiment.
Table 9
Performance with parameter optimization.
Item | Optimization method | Before optimization | After optimization
Number of decision trees | Properly reduce the number of decision trees | AUC: 0.9852, time: 3431 s | AUC: 0.9840, time: 719 s
Feature number | Properly reduce the number of features | AUC: 0.9834, time: 716 s | AUC: 0.9871, time: 115 s
Fig. 8. ROC curve of combined model.
6.5. Random forest & Bi-Residual LSTM combined model
After choosing the best parameters for each model, we achieved an
AUC score of 0.987 and an accuracy of 0.935 with the random forest
model, and an AUC score of 0.993 and an accuracy of 0.957 with the
Bi-Residual LSTM model. Then we proceed to the combination experiment.
First, we calculate the probability of malware in each model. Second,
we decide which model has higher confidence in its result by formula
(14). P_LSTM stands for the result probability of the LSTM model, P_RF
stands for the result of the random forest model, and P stands for the
combined model's result. C_RF and C_LSTM stand for each model's
confidence.

C_LSTM = min(abs(1 - P_LSTM), P_LSTM)
C_RF = min(abs(1 - P_RF), P_RF)
P = P_LSTM if C_LSTM <= C_RF, otherwise P_RF    (14)
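Eq. (14) can be sketched directly (our own illustration). Here C = min(|1 - P|, P) is the distance of a probability from 0 or 1, so a smaller C means a more decisive prediction, and the combined model keeps the more decisive of the two.

```python
def combine(p_lstm, p_rf):
    """Pick the prediction of the more confident model, per Eq. (14).

    p_lstm, p_rf: malware probabilities from the two models, in [0, 1].
    A smaller confidence value C means the probability sits closer to
    0 or 1, i.e. the model is more decisive.
    """
    c_lstm = min(abs(1 - p_lstm), p_lstm)
    c_rf = min(abs(1 - p_rf), p_rf)
    return p_lstm if c_lstm <= c_rf else p_rf
```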
The combined model's ROC curve is shown in Fig. 8. The combined
model's AUC is 0.996, which is better than the single-model results
above.
Table 10
Accuracies on 5 test data set splits.
Test data split | Split 0 | Split 1 | Split 2 | Split 3 | Split 4
Accuracy | 96.7% | 96.7% | 98.5% | 96.7% | 95.1%
Table 11
The accuracy of different algorithms.
Method Year of the method Platform Accuracy
Confidence algorithm [30] 2012 Windows 0.900
OPCODE 2013 Windows 0.929
KNN [39] 2016 Windows 0.90
Neural Networks [24] 2016 Windows 0.856
CNN + LSTM [28] 2016 Windows 0.894
MAMADROID [27] 2016 Android 0.87
Random Forest [29] 2017 Android 0.953
MADAM [20] 2018 Android 0.96
RanDroid [21] 2018 Android 0.977
Combined model in this paper 2018 Windows 0.967
We split the test data set into 5 splits. We ran our algorithm on
these splits several times and recorded the accuracy of each split.
The results, shown in Table 10, indicate that the results of our model
are statistically stable. The average accuracy of the combined model
of random forest and Bi-Residual LSTM is 96.7%.
Table 11 shows the accuracy (ACC) of different algorithms.
[28] preprocesses with the same API function call merging method and a
combined convolutional neural network (CNN) + recurrent neural network
(LSTM) model; its ACC is 0.894. We compare our model with several
malware detection models in Table 11. The ACC of the methods that
detect malware on the Android platform is mostly high, because many
features besides the API can be utilized by their detection models,
such as APK permissions, user behavior, software packages and native
code. The experiment shows that the accuracy of the method in this
paper is better than that of most other methods.
7. Conclusion
Based on dynamic analysis of malware, this paper proposes a
combined learning framework based on machine learning and deep
learning for malware detection. A fast association analysis algorithm
based on argument hashing is introduced, and redundant data
preprocessing and a deep learning model based on recurrent neural
networks are studied for malware detection.
This is the first study to combine a machine learning model with a
deep learning model to overcome the shortcomings of the deep learning
model, making the combined model more robust. Experiments show that
the AUC of the combined learning model is 0.996 and its ACC is 0.967
on the test set, and the method is able to detect unknown malicious
samples.
In our study, the experimental data set is not large and still
needs to be enlarged; in the future, we are going to test the model on
larger malware sample data sets. Some model parameters remain
adjustable, which also needs more experiments to verify. We also want
to study different combination methods and compare their advantages
and disadvantages.
At present, there is little research on combining deep learning
models with machine learning models in the field of malware detection,
and there is still much room for research. This paper hopes to attract
more scholars' attention to and participation in this field.
Acknowledgements
This research work was supported by the National Natural Science
Foundation of China (Grant No. 61472046), the Opening Project of the
Key Lab of Information Network Security of the Ministry of Public
Security (The Third Research Institute of the Ministry of Public
Security), the Seed Funds of the Beijing Association for Science and
Technology, and the NSFOCUS Kunpeng Fund.
References
[1] H. Sun, X. Wang, R. Buyya, et al., CloudEyes: cloud-based malware detection with reversible sketch for resource-constrained internet of things (IoT) devices, Software Pract. Exp. 47 (3) (2017).
[2] McAfee, McAfee Labs 2017 threats predictions, https://www.mcafee.com/enterprise/zh-cn/security-awareness/threats-predictions-2017.html.
[3] P. Faruki, A. Bharmal, V. Laxmi, et al., Android security: a survey of issues, malware penetration, and defenses, IEEE Commun. Surv. Tut. 17 (2) (2017) 998–1022.
[4] Z. Chen, C. Ji, Efficient string matching: an aid to bibliographic search, Commun. ACM 18 (1975) 333–340.
[5] S. Wu, U. Manber, A Fast Algorithm for Multi-Pattern Searching, University of Arizona, 1994, Technical Report TR-94-17.
[6] AV-Comparatives, On-Demand Detection of Malicious Software, Technical Report, AV-Comparatives, Innsbruck, Austria, 2010. Available from www.av-comparatives.org.
[7] W. Yan, N. Ansari, Why anti-virus products slow down your machine? in: Proceedings of IEEE International Conference on Computer Communications and Networks, San Francisco, CA, USA, 2009, pp. 1–6.
[8] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, 2015, pp. 11–20.
[9] I. Santos, F. Brezo, X. Ugarte-Pedrero, et al., Opcode sequences as representation of executables for data-mining-based unknown malware detection, Inf. Sci. 231 (9) (2013) 64–82.
[10] N. Kuzurin, A. Shokurov, N. Varnovsky, V. Zakharov, On the concept of software obfuscation in computer security, Lect. Notes Comput. Sci. 4779 (2007) 281.
[11] M. Karim, A. Walenstein, A. Lakhotia, L. Parida, Malware phylogeny generation using permutations of code, J. Comput. Virol. 1 (2005) 13–23.
[12] M. Christodorescu, S. Jha, Testing malware detectors, ACM SIGSOFT Software Engineering Notes 29 (2004) 34–44.
[13] X. Wang, J. Liu, X. Chen, Say No to Overfitting, University of Pittsburgh, 2015.
[14] Z.C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv preprint, 2015.
[15] A. Athalye, N. Carlini, D. Wagner, Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples, 2018.
[16] A.L. Yuille, C. Liu, Deep nets: what have they ever done for vision? 2018.
[17] I. Santos, F. Brezo, X. Ugarte-Pedrero, et al., Opcode sequences as representation of executables for data-mining-based unknown malware detection, Inf. Sci. 231 (2013) 64–82.
[18] Y. Liu, Employing the Algorithms of Random Forest and Neural Networks for the Detection and Analysis of Malicious Code of Android Applications, Beijing Jiaotong University, 2015.
[19] H. Yang, J. Xu, Android malware detection based on improved random forest, J. Commun. (04) (2017) 8–16.
[20] A. Saracino, D. Sgandurra, G. Dini, et al., MADAM: effective and efficient behavior-based android malware detection and prevention, IEEE Trans. Depend. Secure Comput. (99) (2018) 1-1.
[21] J.D. Koli, RanDroid: android malware detection using random machine learning classifiers, in: International Conference on Technologies for Smart-City Energy Security and Power (ICSESP), IEEE, 2018.
[22] B. Alsulami, A. Srinivasan, H. Dong, et al., Lightweight behavioral malware detection for windows platforms, in: International Conference on Malicious and Unwanted Software, IEEE, 2018, pp. 75–81.
[23] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, in: 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), IEEE, 2015, pp. 11–20.
[24] B. Kolosnjaji, A. Zarras, G. Webster, et al., Deep learning for classification of malware system call sequences, in: Australasian Joint Conference on Artificial Intelligence, Springer International Publishing, 2016, pp. 137–149.
[25] S. Tobiyama, Y. Yamaguchi, H. Shimada, et al., Malware detection with deep neural network using process behavior, in: 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), vol. 2, IEEE, 2016, pp. 577–582.
[26] R. Pascanu, J.W. Stokes, H. Sanossian, et al., Malware classification with recurrent networks, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 1916–1920.
[27] L. Onwuzurike, E. Mariconti, P. Andriotis, et al., MaMaDroid: detecting android malware by building markov chains of behavioral models (extended version), 2016.
[28] P. Mishra, E.S. Pilli, V. Varadharajan, et al., VAED: VMI-assisted evasion detection approach for infrastructure as a service cloud, Concurrency Comput. Pract. Exp. 29 (1) (2017) 1–30.
[29] Z. Jiawang, L. Yanwei, Malware detection system implementation of Android application based on machine learning, Appl. Res. Comput. (06) (2017) 1–6.
[30] C. Ravi, R. Manoharan, Malware detection using windows API sequence and machine learning, Int. J. Comput. Appl. 43 (17) (2012) 12–16.
[31] G. Canfora, E. Medvet, F. Mercaldo, et al., Detecting android malware using sequences of system calls, in: International Workshop on Software Development Lifecycle for Mobile, ACM, 2015, pp. 13–20.
[32] K.T. Guen, K.B. Joong, R. Mina, et al., A multimodal deep learning method for android malware detection using various features, IEEE Trans. Inf. Forens. Secur. (2018) 1-1.
[33] TensorFlow. https://www.tensorflow.org/, 2018.
[34] J. Houvardas, E. Stamatatos, N-gram feature selection for authorship identification, in: International Conference on Artificial Intelligence: Methodology, Systems, and Applications, Springer, Berlin, Heidelberg, 2006, pp. 77–86.
[35] VirusShare. https://virusshare.com, 2018.
[36] VirusTotal. http://www.virustotal.com, 2018.
[37] Q. Huang, Malicious Executables Detection Based on N-Gram System Call Sequences, Harbin Institute of Technology, 2009.
[38] Scikit-Learn. http://scikit-learn.org/.
[39] G. Liao, J. Liu, A malicious code detection method based on data mining and machine learning, J. Inf. Secur. Res. (01) (2016) 74–79.
Xiaofeng Lu is an Associate Professor in the School of Cyberspace
Security, Beijing University of Post and Telecommunications. He
received his PhD degree from Beijing University of Aeronautics &
Astronautics, Beijing, China, in 2010. During his PhD, he held visiting
scholar positions at the Computer Laboratory, University of Cambridge,
UK. His main research interests include cyberspace security,
information security and cryptography.