12
Research Article Session-Based Webshell Detection Using Machine Learning in Web Logs Yixin Wu, 1 Yuqiang Sun, 1 Cheng Huang , 1 Peng Jia, 2 and Luping Liu 2 1 College of Cybersecurity, Sichuan University, Chengdu, China 2 College of Electronics and Information Engineering, Sichuan University, Chengdu, China Correspondence should be addressed to Cheng Huang; [email protected] Received 2 April 2019; Revised 14 July 2019; Accepted 24 July 2019; Published 22 November 2019 Guest Editor: Sebastian Schrittwieser Copyright © 2019 Yixin Wu et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Attackers upload webshell into a web server to achieve the purpose of stealing data, launching a DDoS attack, modifying files with malicious intentions, etc. Once these objects are accomplished, it will bring huge losses to website managers. With the gradual development of encryption and confusion technology, the most common detection approach using taint analysis and feature matching might become less useful. Instead of applying source file codes, POST contents, or all received traffic, this paper demonstrated an intelligent and efficient framework that employs precise sessions derived from the web logs to detect webshell communication. Features were extracted from the raw sequence data in web logs while a statistical method based on time interval was proposed to identify sessions specifically. Besides, the paper leveraged long short-term memory and hidden Markov model to constitute the framework, respectively. Finally, the framework was evaluated with real data. e experiment shows that the LSTM- based model can achieve a higher accuracy rate of 95.97% with a recall rate of 96.15%, which has a much better performance than the HMM-based model. Moreover, the experiment demonstrated the high efficiency of the proposed approach in terms of the quick detection without source code, especially when it only considers detecting for a period of time, as it takes 98.5% less time than the cited related approach to get the result. As long as the webshell behavior is detected, we can pinpoint the anomaly session and utilize the statistical method to find the webshell file accurately. 1. Introduction Webshells have become the main threat challenges for pro- tecting the security of websites. According to the weekly safety report issued by National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC) in 2019, the number of websites with back- doors is growing almost every week [1]. As a web service- based backdoor program, webshell is installed through vul- nerabilities in web applications or weak server security configuration such as SQL injection, the file including and uploading. Webshells with encryption and obfuscation are often used in attacks mostly because they are difficult to detect by WAF and other antivirus software. A hacker can initiate attacks using webshell tools such as Chinese Chopper [2] and achieve quite a few large malicious attacks like data theft, DDoS attacks, and watering hole attacks [3]. When a malicious webshell attack occurs, there are some exceptions as alertness for admin such as files with an ab- normal timestamp, high traffic for a user in very short period time, and files including malicious codes. Attackers can also hide webshell logins in fake error pages. Due to the high usage of webshell in cyberattacks, there has been much previous research in this field. But most researchers focus on the contents of suspicious files [4–6] or POST contents in HTTP requests [7] and thus ignore fea- tures in a sequence of web server logs. With the development of encryption and obfuscation technology [8, 9], it is quite difficult to detect webshell in thousands of website source files, but when we only need to deal with the sequence of several fields in the web logs without considering complex text processing, the problem becomes much simpler. is paper focuses on providing a new webshell detection method based on user’s sessions in web logs to improve the accuracy Hindawi Security and Communication Networks Volume 2019, Article ID 3093809, 11 pages https://doi.org/10.1155/2019/3093809

Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

Research ArticleSession-Based Webshell Detection Using Machine Learning inWeb Logs

Yixin Wu1 Yuqiang Sun1 Cheng Huang 1 Peng Jia2 and Luping Liu2

1College of Cybersecurity Sichuan University Chengdu China2College of Electronics and Information Engineering Sichuan University Chengdu China

Correspondence should be addressed to Cheng Huang opcodesecgmailcom

Received 2 April 2019 Revised 14 July 2019 Accepted 24 July 2019 Published 22 November 2019

Guest Editor Sebastian Schrittwieser

Copyright copy 2019 Yixin Wu et al is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Attackers upload webshell into a web server to achieve the purpose of stealing data launching a DDoS attack modifying les withmalicious intentions etc Once these objects are accomplished it will bring huge losses to website managers With the gradualdevelopment of encryption and confusion technology the most common detection approach using taint analysis and featurematching might become less useful Instead of applying source le codes POST contents or all received trac this paperdemonstrated an intelligent and ecient framework that employs precise sessions derived from the web logs to detect webshellcommunication Features were extracted from the raw sequence data in web logs while a statistical method based on time intervalwas proposed to identify sessions specically Besides the paper leveraged long short-term memory and hidden Markov model toconstitute the framework respectively Finally the framework was evaluated with real datae experiment shows that the LSTM-based model can achieve a higher accuracy rate of 9597 with a recall rate of 9615 which has a much better performance thanthe HMM-based model Moreover the experiment demonstrated the high eciency of the proposed approach in terms of thequick detection without source code especially when it only considers detecting for a period of time as it takes 985 less timethan the cited related approach to get the result As long as the webshell behavior is detected we can pinpoint the anomaly sessionand utilize the statistical method to nd the webshell le accurately

1 Introduction

Webshells have become the main threat challenges for pro-tecting the security of websites According to the weekly safetyreport issued by National Computer Network EmergencyResponse Technical TeamCoordination Center of China(CNCERTCC) in 2019 the number of websites with back-doors is growing almost every week [1] As a web service-based backdoor program webshell is installed through vul-nerabilities in web applications or weak server securityconguration such as SQL injection the le including anduploading Webshells with encryption and obfuscation areoften used in attacks mostly because they are dicult to detectby WAF and other antivirus software A hacker can initiateattacks using webshell tools such as Chinese Chopper [2] andachieve quite a few large malicious attacks like data theftDDoS attacks and watering hole attacks [3]

When a malicious webshell attack occurs there are someexceptions as alertness for admin such as les with an ab-normal timestamp high trac for a user in very short periodtime and les including malicious codes Attackers can alsohide webshell logins in fake error pages

Due to the high usage of webshell in cyberattacks therehas been much previous research in this eld But mostresearchers focus on the contents of suspicious les [4ndash6] orPOST contents in HTTP requests [7] and thus ignore fea-tures in a sequence of web server logs With the developmentof encryption and obfuscation technology [8 9] it is quitedicult to detect webshell in thousands of website sourceles but when we only need to deal with the sequence ofseveral elds in the web logs without considering complextext processing the problem becomes much simpler ispaper focuses on providing a newwebshell detectionmethodbased on userrsquos sessions in web logs to improve the accuracy

HindawiSecurity and Communication NetworksVolume 2019 Article ID 3093809 11 pageshttpsdoiorg10115520193093809

of webshell detection simply and effectively Moreover if thewebsite defenders can identify malicious webshell imme-diately they can prevent related cyber threats

is paper presents a comprehensive framework includinglog data collection feature extraction session identificationand comparison of two machine learning methods e maininnovations of this paper can be summarized as follows

(i) e presented model utilized a session-basedmethod to detect webshell which is simpler andmore efficient than existing methods Sessionidentification can be more precise using a statisticalmethod which roughly calculates the time intervalof each entry in each session at first and then set thethreshold by the quantile in the statistics to dividethe session more detailed e experiment indicatesthat the model obtained the highest accuracy andrecall rate when the threshold was 70 quantile

(ii) e paper compared the long short-term memorywith the hidden Markov model which are com-monly used to process sequence data and it sug-gested that the long short-term memory modelwhich can provide a high accuracy rate of 9597with a recall rate of 9615 has much better per-formance than the latter one

(iii) e proposed model could only need access log filesto check the malicious behavior of webshell withoutany web application source files e results showthat the LSTM-based model takes 985 less timethan previous related approach to detect whether awebsite has a webshell attack in a period of time

As a remainder related work is briefly discussed inSection 2 A complete framework is explained in Section 3Section 4 presents the entire experiment and evaluationprocess in detail At last the conclusion is in Section 5

2 Related Work

21OutlineofWebshell Webshell is a kind of software whichusually assists the administrator to manipulate the server Butin some cases attackers will use some malicious webshells tocontrol the server to achieve malicious purposes For at-tackers webshell is a kind of backdoor generally written insome scripting language like ASP PHP or JSPese kinds ofweb scripts are able to create dynamic interactive sites so theattacker is able to control the web server through web pagesese behaviors will be recorded in the access log [10]

211 Principle of Webshell Webshell can work by sendingHTTP requests to the specific page and use some functionsto execute the instruction sent by the attacker e results ofthese instructions will be sent to the attacker as the responseof these HTTP requests [11]

212 Classification of Webshell According to the functionand size of scripting language webshell can be roughlydivided into three categories as follows

(i) Big Trojan It has a large size and comprehensivefunctions for command execution database oper-ations and other malicious intentions Besides afriendly graphical interface is applied to the bigTrojan

(ii) OneWord Trojan It is a Trojan with only one line ofcode Due to its shortness it is often embedded innormal files or pictures It can perform manyfunctions like the big Trojan when connected to theinitial attack tools such as Chinese Chopper

(iii) Small Trojan It is small and easy to hide butgenerally only has an upload function Since mostwebsites have size restrictions when uploading filesattackers generally first obtain upload permissionthrough the small Trojan and then upload the bigTrojan to the website to perform key functions

213 Escape of Webshell In order not to be discovered bythe administrator of the website webshells usually undergo alot of distortion Some escape methods are as follows

(i) Pass parameters with less common fields whilewebsites generally use request field to pass pa-rameters this method uses some unusual fields suchas HTTP referrer and user agent

(ii) Encrypt sensitive features attackers use somecommon encryption algorithms such as base64 [12]and rot 13 to encrypt some key functions For agreater probability of escape some tools evencustomize encryption algorithms

(iii) Multiple encoding and compression hackerschange the original static features of the code bymultiple encoding combined with compressiontechnology to reduce the possibility of detection

22Method ofWebshell Detection We did some research onthe previous detection of webshell At the earliest webshelldetection takes a manual identification method is is theoldest and most traditional way to detect webshells whichplaces high demands on the administrators of the websiteAdministrators are supposed to have a comprehensive graspof the website files and have a high recognition ability forsome newly added exception files [13] such as some namingfiles passbyphp passasp and ajsp Besides these small filesshould be treated carefully because there are probably oneword Trojans After finding suspicious files we need toanalyze the contents of the file e most thorough way is totake a look at the entire file carefully but it will take a bunchof time A better way is to search for some sensitive functionssuch as exec() shell exec() and system() and check theirparameters carefully [14]

Static feature webshell detection is the hot trend of re-search It is an upgraded version of manual identificationbut they are almost identical is method focuses on thefeatures of file contents Due to numerous features it oftenuses machine learning to improve effectiveness and accu-racy For example in [15] an approach based on optimal

2 Security and Communication Networks

thresholds was proposed to identify files containing mali-cious codes from web applicationse detection system willscan and find malicious codes in each file of the web ap-plication analyzing the features including keywords filepermissions and owner Instead of the source file codesTian et al [16] divided the POST contents in the HTTPrequest into several words which were represented in theform of vectors using the Word2vec model [17] and thenwere input into the CNN in a fixed-size matrix Experimentshave shown that such a method can achieve high accuracywhich was also the first time that CNN [18] applied towebshell detection However if we just detect by signaturematching we can only do well in detecting webshells thathave known features and it does not apply to detect un-known webshells In the literature [19] an approach that canbe used to predict unknown webshells was proposed Su-pervised machine learning and matrix decomposition wereused to generate original and unknown webshell features byanalyzing different features of known pages e paper [20]focuses on the detection of PHP webshell which used thetext classifier fastText [21] developed by Facebook and therandom forest algorithm to build the model As a compiledintermediate language of PHP scripts PHP opcode sequencewas regarded as an important feature of webshell detectionis paper can be a good inspiration because it starts tobreak away from the webshell file itself

ere are also some detection methods based on otherfeatures For example dynamic feature detection uses thesystem commands network traffic and state exceptions usedby webshell to determine the threat level of the actionWebshell is usually confused and encrypted to avoid de-tection of static features When the webshell is runningsystem commands must be sent to the system to reach thepurpose of operating the database or even the system ismethod monitors and even intercepts system commands bydetecting system calls and deeply detects the security of thescript from the behavior mode Zhang et al [22] proposed acharacter-level content feature transformationmethod whichcombined the features of CNN and LSTM to construct a newwebshell traffic detection model is model preserves thesequential features in network traffic and reduces the featuredimension Experiments have shown that this model can beused to detect unknown webshells while running on largewebsites Besides Shi et al [23] proposed a log-based methodfor webshell detection which uses features such as textfeatures statistical features and page association about thedegree of graph theory But the session simply uses the IP fieldand the user-agent field for a rough distinction which treatsall access by a user as a session e request field is used formachine learning and it is a text processing problem like thestatic detection we mentioned earlier in essence

Among the methods we mentioned above the defendersneed to scan and look for malicious codes inside every file ofthe web application or every POST content in the web logs[24] e drawbacks of these methods are the heavyworkload What is worse with the development of en-cryption and obfuscation techniques webshell detectionaccuracy will be much lower because webshell detectiontools are mostly based on signature matching As we

mentioned before webshell mainly works by sending HTTPrequests Based on this theory a new direction for webshelldetection is proposed which focuses on using the raw se-quence data without POST contents in web logs

3 Framework

31 Architecture e purpose of this paper is to detectwebshell according to the sessions extracted from web serverlogs without POST contents or source file codes It iscomposed of several components as illustrated in Figure 1Firstly we use normal log files on a large website and collecthoneypot logs with webshell communication for experi-mental data collection In the second part instead ofextracting the fields directly from the web logs we use the IPfield and the user-agent field to roughly divide the collectedlogs into different sessions count the time interval in everysession and set the threshold to identify the session in moredetail first en we use the hidden Markov model and longshort-term memory to build our model respectively andcompare which one has better performance

32 Raw Data Collection In the process of generating datawe use different kinds of webshells including big Trojanssmall Trojans and one word Trojans which differ in size andfunction Besides these webshells are placed in differentdepths of the website we built and different webshell toolssuch as Chinese Chopper andWeBaCoo are used to initializethe attack In this paper the log format of the website we useis Apache but our framework can be applied to othercommon server log formats such as Nginx

33 Data Processing

331 Data Extraction All the logs we collected are in thecommon Apache2 format ere are close to 10 fields in thisformat but there is no need to use all of them ereforesome fields were extracted from the web logs e wholeprocess is shown in Figure 2 Feature sequence consists ofsource IP address timestamps user agent status code bytesreferrer and request in which the request field is subdividedinto method field and path field

e IP field and user-agent field are used to identifysessions Because there may be multiple visitors under thesame IP it is necessary to combine user agent to makejudgments At the same time we roughly regard a visitor as asession for the time beinge status code field and byte fieldcan be utilized to construct feature vectors without anychanges In the method field the GETmethod is representedby 1 POSTmethod is represented by 2 and other methodsare represented by 0 Meanwhile in the referrer field ldquominus rdquo isrepresented by 0 while 1 for the website 2 for the webcrawler and 3 for the external websites In the timestampsfield we calculate the time difference between entries inevery session and fill 0 in the last vector of every session

e path field is encoded with the degree of relevance ofeach entry access path in the session e first entry of eachsession is coded as minus 1 because there is no access in front of it

Security and Communication Networks 3

and the relative relationship cannot be discriminated fromthe second le the access to the same le is marked as 0 andwhen accessing dicopyerent les the distance between dicopyerentles is calculated by the number of directory switches plusone to encodee process of encoding all paths for a sessionis shown in Figure 3 When all the feature elds are pro-cessed a session will transform into a sequence of features asillustrated in Table 1 Finally the feature vector we get will besix-dimensional including byte eld method eld path eldreferrer eld status code eld and time interval

332 Session Identication In the previous section we justmade a rough distinction between the sessions in every log letreating a visitor as a session Butmore often a visitor can accessat dicopyerent times and generate multiple sessionsus we use amore scientic statistical method for session identication

We calculate and count the time interval between entriesin every session which is roughly generated in the previoussection After counting the time intervals in all sessions we

sort all of them and explore the most appropriate quantile asa threshold We compare every time interval to a thresholdand if the time interval is greater than the threshold thesession will be subdivided into two smaller sessions and soforth e whole process is illustrated in Figure 4

When all the data have been processed the framework willcheck all the sessions and delete those sessions that containonly one entry as it assumes that if a user has only one access itis highly unlikely that there is a webshell communication

34 Classication Model Webshell detection is a two-classtask and because of variable length serialization in the logsequence we choose long short-term memory and thehidden Markov model to construct the classication modelrespectively

341 Long Short-Term Memory Long short-term memorynetworks are well suited to classify and process based on

Normal Abnormal

Originaldata

Data collection

Sequence

Featureextraction

IP address

User agent

Rough division

Precise division

Classificationmodel

Traintestdataset

LSTM

Normal Webshell

ClassifyHMM

Time interval

Statisticalanalysis

Figure 1 e architecture of the proposed method

119393195 -- [24Mar2018062553 ndash 0400] ldquoGET imgrssgif HTTP11rdquo 200 1036 ldquohttpswwwexamplecomrdquoldquoMozilla50(Windows NT 100 Win64 x64)rdquo

Host

Path

Time

Method

Referrer

Status

Bytes

User agent

119393195

imgrssgif

24Mar2018062553

GET

httpwwwexamplecom

200

1036

Mozilla50 (Windows NT 100 Win64 x64)

Figure 2 Feature extraction based on log entry

4 Security and Communication Networks

sequence data with its faster convergence and the ability todetect long-term dependencies in data e concept wasoriginally proposed in 1997 by Hochreiter and Schmidhuber[25] e emergence of LSTM is mainly to solve the problemof gradient disappearance and gradient explosion in longsequence training Compared with traditional RNN wherethere is only one delivery state the LSTM has two deliverystates namely the long-term state c(t) and the short-termstate h(t) Besides it introduces three ldquogatesrdquo to control thelong-term state as illustrated in Figure 5

(i) Forget gate f(t) control which of the long-termstates of last moment c(tminus 1) should be discarded

(ii) Input gate i(t) control which parts of the currenttime input x(t) and the last short-term state h(tminus 1)will be added to the long-term state

(iii) Output gate o(t) control which parts of the currentlong-term state c(t) should be output as h(t)

119393195 -- [24Mar2018062552 - 0400] ldquoGET HTTP11rdquo 200 13108 ldquo-rdquo ldquoMozilla50 (Windows NT 100Win64 x64) AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018062553 - 0400] ldquoGETphotonopng HTTP11rdquo 200 4898 ldquohttps examplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195--[24Mar2018062555 - 0400] ldquoGETimgrssgif HTTP11rdquo 200 1036 ldquohttpsexamplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018101129 - 0400] ldquoGETbook HTTP11rdquo 200 13125 ldquo-rdquo ldquoMozilla50 (WindowsNT 100 Win64 x64) AppleWebKit53736 (KHTML

like Gecko) Chrome6503325181 Safari53736rdquo

Raw data

photonopng

imgrssgif

book

Path

ndash1

2

3

2

Encoded path

Figure 3 e process of encoding all paths for a session

Table 1 Session transformation

Log messages Feature vectorLog entry0 [bytes0 method0 path0 referrer0 statuscode0 t1 minus t0]Log entry1 [bytes1 method1 path1 referrer1 statuscode1 t2 minus t1]Log entry2 [bytes2 method2 path2 referrer2 statuscode2 t3 minus t2]⋮ ⋮Log entryn [bytesnmethodn pathn referrern statuscoden 0]

Statistical analysisof time interval

Division

Division

Originsequence

Rough sessions

Session 1

Session 2

Precise sessions

Session 3

Session 4

IP addressuser-agent

Figure 4 Schematic diagram of session identication

LSTM block

C(tminus1) C(t) h(t)

g(t)

h(tminus1) x(t)

Main layer for analyzingthe current input x(t) andthe previous short-term

state h(tminus1)

Figure 5 e core of LSTM

Security and Communication Networks 5

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 2: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

of webshell detection simply and effectively Moreover if thewebsite defenders can identify malicious webshell imme-diately they can prevent related cyber threats

is paper presents a comprehensive framework includinglog data collection feature extraction session identificationand comparison of two machine learning methods e maininnovations of this paper can be summarized as follows

(i) e presented model utilized a session-basedmethod to detect webshell which is simpler andmore efficient than existing methods Sessionidentification can be more precise using a statisticalmethod which roughly calculates the time intervalof each entry in each session at first and then set thethreshold by the quantile in the statistics to dividethe session more detailed e experiment indicatesthat the model obtained the highest accuracy andrecall rate when the threshold was 70 quantile

(ii) e paper compared the long short-term memorywith the hidden Markov model which are com-monly used to process sequence data and it sug-gested that the long short-term memory modelwhich can provide a high accuracy rate of 9597with a recall rate of 9615 has much better per-formance than the latter one

(iii) e proposed model could only need access log filesto check the malicious behavior of webshell withoutany web application source files e results showthat the LSTM-based model takes 985 less timethan previous related approach to detect whether awebsite has a webshell attack in a period of time

As a remainder related work is briefly discussed inSection 2 A complete framework is explained in Section 3Section 4 presents the entire experiment and evaluationprocess in detail At last the conclusion is in Section 5

2 Related Work

21OutlineofWebshell Webshell is a kind of software whichusually assists the administrator to manipulate the server Butin some cases attackers will use some malicious webshells tocontrol the server to achieve malicious purposes For at-tackers webshell is a kind of backdoor generally written insome scripting language like ASP PHP or JSPese kinds ofweb scripts are able to create dynamic interactive sites so theattacker is able to control the web server through web pagesese behaviors will be recorded in the access log [10]

211 Principle of Webshell Webshell can work by sendingHTTP requests to the specific page and use some functionsto execute the instruction sent by the attacker e results ofthese instructions will be sent to the attacker as the responseof these HTTP requests [11]

212 Classification of Webshell According to the functionand size of scripting language webshell can be roughlydivided into three categories as follows

(i) Big Trojan It has a large size and comprehensivefunctions for command execution database oper-ations and other malicious intentions Besides afriendly graphical interface is applied to the bigTrojan

(ii) OneWord Trojan It is a Trojan with only one line ofcode Due to its shortness it is often embedded innormal files or pictures It can perform manyfunctions like the big Trojan when connected to theinitial attack tools such as Chinese Chopper

(iii) Small Trojan It is small and easy to hide butgenerally only has an upload function Since mostwebsites have size restrictions when uploading filesattackers generally first obtain upload permissionthrough the small Trojan and then upload the bigTrojan to the website to perform key functions

213 Escape of Webshell In order not to be discovered bythe administrator of the website webshells usually undergo alot of distortion Some escape methods are as follows

(i) Pass parameters with less common fields whilewebsites generally use request field to pass pa-rameters this method uses some unusual fields suchas HTTP referrer and user agent

(ii) Encrypt sensitive features attackers use somecommon encryption algorithms such as base64 [12]and rot 13 to encrypt some key functions For agreater probability of escape some tools evencustomize encryption algorithms

(iii) Multiple encoding and compression hackerschange the original static features of the code bymultiple encoding combined with compressiontechnology to reduce the possibility of detection

22Method ofWebshell Detection We did some research onthe previous detection of webshell At the earliest webshelldetection takes a manual identification method is is theoldest and most traditional way to detect webshells whichplaces high demands on the administrators of the websiteAdministrators are supposed to have a comprehensive graspof the website files and have a high recognition ability forsome newly added exception files [13] such as some namingfiles passbyphp passasp and ajsp Besides these small filesshould be treated carefully because there are probably oneword Trojans After finding suspicious files we need toanalyze the contents of the file e most thorough way is totake a look at the entire file carefully but it will take a bunchof time A better way is to search for some sensitive functionssuch as exec() shell exec() and system() and check theirparameters carefully [14]

Static feature webshell detection is the hot trend of re-search It is an upgraded version of manual identificationbut they are almost identical is method focuses on thefeatures of file contents Due to numerous features it oftenuses machine learning to improve effectiveness and accu-racy For example in [15] an approach based on optimal

2 Security and Communication Networks

thresholds was proposed to identify files containing mali-cious codes from web applicationse detection system willscan and find malicious codes in each file of the web ap-plication analyzing the features including keywords filepermissions and owner Instead of the source file codesTian et al [16] divided the POST contents in the HTTPrequest into several words which were represented in theform of vectors using the Word2vec model [17] and thenwere input into the CNN in a fixed-size matrix Experimentshave shown that such a method can achieve high accuracywhich was also the first time that CNN [18] applied towebshell detection However if we just detect by signaturematching we can only do well in detecting webshells thathave known features and it does not apply to detect un-known webshells In the literature [19] an approach that canbe used to predict unknown webshells was proposed Su-pervised machine learning and matrix decomposition wereused to generate original and unknown webshell features byanalyzing different features of known pages e paper [20]focuses on the detection of PHP webshell which used thetext classifier fastText [21] developed by Facebook and therandom forest algorithm to build the model As a compiledintermediate language of PHP scripts PHP opcode sequencewas regarded as an important feature of webshell detectionis paper can be a good inspiration because it starts tobreak away from the webshell file itself

ere are also some detection methods based on otherfeatures For example dynamic feature detection uses thesystem commands network traffic and state exceptions usedby webshell to determine the threat level of the actionWebshell is usually confused and encrypted to avoid de-tection of static features When the webshell is runningsystem commands must be sent to the system to reach thepurpose of operating the database or even the system ismethod monitors and even intercepts system commands bydetecting system calls and deeply detects the security of thescript from the behavior mode Zhang et al [22] proposed acharacter-level content feature transformationmethod whichcombined the features of CNN and LSTM to construct a newwebshell traffic detection model is model preserves thesequential features in network traffic and reduces the featuredimension Experiments have shown that this model can beused to detect unknown webshells while running on largewebsites Besides Shi et al [23] proposed a log-based methodfor webshell detection which uses features such as textfeatures statistical features and page association about thedegree of graph theory But the session simply uses the IP fieldand the user-agent field for a rough distinction which treatsall access by a user as a session e request field is used formachine learning and it is a text processing problem like thestatic detection we mentioned earlier in essence

Among the methods we mentioned above the defendersneed to scan and look for malicious codes inside every file ofthe web application or every POST content in the web logs[24] e drawbacks of these methods are the heavyworkload What is worse with the development of en-cryption and obfuscation techniques webshell detectionaccuracy will be much lower because webshell detectiontools are mostly based on signature matching As we

mentioned before webshell mainly works by sending HTTPrequests Based on this theory a new direction for webshelldetection is proposed which focuses on using the raw se-quence data without POST contents in web logs

3 Framework

31 Architecture e purpose of this paper is to detectwebshell according to the sessions extracted from web serverlogs without POST contents or source file codes It iscomposed of several components as illustrated in Figure 1Firstly we use normal log files on a large website and collecthoneypot logs with webshell communication for experi-mental data collection In the second part instead ofextracting the fields directly from the web logs we use the IPfield and the user-agent field to roughly divide the collectedlogs into different sessions count the time interval in everysession and set the threshold to identify the session in moredetail first en we use the hidden Markov model and longshort-term memory to build our model respectively andcompare which one has better performance

32 Raw Data Collection In the process of generating datawe use different kinds of webshells including big Trojanssmall Trojans and one word Trojans which differ in size andfunction Besides these webshells are placed in differentdepths of the website we built and different webshell toolssuch as Chinese Chopper andWeBaCoo are used to initializethe attack In this paper the log format of the website we useis Apache but our framework can be applied to othercommon server log formats such as Nginx

33 Data Processing

331 Data Extraction All the logs we collected are in thecommon Apache2 format ere are close to 10 fields in thisformat but there is no need to use all of them ereforesome fields were extracted from the web logs e wholeprocess is shown in Figure 2 Feature sequence consists ofsource IP address timestamps user agent status code bytesreferrer and request in which the request field is subdividedinto method field and path field

e IP field and user-agent field are used to identifysessions Because there may be multiple visitors under thesame IP it is necessary to combine user agent to makejudgments At the same time we roughly regard a visitor as asession for the time beinge status code field and byte fieldcan be utilized to construct feature vectors without anychanges In the method field the GETmethod is representedby 1 POSTmethod is represented by 2 and other methodsare represented by 0 Meanwhile in the referrer field ldquominus rdquo isrepresented by 0 while 1 for the website 2 for the webcrawler and 3 for the external websites In the timestampsfield we calculate the time difference between entries inevery session and fill 0 in the last vector of every session

e path field is encoded with the degree of relevance ofeach entry access path in the session e first entry of eachsession is coded as minus 1 because there is no access in front of it

Security and Communication Networks 3

and the relative relationship cannot be discriminated fromthe second le the access to the same le is marked as 0 andwhen accessing dicopyerent les the distance between dicopyerentles is calculated by the number of directory switches plusone to encodee process of encoding all paths for a sessionis shown in Figure 3 When all the feature elds are pro-cessed a session will transform into a sequence of features asillustrated in Table 1 Finally the feature vector we get will besix-dimensional including byte eld method eld path eldreferrer eld status code eld and time interval

332 Session Identication In the previous section we justmade a rough distinction between the sessions in every log letreating a visitor as a session Butmore often a visitor can accessat dicopyerent times and generate multiple sessionsus we use amore scientic statistical method for session identication

We calculate and count the time interval between entriesin every session which is roughly generated in the previoussection After counting the time intervals in all sessions we

sort all of them and explore the most appropriate quantile asa threshold We compare every time interval to a thresholdand if the time interval is greater than the threshold thesession will be subdivided into two smaller sessions and soforth e whole process is illustrated in Figure 4

When all the data have been processed the framework willcheck all the sessions and delete those sessions that containonly one entry as it assumes that if a user has only one access itis highly unlikely that there is a webshell communication

34 Classication Model Webshell detection is a two-classtask and because of variable length serialization in the logsequence we choose long short-term memory and thehidden Markov model to construct the classication modelrespectively

341 Long Short-Term Memory Long short-term memorynetworks are well suited to classify and process based on

Normal Abnormal

Originaldata

Data collection

Sequence

Featureextraction

IP address

User agent

Rough division

Precise division

Classificationmodel

Traintestdataset

LSTM

Normal Webshell

ClassifyHMM

Time interval

Statisticalanalysis

Figure 1 e architecture of the proposed method

119393195 -- [24Mar2018062553 ndash 0400] ldquoGET imgrssgif HTTP11rdquo 200 1036 ldquohttpswwwexamplecomrdquoldquoMozilla50(Windows NT 100 Win64 x64)rdquo

Host

Path

Time

Method

Referrer

Status

Bytes

User agent

119393195

imgrssgif

24Mar2018062553

GET

httpwwwexamplecom

200

1036

Mozilla50 (Windows NT 100 Win64 x64)

Figure 2 Feature extraction based on log entry

4 Security and Communication Networks

sequence data with its faster convergence and the ability todetect long-term dependencies in data e concept wasoriginally proposed in 1997 by Hochreiter and Schmidhuber[25] e emergence of LSTM is mainly to solve the problemof gradient disappearance and gradient explosion in longsequence training Compared with traditional RNN wherethere is only one delivery state the LSTM has two deliverystates namely the long-term state c(t) and the short-termstate h(t) Besides it introduces three ldquogatesrdquo to control thelong-term state as illustrated in Figure 5

(i) Forget gate f(t) control which of the long-termstates of last moment c(tminus 1) should be discarded

(ii) Input gate i(t) control which parts of the currenttime input x(t) and the last short-term state h(tminus 1)will be added to the long-term state

(iii) Output gate o(t) control which parts of the currentlong-term state c(t) should be output as h(t)

119393195 -- [24Mar2018062552 - 0400] ldquoGET HTTP11rdquo 200 13108 ldquo-rdquo ldquoMozilla50 (Windows NT 100Win64 x64) AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018062553 - 0400] ldquoGETphotonopng HTTP11rdquo 200 4898 ldquohttps examplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195--[24Mar2018062555 - 0400] ldquoGETimgrssgif HTTP11rdquo 200 1036 ldquohttpsexamplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018101129 - 0400] ldquoGETbook HTTP11rdquo 200 13125 ldquo-rdquo ldquoMozilla50 (WindowsNT 100 Win64 x64) AppleWebKit53736 (KHTML

like Gecko) Chrome6503325181 Safari53736rdquo

Raw data

photonopng

imgrssgif

book

Path

ndash1

2

3

2

Encoded path

Figure 3 e process of encoding all paths for a session

Table 1 Session transformation

Log messages Feature vectorLog entry0 [bytes0 method0 path0 referrer0 statuscode0 t1 minus t0]Log entry1 [bytes1 method1 path1 referrer1 statuscode1 t2 minus t1]Log entry2 [bytes2 method2 path2 referrer2 statuscode2 t3 minus t2]⋮ ⋮Log entryn [bytesnmethodn pathn referrern statuscoden 0]

Statistical analysisof time interval

Division

Division

Originsequence

Rough sessions

Session 1

Session 2

Precise sessions

Session 3

Session 4

IP addressuser-agent

Figure 4 Schematic diagram of session identication

LSTM block

C(tminus1) C(t) h(t)

g(t)

h(tminus1) x(t)

Main layer for analyzingthe current input x(t) andthe previous short-term

state h(tminus1)

Figure 5 e core of LSTM

Security and Communication Networks 5

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 3: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

thresholds was proposed to identify files containing mali-cious codes from web applicationse detection system willscan and find malicious codes in each file of the web ap-plication analyzing the features including keywords filepermissions and owner Instead of the source file codesTian et al [16] divided the POST contents in the HTTPrequest into several words which were represented in theform of vectors using the Word2vec model [17] and thenwere input into the CNN in a fixed-size matrix Experimentshave shown that such a method can achieve high accuracywhich was also the first time that CNN [18] applied towebshell detection However if we just detect by signaturematching we can only do well in detecting webshells thathave known features and it does not apply to detect un-known webshells In the literature [19] an approach that canbe used to predict unknown webshells was proposed Su-pervised machine learning and matrix decomposition wereused to generate original and unknown webshell features byanalyzing different features of known pages e paper [20]focuses on the detection of PHP webshell which used thetext classifier fastText [21] developed by Facebook and therandom forest algorithm to build the model As a compiledintermediate language of PHP scripts PHP opcode sequencewas regarded as an important feature of webshell detectionis paper can be a good inspiration because it starts tobreak away from the webshell file itself

ere are also some detection methods based on otherfeatures For example dynamic feature detection uses thesystem commands network traffic and state exceptions usedby webshell to determine the threat level of the actionWebshell is usually confused and encrypted to avoid de-tection of static features When the webshell is runningsystem commands must be sent to the system to reach thepurpose of operating the database or even the system ismethod monitors and even intercepts system commands bydetecting system calls and deeply detects the security of thescript from the behavior mode Zhang et al [22] proposed acharacter-level content feature transformationmethod whichcombined the features of CNN and LSTM to construct a newwebshell traffic detection model is model preserves thesequential features in network traffic and reduces the featuredimension Experiments have shown that this model can beused to detect unknown webshells while running on largewebsites Besides Shi et al [23] proposed a log-based methodfor webshell detection which uses features such as textfeatures statistical features and page association about thedegree of graph theory But the session simply uses the IP fieldand the user-agent field for a rough distinction which treatsall access by a user as a session e request field is used formachine learning and it is a text processing problem like thestatic detection we mentioned earlier in essence

Among the methods we mentioned above the defendersneed to scan and look for malicious codes inside every file ofthe web application or every POST content in the web logs[24] e drawbacks of these methods are the heavyworkload What is worse with the development of en-cryption and obfuscation techniques webshell detectionaccuracy will be much lower because webshell detectiontools are mostly based on signature matching As we

mentioned before webshell mainly works by sending HTTPrequests Based on this theory a new direction for webshelldetection is proposed which focuses on using the raw se-quence data without POST contents in web logs

3 Framework

31 Architecture e purpose of this paper is to detectwebshell according to the sessions extracted from web serverlogs without POST contents or source file codes It iscomposed of several components as illustrated in Figure 1Firstly we use normal log files on a large website and collecthoneypot logs with webshell communication for experi-mental data collection In the second part instead ofextracting the fields directly from the web logs we use the IPfield and the user-agent field to roughly divide the collectedlogs into different sessions count the time interval in everysession and set the threshold to identify the session in moredetail first en we use the hidden Markov model and longshort-term memory to build our model respectively andcompare which one has better performance

32 Raw Data Collection In the process of generating datawe use different kinds of webshells including big Trojanssmall Trojans and one word Trojans which differ in size andfunction Besides these webshells are placed in differentdepths of the website we built and different webshell toolssuch as Chinese Chopper andWeBaCoo are used to initializethe attack In this paper the log format of the website we useis Apache but our framework can be applied to othercommon server log formats such as Nginx

33 Data Processing

331 Data Extraction All the logs we collected are in thecommon Apache2 format ere are close to 10 fields in thisformat but there is no need to use all of them ereforesome fields were extracted from the web logs e wholeprocess is shown in Figure 2 Feature sequence consists ofsource IP address timestamps user agent status code bytesreferrer and request in which the request field is subdividedinto method field and path field

e IP field and user-agent field are used to identifysessions Because there may be multiple visitors under thesame IP it is necessary to combine user agent to makejudgments At the same time we roughly regard a visitor as asession for the time beinge status code field and byte fieldcan be utilized to construct feature vectors without anychanges In the method field the GETmethod is representedby 1 POSTmethod is represented by 2 and other methodsare represented by 0 Meanwhile in the referrer field ldquominus rdquo isrepresented by 0 while 1 for the website 2 for the webcrawler and 3 for the external websites In the timestampsfield we calculate the time difference between entries inevery session and fill 0 in the last vector of every session

e path field is encoded with the degree of relevance ofeach entry access path in the session e first entry of eachsession is coded as minus 1 because there is no access in front of it

Security and Communication Networks 3

and the relative relationship cannot be discriminated fromthe second le the access to the same le is marked as 0 andwhen accessing dicopyerent les the distance between dicopyerentles is calculated by the number of directory switches plusone to encodee process of encoding all paths for a sessionis shown in Figure 3 When all the feature elds are pro-cessed a session will transform into a sequence of features asillustrated in Table 1 Finally the feature vector we get will besix-dimensional including byte eld method eld path eldreferrer eld status code eld and time interval

332 Session Identication In the previous section we justmade a rough distinction between the sessions in every log letreating a visitor as a session Butmore often a visitor can accessat dicopyerent times and generate multiple sessionsus we use amore scientic statistical method for session identication

We calculate and count the time interval between entriesin every session which is roughly generated in the previoussection After counting the time intervals in all sessions we

sort all of them and explore the most appropriate quantile asa threshold We compare every time interval to a thresholdand if the time interval is greater than the threshold thesession will be subdivided into two smaller sessions and soforth e whole process is illustrated in Figure 4

When all the data have been processed the framework willcheck all the sessions and delete those sessions that containonly one entry as it assumes that if a user has only one access itis highly unlikely that there is a webshell communication

34 Classication Model Webshell detection is a two-classtask and because of variable length serialization in the logsequence we choose long short-term memory and thehidden Markov model to construct the classication modelrespectively

341 Long Short-Term Memory Long short-term memorynetworks are well suited to classify and process based on

Normal Abnormal

Originaldata

Data collection

Sequence

Featureextraction

IP address

User agent

Rough division

Precise division

Classificationmodel

Traintestdataset

LSTM

Normal Webshell

ClassifyHMM

Time interval

Statisticalanalysis

Figure 1 e architecture of the proposed method

119393195 -- [24Mar2018062553 ndash 0400] ldquoGET imgrssgif HTTP11rdquo 200 1036 ldquohttpswwwexamplecomrdquoldquoMozilla50(Windows NT 100 Win64 x64)rdquo

Host

Path

Time

Method

Referrer

Status

Bytes

User agent

119393195

imgrssgif

24Mar2018062553

GET

httpwwwexamplecom

200

1036

Mozilla50 (Windows NT 100 Win64 x64)

Figure 2 Feature extraction based on log entry

4 Security and Communication Networks

sequence data with its faster convergence and the ability todetect long-term dependencies in data e concept wasoriginally proposed in 1997 by Hochreiter and Schmidhuber[25] e emergence of LSTM is mainly to solve the problemof gradient disappearance and gradient explosion in longsequence training Compared with traditional RNN wherethere is only one delivery state the LSTM has two deliverystates namely the long-term state c(t) and the short-termstate h(t) Besides it introduces three ldquogatesrdquo to control thelong-term state as illustrated in Figure 5

(i) Forget gate f(t) control which of the long-termstates of last moment c(tminus 1) should be discarded

(ii) Input gate i(t) control which parts of the currenttime input x(t) and the last short-term state h(tminus 1)will be added to the long-term state

(iii) Output gate o(t) control which parts of the currentlong-term state c(t) should be output as h(t)

119393195 -- [24Mar2018062552 - 0400] ldquoGET HTTP11rdquo 200 13108 ldquo-rdquo ldquoMozilla50 (Windows NT 100Win64 x64) AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018062553 - 0400] ldquoGETphotonopng HTTP11rdquo 200 4898 ldquohttps examplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195--[24Mar2018062555 - 0400] ldquoGETimgrssgif HTTP11rdquo 200 1036 ldquohttpsexamplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018101129 - 0400] ldquoGETbook HTTP11rdquo 200 13125 ldquo-rdquo ldquoMozilla50 (WindowsNT 100 Win64 x64) AppleWebKit53736 (KHTML

like Gecko) Chrome6503325181 Safari53736rdquo

Raw data

photonopng

imgrssgif

book

Path

ndash1

2

3

2

Encoded path

Figure 3 e process of encoding all paths for a session

Table 1 Session transformation

Log messages Feature vectorLog entry0 [bytes0 method0 path0 referrer0 statuscode0 t1 minus t0]Log entry1 [bytes1 method1 path1 referrer1 statuscode1 t2 minus t1]Log entry2 [bytes2 method2 path2 referrer2 statuscode2 t3 minus t2]⋮ ⋮Log entryn [bytesnmethodn pathn referrern statuscoden 0]

Statistical analysisof time interval

Division

Division

Originsequence

Rough sessions

Session 1

Session 2

Precise sessions

Session 3

Session 4

IP addressuser-agent

Figure 4 Schematic diagram of session identication

LSTM block

C(tminus1) C(t) h(t)

g(t)

h(tminus1) x(t)

Main layer for analyzingthe current input x(t) andthe previous short-term

state h(tminus1)

Figure 5 e core of LSTM

Security and Communication Networks 5

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 4: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

and the relative relationship cannot be discriminated fromthe second le the access to the same le is marked as 0 andwhen accessing dicopyerent les the distance between dicopyerentles is calculated by the number of directory switches plusone to encodee process of encoding all paths for a sessionis shown in Figure 3 When all the feature elds are pro-cessed a session will transform into a sequence of features asillustrated in Table 1 Finally the feature vector we get will besix-dimensional including byte eld method eld path eldreferrer eld status code eld and time interval

332 Session Identication In the previous section we justmade a rough distinction between the sessions in every log letreating a visitor as a session Butmore often a visitor can accessat dicopyerent times and generate multiple sessionsus we use amore scientic statistical method for session identication

We calculate and count the time interval between entriesin every session which is roughly generated in the previoussection After counting the time intervals in all sessions we

sort all of them and explore the most appropriate quantile asa threshold We compare every time interval to a thresholdand if the time interval is greater than the threshold thesession will be subdivided into two smaller sessions and soforth e whole process is illustrated in Figure 4

When all the data have been processed the framework willcheck all the sessions and delete those sessions that containonly one entry as it assumes that if a user has only one access itis highly unlikely that there is a webshell communication

34 Classication Model Webshell detection is a two-classtask and because of variable length serialization in the logsequence we choose long short-term memory and thehidden Markov model to construct the classication modelrespectively

341 Long Short-Term Memory Long short-term memorynetworks are well suited to classify and process based on

Normal Abnormal

Originaldata

Data collection

Sequence

Featureextraction

IP address

User agent

Rough division

Precise division

Classificationmodel

Traintestdataset

LSTM

Normal Webshell

ClassifyHMM

Time interval

Statisticalanalysis

Figure 1 e architecture of the proposed method

119393195 -- [24Mar2018062553 ndash 0400] ldquoGET imgrssgif HTTP11rdquo 200 1036 ldquohttpswwwexamplecomrdquoldquoMozilla50(Windows NT 100 Win64 x64)rdquo

Host

Path

Time

Method

Referrer

Status

Bytes

User agent

119393195

imgrssgif

24Mar2018062553

GET

httpwwwexamplecom

200

1036

Mozilla50 (Windows NT 100 Win64 x64)

Figure 2 Feature extraction based on log entry

4 Security and Communication Networks

sequence data with its faster convergence and the ability todetect long-term dependencies in data e concept wasoriginally proposed in 1997 by Hochreiter and Schmidhuber[25] e emergence of LSTM is mainly to solve the problemof gradient disappearance and gradient explosion in longsequence training Compared with traditional RNN wherethere is only one delivery state the LSTM has two deliverystates namely the long-term state c(t) and the short-termstate h(t) Besides it introduces three ldquogatesrdquo to control thelong-term state as illustrated in Figure 5

(i) Forget gate f(t) control which of the long-termstates of last moment c(tminus 1) should be discarded

(ii) Input gate i(t) control which parts of the currenttime input x(t) and the last short-term state h(tminus 1)will be added to the long-term state

(iii) Output gate o(t) control which parts of the currentlong-term state c(t) should be output as h(t)

119393195 -- [24Mar2018062552 - 0400] ldquoGET HTTP11rdquo 200 13108 ldquo-rdquo ldquoMozilla50 (Windows NT 100Win64 x64) AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018062553 - 0400] ldquoGETphotonopng HTTP11rdquo 200 4898 ldquohttps examplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195--[24Mar2018062555 - 0400] ldquoGETimgrssgif HTTP11rdquo 200 1036 ldquohttpsexamplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018101129 - 0400] ldquoGETbook HTTP11rdquo 200 13125 ldquo-rdquo ldquoMozilla50 (WindowsNT 100 Win64 x64) AppleWebKit53736 (KHTML

like Gecko) Chrome6503325181 Safari53736rdquo

Raw data

photonopng

imgrssgif

book

Path

ndash1

2

3

2

Encoded path

Figure 3 e process of encoding all paths for a session

Table 1 Session transformation

Log messages Feature vectorLog entry0 [bytes0 method0 path0 referrer0 statuscode0 t1 minus t0]Log entry1 [bytes1 method1 path1 referrer1 statuscode1 t2 minus t1]Log entry2 [bytes2 method2 path2 referrer2 statuscode2 t3 minus t2]⋮ ⋮Log entryn [bytesnmethodn pathn referrern statuscoden 0]

Statistical analysisof time interval

Division

Division

Originsequence

Rough sessions

Session 1

Session 2

Precise sessions

Session 3

Session 4

IP addressuser-agent

Figure 4 Schematic diagram of session identication

LSTM block

C(tminus1) C(t) h(t)

g(t)

h(tminus1) x(t)

Main layer for analyzingthe current input x(t) andthe previous short-term

state h(tminus1)

Figure 5 e core of LSTM

Security and Communication Networks 5

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 5: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

sequence data with its faster convergence and the ability todetect long-term dependencies in data e concept wasoriginally proposed in 1997 by Hochreiter and Schmidhuber[25] e emergence of LSTM is mainly to solve the problemof gradient disappearance and gradient explosion in longsequence training Compared with traditional RNN wherethere is only one delivery state the LSTM has two deliverystates namely the long-term state c(t) and the short-termstate h(t) Besides it introduces three ldquogatesrdquo to control thelong-term state as illustrated in Figure 5

(i) Forget gate f(t) control which of the long-termstates of last moment c(tminus 1) should be discarded

(ii) Input gate i(t) control which parts of the currenttime input x(t) and the last short-term state h(tminus 1)will be added to the long-term state

(iii) Output gate o(t) control which parts of the currentlong-term state c(t) should be output as h(t)

119393195 -- [24Mar2018062552 - 0400] ldquoGET HTTP11rdquo 200 13108 ldquo-rdquo ldquoMozilla50 (Windows NT 100Win64 x64) AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018062553 - 0400] ldquoGETphotonopng HTTP11rdquo 200 4898 ldquohttps examplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195--[24Mar2018062555 - 0400] ldquoGETimgrssgif HTTP11rdquo 200 1036 ldquohttpsexamplecomrdquo

ldquoMozilla50 (Windows NT 100 Win64 x64)AppleWebKit53736 (KHTML like Gecko)

Chrome6503325181 Safari53736rdquo

119393195 -- [24Mar2018101129 - 0400] ldquoGETbook HTTP11rdquo 200 13125 ldquo-rdquo ldquoMozilla50 (WindowsNT 100 Win64 x64) AppleWebKit53736 (KHTML

like Gecko) Chrome6503325181 Safari53736rdquo

Raw data

photonopng

imgrssgif

book

Path

ndash1

2

3

2

Encoded path

Figure 3 e process of encoding all paths for a session

Table 1 Session transformation

Log messages Feature vectorLog entry0 [bytes0 method0 path0 referrer0 statuscode0 t1 minus t0]Log entry1 [bytes1 method1 path1 referrer1 statuscode1 t2 minus t1]Log entry2 [bytes2 method2 path2 referrer2 statuscode2 t3 minus t2]⋮ ⋮Log entryn [bytesnmethodn pathn referrern statuscoden 0]

Statistical analysisof time interval

Division

Division

Originsequence

Rough sessions

Session 1

Session 2

Precise sessions

Session 3

Session 4

IP addressuser-agent

Figure 4 Schematic diagram of session identication

LSTM block

C(tminus1) C(t) h(t)

g(t)

h(tminus1) x(t)

Main layer for analyzingthe current input x(t) andthe previous short-term

state h(tminus1)

Figure 5 e core of LSTM

Security and Communication Networks 5

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 6: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

e model which called SB-LSTM (session-based longshort-term memory) consists of four LSTM layers each ofwhich has sixteen neurons and a dense layer e LSTMlayer will learn the feature of variable length sequenceswhich are activated by tanh function and logistic functione dense layer acts as a ldquoclassifierrdquo throughout the LSTMwhich can map the learned ldquodistributed feature represen-tationrdquo to the sample tag space At last the detection modelused categorical cross entropy as the loss function and theoptimizer is Adam

In more detail the input data size is m times n times 6 m is thenumber of sessions and n is the maximum length of allsessions Current long-term state c(t) and output y(t) arecomputed as follows

g(t) tanh WTxg middot x(t) + W

Thg middot h(tminus 1) + bg1113872 1113873

c(t) f(t) otimes c(tminus 1) + i(t) otimesg(t)

y(t) h(t) o(t) otimes tanh c(t)1113872 1113873

(1)

where Wxg denotes the weight matrix of the main layerconnected to the input vector x(t) of size 1times 6 while Whg

denotes the weight matrix of the main layer connected to theprevious short-term state h(tminus 1) bg represents the coefficientof variation for the main layer

e architecture of this model for a session is depicted inFigure 6

342 Hidden Markov Model e hidden Markov model(HMM) is a statistical Markov model a highly effectivemeans of modeling a family of unaligned sequences [26 27]which describes a hidden Markov chain stochasticallygenerating unobservable state sequences and then gener-ating a stochastical observation sequence from each state Asits most remarkable features the state at any time t dependsonly on the state of the previous moment while it is in-dependent of the time t the state and the observation atother times Besides it also assumes that observations at anytime depend only on the state at that moment independentof other observations and states

In this paper we mainly focus on supervised learningmethod of the hidden Markov chain e train data includeobservation sequences O and corresponding state sequencesI which can be written as

O1 I1( 1113857 O2 I2( 1113857 Os Is( 11138571113864 1113865 (2)

We can use the maximum likelihood estimation methodto estimate the parameters of the hidden Markov model

(1) We assume that the sample is in the state i at time tand the frequency of transition to state j at time t + i

is Aij and then the probability estimate of thetransition state is computed as follows

1113955aij Aij

1113936Nj1Aij

i 1 2 N j 1 2 N (3)

(2) We assume that the sample state is j and the fre-quency of observation k is Bjk and then the

probability of observation k when the state is j can bewritten as

1113955bjk Bjk

1113936Mk1Bjk

j 1 2 N k 1 2 M (4)

(3) e initial state probability π is estimated as thefrequency of the initial state of i in the S samplesearchitecture of this model is illustrated in Figure 7

In more detail the input data size is x times 6 and x is thenumber of entries in all sessions

4 Experiment and Evaluation

In this section we will introduce the detailed composition ofthe dataset the experiment details and the results

41 Dataset ere are two datasets for our experimentsone for the model training and the other for the model testAs for the first dataset a honeypot [28 29] was built tocapture and analyze webshell attacks Tens of testers wereinvited to attack this honeypot One word Trojans smallTrojans and big Trojans which were coded in differentprogram languages were provided to these testers Wecollected a total of 13986 log entries from the honeypot asnegative samples while 57160 logs entries were collectedfrom real-world websites as positive samples In the processof training we also used a 10-folder cross validation toperform a simple verification of the model As for the latterone a real-world website which contains a total of 2324 filesand 248 k lines of log entries in total was utilized to testSince the log entries with webshell communication in thereal environment may only account for a few percents oreven a few thousandths in the log file we control thepositive and negative sample ratio of the dataset to bearound 6 1 in experiments After feature vector extractionand session identification all log entries become in-dependent sessions Every session consists of six sequenceswhich are the byte sequence of the HTTP response themethod sequence the path sequence that is encoded by thedegree of association the referrer sequence that is dividedinto different cases for encoding the status code sequenceand the time interval sequence Figure 8 is an overview ofour model

42 Experiment Design To evaluate the performance of ourmodel we first explored the effects of different quantiles asthresholds and selected the best values for next experimentsen we validated the effectiveness of the session identi-fication method based on the time interval and quantile Inaddition we compare long short-term memory with thehidden Markov models which are commonly used in se-quence data processing and find the one with better per-formance Finally we leverage a real-world website to testthe SB-LSTM model and compare it with results from otherresearch

6 Security and Communication Networks

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

All the experiments were performed in a PC machinewith an i7-7700HQ processor 16GB of memory and aGeForce GTX 1060 GPU which has 6GB of memory eSB-LSTMmodel is implemented by Tensormacrow [30] and theHHM is implemented by seqlearn [31]

Before using the raw data to build the model we did astandardization to make all the data appropriate for theneural network Data were processed according to the fol-lowing steps

(i) Since we need to input a 3-dimensional matrix whenconstructing the SB-LSTM model we chose Max-Min scaling normalization to reduce the ecopyect ofpadding 0 when normalizing After scaling all data

are between 0 and 1 e function of doing Max-Min scaling normalization is

xlowast x minus min

max minus min (5)

(ii) e label of all the training data is stored in a list 0for normal access and 1 for webshell access

(iii) When we used long short-term memory to buildour model we must make all data to become amatrix so we padded the length of the sequence to zwhich is the maximum session length Besides weheld a list to show all the valid length (originallength) of the data to ensure the number of itera-tions in LSTM

(iv) When we used the hidden Markov model to buildour model we only need to put all the vectors into asequence because of the input data size of thehidden Markov model

We use 10-folder cross validation [32] e dataset isequally divided into 10 subsets each of which is tested onceand the rest as a training set e cross validation is repeated10 times one subset is selected each time as a test set and theaverage cross validation recognition accuracy rate of 10times is taken as a resulte dataset used for the experimentis unbalanced so we choose accuracy recall precision F1score [33] the receiver operating characteristics (ROC)curves [34] and area under curve (AUC) measure forevaluating the proposed method Besides we can visualizethe relation between TPR and FPR of a classier eseindicators can be expressed as

Normal Webshell

y1 y2 yn

b11 b12 b1nb21

b22

b2n

a12

a21

Figure 7 e architecture of HMM

Next

Predict

h40

h30

h20

LSTM layer 1

LSTM layer 2

LSTM layer 3

LSTM layer 4

ForwardDropoutBackward

Y1

h10

Input data

h4n

Dense

h41

h31

h21

h11 LSTM layer 1

LSTM layer 2

Y2

X3

h42

h32

h22

h12

X2

X2X1

(Possibility 1 possibility 2)

LSTM layer 4

LSTM layer 3

Figure 6 e architecture of LSTM

Security and Communication Networks 7

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

Accuracy 1

|C|1113944

|C|

i1

TPi + TNi

TPi + TNi + FPi + FNi

Recall 1

|C|1113944

|C|

i1

TPi

TPi + FNi

Precision 1

|C|1113944

|C|

i1

TPi

TPi + FPi

F1 score 2precision times recallprecision + recall

AUC 1113936iisinpositiveclassranki minus (M(1 + M)2)

M times N

(6)

where M is the number of positive samples and N is thenumber of negative samples e score indicates theprobability that each test sample belongs to a positivesample while rank is a positive sample set sorted in thedescending order based on score

43 Experiment Results First we explored the influence ofdifferent thresholds in session identification e thresholdswere set to 55 60 65 70 75 80 85 and 90quantile respectively e comparison results are shown inTable 2

As is shown in Table 2 performance is improved with anincreasing threshold in a certain range and reaches thehighest value when the threshold is 70 Moreover whenthe performance reaches the highest it will continuouslydecrease with a rising threshold Excessive thresholds candecrease the performance of the model because two accesses

that were not originally part of the same session are placed inthe same session In comparison with the threshold of 55to 90 we find the appropriate threshold is 70 and we useit for the next experiments

en we validated the effectiveness of session identifi-cation based on the time interval and the quantile Wecounted the number of access sessions for every user wherethe user is identified by the IP field and the user-agent fielde top 100 users with the largest number of sessions areshown in Figure 9

As is illustrated in Figure 9 most users have multiplesessions up to a maximum of 166 erefore it is quitenecessary for us to conduct a more detailed sessionidentification

In addition we compare the performance of two ma-chine learning models e comparisons are presented inTable 3

As is shown in Table 3 SB-LSTM has a much betterperformance than the HHM-based model e recall rate ofthe HMM-based model under such dataset training is closeto 0 e reason that the HMM assumes that the state at any

B0M0P0R0S0T0

B1M1P1R1S1T1

B2M2P2R2S2T2

BnMnPnRnSnTn

Session content

Sessiondata

Test

Train

Train

Model

Trainedmodel

B bytesM methodP path

R referrerS status codeT time interval

Figure 8 Overview of webshell detection with our model

Table 2 e influence of different thresholds in sessionidentification

reshold () Accuracy Recall Precision F1 score55 0932 08624 08757 0869060 09364 0869 08958 0882265 09598 09568 08977 0926370 09597 09615 09036 0931775 09696 09368 09009 0918580 09376 09401 08518 0893785 08974 07228 06932 0707790 08779 07186 06418 06780

8 Security and Communication Networks

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 9: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

time depends only on the state of the previous momentmight lead to the low recall Besides the HMM-based modelalso has a much lower accuracy than SB-LSTM e ROC isshown in Figure 10 e curve shows the SB-LSTM modelcould achieve an ecopyective and accurate result in which thetrue positive rate could reach 08 when the false positive rateonly reaches 001

Afterwards we found the webshell communication andwe can easily nd the webshell by counting all the accesspaths of the session Besides if the administrator is familiarwith the les included in website dictionary hemay not needto perform statistics to nd the webshell

At last we leverage a real-world website to test the SB-LSTM model and compare it with results from other re-search We rst explored the inmacruence of the dicopyerent sizesof logs on the modelrsquos runtimee experiment used two logles including a recent one and the largest one the websitecan provide Besides we ran this experiment three times andgot the average as results e results are presented in Ta-ble 4 en we compared these to the result of the citedrelated work We downloaded all the source les of thewebsite and reproduced a cited related work which is a le-based detection method For the purpose of maximizing theexperimental results we used the largest log the website canprovide which records all access to the website for nearly ayear and a half to compare We do the same things in termsof the nal results and the comparison results are shown inTable 5

As is illustrated in Table 4 when the size of the log isquite small SB-LSTM can get the result of detection veryquickly which is ecient for detecting whether there iswebshell communication for a certain period of time

0

20

40

60

80

100

120

140

160

180U

ser1

Use

r7

Use

r13

Use

r19

Use

r25

Use

r31

Use

r37

Use

r43

Use

r49

Use

r55

Use

r61

Use

r67

Use

r73

Use

r79

Use

r85

Use

r91

Use

r97

Figure 9 e top 100 users with the largest number of sessions

Table 3 Comparison of two machine learning methods

Model name Accuracy () Recall ()SB-LSTM 9597 9615HMM-based model [31] 6827 000

10

08

06

04

02

000000 0025 0050 0075 0100 0125 0150 0175 0200

Figure 10 ROC curve based on SB-LSTM

Table 4 e inmacruence of the dicopyerent sizes of logs

Model name Sizes Entries Time

SB-LSTM 177 kB 908 05897 s466MB 248433 287127 s

Table 5 Comparison of dicopyerent detection methods

Model name Sizesamount TimeSB-LSTM 466MB 287127 sFRF-WD [20] 2324 398415 s

Security and Communication Networks 9

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 10: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

As revealed in Table 5 even if we use all the log entriessince the website was built the runtime required for ana-lyzing is 279 less than the result of the cited related work

Compared with the results of the first experiment itdemonstrated that the SB-LSTM model is efficient fordetecting whether there is webshell communication for acertain period of time as it requires to scan all the sourcefiles in the common approach while the only input for theSB-LSTM model is the recent log

5 Conclusion

In the previous studies a huge number of features wereextracted from webshell and these features were regarded askeywords to judge whether there is a webshell or notHowever it is almost certain that these approaches would beless useful with the gradual development of encryption andconfusion technology is paper mainly focuses on thechallenge that detects webshell out of itself Instead ofleveraging POST contents source file codes or receivingtraffic the framework we proposed uses sessions generatedfrom the websitersquos logs which highly reduces the cost of timeand space but maintains a high recall rate and accuracyFeatures were extracted in raw sequence data in the web logsand a statistical method was applied to identify sessionsprecisely e results of experiments show that 70 quantilecan be the right threshold that makes the model obtain thehighest accuracy and recall rate and the long short-termmemory which can achieve a high accuracy rate of 9597with a recall rate of 9615 has much better performancethan the hidden Markov model on webshell detectionMoreover the experiment demonstrated the high efficiencyof the proposed approach in terms of the runtime as it takes985 less time than the cited related approach to get theresults In order to be closer to the real-world applicationthe model can be employed to identify webshell files after thewebshell communication is detected by using a statisticalmethod

Data Availability

eWeb logs data used to support the findings of this studyhave not been made available because it was extracted fromreal websites and contains many sensitive information

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is work was supported in part by the Fundamental ResearchFunds for the Central Universities and the Sichuan UniversityPostdoc Research Foundation under Grant 19XJ0002

References

[1] CNCERT weekly reportrdquo 2019 httpwwwcertorgcnpublishenglishuploadFileWeekly20Report20of20CNCERT-Issue207202019pdf

[2] D H Tony Lee and I Ahl ldquoBreaking down the China chopperweb shell-part Irdquo 2019 httpswwwfireeyecomblogthreat-research201308breaking-down-the-china-chopper-web-shell-part-ihtml

[3] B Yong X Liu Y Liu H Yin L Huang and Q Zhou ldquoWebbehavior detection based on deep neural networkrdquo in Pro-ceedings of the IEEE SmartWorld Ubiquitous Intelligence ampComputing Advanced amp Trusted Computing Scalable Com-puting amp Communications Cloud amp Big Data ComputingInternet of People and Smart City Innovation (SmartWorldSCALCOMUICATCCBDComIOPSCI) pp 1911ndash1916Hong Kong China July 2018

[4] L Y Deng D L Lee Y-H Chen and L X Yann ldquoLexicalanalysis for the webshell attacksrdquo in Proceedings of the 2016International Symposium on Computer Consumer and Con-trol (IS3C) pp 579ndash582 Xirsquoan China July 2016

[5] V-G Le H-T Nguyen D-N Lu and N-H Nguyen ldquoAsolution for automatically malicious web shell and web ap-plication vulnerability detectionrdquo in Proceedings of the In-ternational Conference on Computational CollectiveIntelligence pp 367ndash378 Halkidiki Greece September 2016

[6] J Wang Z Zhou and J Chen ldquoEvaluating CNN and LSTMfor web attack detectionrdquo in Proceedings of the 2018 10thInternational Conference on Machine Learning and Com-puting pp 283ndash287 Macau China February 2018

[7] W Yang B Sun and B Cui ldquoA webshell detection tech-nology based on http traffic analysisrdquo Innovative Mobile andInternet Services in Ubiquitous Computing in Proceedings ofthe International Conference on Innovative Mobile and In-ternet Services in Ubiquitous Computing pp 336ndash342MatsueJapan July 2018

[8] J Kim D-H Yoo H Jang and K Jeong ldquoWebshark 10 abenchmark collection for malicious web shell detectionrdquo JIPSvol 11 no 2 pp 229ndash238 2015

[9] P M Wrench and B V Irwin ldquoTowards a php webshelltaxonomy using deobfuscation-assisted similarity analysisrdquo inProceedings of the 2015 Information Security for South Africa(ISSA) pp 1ndash8 Johannesburg South Africa July 2015

[10] ZMeng R Mei T Zhang andW-PWen ldquoResearch of linuxwebshell detection based on SVM classifierrdquo Netinfo Securityvol 5 pp 5ndash9 2014

[11] O Starov J Dahse S S Ahmad T Holz and N NikiforakisldquoNo honor among thieves a large-scale analysis of maliciousweb shellsrdquo in Proceedings of the 25th International Confer-ence on World Wide Web pp 1021ndash1032 Montreal CanadaApril 2016

[12] S Josefsson ldquoe base16 base32 and base64 data encodingsrdquo2009 httpwwwhjpatdocrfcrfc3548html

[13] Z-H Lv H-B Yan and R Mei ldquoAutomatic and accuratedetection of webshell based on convolutional neural net-workrdquo in Proceedings of the China Cyber Security AnnualConference pp 73ndash85 Beijing China August 2018

[14] Compromised web servers and web shells-threat awarenessand guidancerdquo 2017 httpswwwus-certgovncasalertsTA15-314A

[15] T D Tu C Guang G Xiaojun and P Wubin ldquoWebshelldetection techniques in web applicationsrdquo in Proceedings of theFifth International Conference on Computing Communicationsand Networking Technologies (ICCCNT) pp 1ndash7 Hefei China2014

[16] Y Tian J Wang Z Zhou and S Zhou ldquoCNN-webshellmalicious web shell detection with convolutional neuralnetworkrdquo in Proceedings of the 2017 VI International

10 Security and Communication Networks

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 11: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

Conference on Network Communication and Computingpp 75ndash79 Kunming China December 2017

[17] T Walkowiak S Datko and H Maciejewski ldquoBag-of-wordsbag-of-topics and word-to-vec based subject classification oftext documents in polish-a comparative studyrdquo Contempo-rary Complex Systems and eir Dependability pp 526ndash5352018

[18] D H Hubel and T NWiesel ldquoReceptive fields and functionalarchitecture of monkey striate cortexrdquo e Journal of Phys-iology vol 195 no 1 pp 215ndash243 1968

[19] X Sun X Lu and H Dai ldquoA matrix decomposition basedwebshell detection methodrdquo in Proceedings of the 2017 In-ternational Conference on Cryptography Security and Privacypp 66ndash70 Wuhan China March 2017

[20] Y Fang Y Qiu L Liu and C Huang ldquoDetecting webshellbased on random forest with fasttextrdquo in Proceedings of the2018 International Conference on Computing and ArtificialIntelligence pp 52ndash56 Chengdu China March 2018

[21] Package Information [ebol] 2019 httpspeclphpnetpackagevld

[22] H Zhang H Guan H Yan et al ldquoWebshell traffic detectionwith character-level features based on deep learningrdquo IEEEAccess vol 6 pp 75268ndash75277 2018

[23] L Shi and Y Fang ldquoWebshell detection method researchbased on web logrdquo Journal of Information Security Researchvol 1 p 11 2016

[24] R Sasi ldquoWeb backdoors-attack evasion and detectionrdquo inProceedings of the C0C0N Sec Conference pp 989ndash1003Cochin India 2011

[25] J Schmidhuber and S Hochreiter ldquoLong short-term mem-oryrdquo Neural Computation vol 9 no 8 pp 1735ndash1780 1997

[26] R Hughey and A Krogh ldquoHidden markov models for se-quence analysis extension and analysis of the basic methodrdquoBioinformatics vol 12 no 2 pp 95ndash107 1996

[27] B Schuller G Rigoll and M Lang ldquoHidden markov model-based speech emotion recognitionrdquo in Proceedings of the 2003IEEE International Conference on Acoustics Speech andSignal Processing Atlanta GA USA April 2003

[28] I Kuwatly M Sraj Z Al Masri and H Artail ldquoA dynamichoneypot design for intrusion detectionrdquo in Proceedings of theIEEEACS International Conference on Pervasive ServicesLebanon July 2004

[29] C Kreibich and J Crowcroft ldquoHoneycombrdquo ACM SIG-COMM Computer Communication Review vol 34 no 1pp 51ndash56 2004

[30] TensorFlowrdquo 2019 httpswwwtensorfloworg[31] L larsmans ldquoseqlearnrdquo 2016 httpsgithubcomlarsmans

seqlearn[32] R Kohavi et al ldquoA study of cross-validation and bootstrap for

accuracy estimation and model selectionrdquo IJCAI vol 14no 2 pp 1137ndash1145 1995

[33] C Goutte and E Gaussier ldquoA probabilistic interpretation ofprecision recall and f-score with implication for evaluationLecture Notes in Computer Sciencerdquo in Proceedings of theEuropean Conference on Information Retrieval pp 345ndash359Santiago de Compostela Spain March 2005

[34] T Fawcett ldquoAn introduction to ROC analysisrdquo PatternRecognition Letters vol 27 no 8 pp 861ndash874 2006

Security and Communication Networks 11

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: Research Article - Hindawi Publishing Corporationdownloads.hindawi.com/journals/scn/2019/3093809.pdf · As a web service-based backdoor program, webshell is installed through vul-

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom