Hindawi
Security and Communication Networks
Volume 2019, Article ID 1901548, 21 pages
https://doi.org/10.1155/2019/1901548

Research Article
CBR-Based Decision Support Methodology for Cybercrime Investigation: Focused on the Data-Driven Website Defacement Analysis

Mee Lan Han, Byung Il Kwak, and Huy Kang Kim

Graduate School of Information Security, Korea University, Seoul, Republic of Korea

Correspondence should be addressed to Huy Kang Kim; cenda@korea.ac.kr

Received 25 March 2019; Revised 21 August 2019; Accepted 2 December 2019; Published 20 December 2019

Guest Editor: Jungwoo Ryoo

Copyright © 2019 Mee Lan Han et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Criminal profiling is a useful technique to identify the most plausible suspects based on the evidence discovered at the crime scene. Similar to offline criminal profiling, in-depth profiling for cybercrime investigation is useful in analysing cyberattacks and for speculating on the identities of the criminals. Every cybercrime committed by the same hacker or hacking group has unique traits such as attack purpose, attack methods, and target. These unique traits are revealed in the evidence of cybercrime; in some cases, they are so well hidden in the evidence that they cannot be easily perceived. Therefore, a complete analysis of several factors concerning cybercrime can provide an investigator with concrete evidence to attribute the attacks, narrow down the scope of the criminal data, and ultimately apprehend the criminals. We herein propose a decision support methodology based on case-based reasoning (CBR) for cybercrime investigation. This study focuses on the massive data-driven analysis of website defacement. Our primary aim in this study is to demonstrate the practicality of the proposed methodology as a proof of concept. The assessment of website defacement was performed through the similarity measure and the clustering processing in the CBR-based reasoning engine. Our results show that the proposed methodology, which focuses on the investigation, enables a better understanding and interpretation of website defacement and assists in inferring the hacker's behavioural traits from the available evidence concerning website defacement. The results of the case studies demonstrate that our proposed methodology is beneficial for understanding the behaviour and motivation of the hacker and that our proposed data-driven analytic methodology can be utilized as a decision support system for cybercrime investigation.

1. Introduction

Advanced persistent threat (APT) attacks, stealthily and continuously controlled by hackers or hacking groups targeting a specific entity, remain a challenging threat, particularly to companies or organizations that handle sensitive funding and information. When successful, APT attacks can have a catastrophic impact on critical infrastructures, such as banking, broadcasting systems, and mass media sites. This impact is not speculative or theoretical; in fact, it is supported by various real-world incidents and actual attacks. For instance, in February 2016, a group of hackers stole $81 million from the Central Bank of Bangladesh through its account at the Federal Reserve Bank of New York by means of an APT attack that persistently targeted the SWIFT payment system for a year [1]. Furthermore, in May 2017, the WannaCry ransomware, another type of APT attack, spread owing to a vulnerability in the Microsoft Server Message Block (SMB; the message format used to share folders, files, and so on in the Microsoft Windows OS). This attack caused catastrophic consequences, such as a standstill and disruption of online work in hospitals, companies, and several government agencies. According to Symantec's 2017 annual report [2], the SWIFT case and the WannaCry ransomware case were possibly launched by the Lazarus group, which may be affiliated with the DarkSeoul (DS) case in 2013 and the Sony Pictures Entertainment (SPE) case in 2014. Symantec found that the hacking skills in the SWIFT case were very similar to those used by the Lazarus group, presumably one of North
Korea's state-sponsored hacking groups; the report also found that the malware in the WannaCry ransomware case was related to that used by the Lazarus group [3]. In the Operation Blockbuster report released by Novetta in 2016, the Lazarus group's malware was reported to fall, hypothetically, into two basic classes: the wipers and the DDoS malware [4]. The noticeable features of these attacks underpin our interest in the Lazarus group's attacks related to the DS case in 2013 and the SPE case in 2014.

On March 20, 2013, in the DS case, the attack destroyed approximately 48,700 computerized and networked equipment items, such as PCs, servers, and network devices, of major banks and TV broadcasters in South Korea. South Korea suffered a coordinated strike by a simple but very effective and destructive malware called Wiper A, B, and C [5]. In certain Windows OS environments, the wiper scripts attempted to remove any directories after attempting to overwrite each file with a specific string pattern (i.e., "HASTATI", "PRINCIPES", or "PRNCPES") [6, 7]. In another incident initiated by the Lazarus group, SPE was hacked by the self-named Guardians of Peace (GOP) hacker group. Several malware analysis groups reported that the GOP attack was also related to the North Korean cyber army [8]. The malware used in this attack contained strings written using the Romanization of Korean words (i.e., Korean words spelled in Latin letters following the English pronunciation). Of note, while the Korean language as spoken in North Korea and South Korea is linguistically identical, there are several important differences in terms of vowels and consonants, phonetic notation, and word spacing [9]. In the aforementioned case, the Romanized words captured in the malware included various contemporary North Korean words.

From 2009 to 2017, along with the attacks mentioned above, the Lazarus group launched many other attacks (see Figure 1 for further details).

There have been numerous attempts in industry and academia to profile hackers and to handle attack incidents. These approaches can be categorized into three types: human-centric analysis, malware-centric analysis, and case-centric analysis. The human-centric approach focuses on hacker network analysis. Known hacker activities (e.g., message postings and discussions) in hacker communities provide clues for identifying key actors by their reputation. In addition, this approach can classify the tendency of a hacker based on social networking methods [10, 11]. Unlike the human-centric analysis, the malware-centric approach primarily assumes that the same malware and its variants are developed by the same or closely related hacker groups. Features such as the API call sequence and control flow, among others, can be used to estimate the similarity between newly detected malware and known malware [12–14]. In fact, many previous studies on hacker profiling have primarily focused on information derived from the analysis of the malware itself. While malware analysis can provide information about a malware's functionality and its similarity with previously known malware families, tracing and analysing hacker information through malware-centric analysis has the limitation that the core information can be circumvented. The last approach is the case-centric analysis, and our methodology falls into this category. Overall, only a few proposals apply traditional investigation methods, such as criminal profiling, to cyber incident investigation; however, many systematic approaches are currently under development. From the viewpoint of cyber intelligence analysis, the case-centric analysis has the advantage of making it possible to understand the purpose of attack campaigns; as with other methods of analysis, it is important to build profiles of attackers. When performed successfully, such characterizations can facilitate estimating and predicting the attacker's next target in advance.

Based on this insight, the present study proposes a CBR-based decision support methodology for cybercrime investigation. In terms of data, website defacement cases that occurred between 1998 and 2015 were retrieved from the public archival site zone-h.org (an archive of defaced websites, http://www.zone-h.org). After crawling web resources of the Hypertext Markup Language (HTML) type, data preprocessing (data parsing and data cleaning) was performed to amend incomplete, improperly formatted, or duplicate data records. The case vector was designed to intuitively express the defaced website cases collected from the public archival site. The reasoning engine can begin its main work only after the data preprocessing and the case vector design are complete. The CBR-based similarity measurement was then performed on the cases, and the clustering algorithm was applied to group the abstracted crime cases into classes of similar cases. Based on the results concerning the DS and SPE cases, we evaluated the performance of the framework for cybercrime investigation through the similarity measurement and the clustering algorithm. The results demonstrated that the proposed methodology can be used as a decision support system (DSS) to obtain meaningful information about the most similar past cases and related hacker groups.

The main contributions of the present study are summarized as follows:

(i) We present a CBR-based decision support methodology for cybercrime investigation. With the proposed cybercrime investigation scheme, security analysts can find past attack cases that are most similar to a given attack and thus obtain insights to uncover the networks of cybercrime related to website defacement. To deliver high-value intelligence, we adopt a data-driven analytic system.

(ii) We demonstrate the clustering processing and visualization. The clustering processing enables an investigator to efficiently explore large data and interpret the results. Furthermore, the visualization helps an investigator to intuitively recognize crime patterns.

(iii) We show that it is possible to measure the similarity score and to perform the clustering algorithm by transforming unstructured data (i.e., web defacement cases) into calculable structured data.


(iv) We report case studies based on the real dataset gathered from the zone-h.org site to demonstrate the various aspects of our proposed algorithm. Finally, to foster further research, our dataset (the dataset for cybercrime investigation focused on the data-driven website defacement analysis, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling) is made publicly available [15, 16].

The rest of the paper is organized as follows. Section 2 provides a summary of the literature related to our work. The detailed methodology is described in Section 3. Section 4 reports the experimental results and analysis based on the case study. The limitations of our work and a discussion of the proposed approach are presented in Section 5. Finally, Section 6 concludes the paper and suggests directions for further research.

2. Related Work

In this section, we primarily highlight previous studies closely related to CBR and review two streams of literature on traditional criminal profiling and cybercrime profiling. We also elaborate on data mining-based cybercrime profiling. The review covers the following: (1) CBR studies that help to better understand our research context; (2) traditional criminal profiling and cybercrime profiling, which help to track down an elusive criminal or a concealed clue; and (3) data mining-based cybercrime profiling literature that can support and theoretically reinforce our methodology.

2.1. Case-Based Reasoning

CBR is a method that uses past experiences or cases to solve new problems. Even when the new problems are not exactly identical to the previous cases, CBR can suggest a partial solution to the new problems [17]. CBR can be categorized as a data-mining technique, as it can classify the given samples and predict the result for a new case. As case studies are intuitive and easily understood by humans, CBR has long been used in many fields, including customer technical support, medical case search, and legal case search. The general model of the four-step CBR process [18] is shown in Figure 2.

The four phases are as follows:

(i) Retrieve: given a new website defacement case, relevant cases are retrieved from the knowledge base to solve the case at hand.

(ii) Reuse: solutions from previous website defacement cases are mapped for reuse.

(iii) Revise: on mapping and testing previous solutions to the target case, the solutions are revised to consider the changes in the cases.

(iv) Retain: after the solution has been successfully adapted to the problem, a meaningful experience is stored as a new case in the knowledge base.
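The four phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Case` and `CaseBase` classes, the `overlap` metric (fraction of matching feature values), and the idea of storing an attributed group label as the "solution" are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    features: dict      # e.g. {"encoding": "euc-kr", "os": "Linux"}
    solution: str = ""  # hypothetical label, e.g. an attributed group


def overlap(a: dict, b: dict) -> float:
    """Fraction of feature keys on which two cases agree (placeholder metric).

    A key missing from one case is treated as the value None.
    """
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, query: dict, k: int = 1) -> list:
        """Retrieve: rank stored cases by similarity to the new case."""
        ranked = sorted(self.cases,
                        key=lambda c: overlap(query, c.features),
                        reverse=True)
        return ranked[:k]

    def reuse(self, retrieved: list) -> str:
        """Reuse: propose the solution of the best-matching past case."""
        return retrieved[0].solution if retrieved else ""

    def revise(self, proposed: str, correction=None) -> str:
        """Revise: let an analyst override the proposal after testing it."""
        return proposed if correction is None else correction

    def retain(self, query: dict, confirmed: str) -> None:
        """Retain: store the confirmed case for future retrieval."""
        self.cases.append(Case(dict(query), confirmed))
```

In use, a new defacement case would pass through `retrieve`, `reuse`, and `revise` in turn, and the confirmed result would be added with `retain` so that it becomes available to later queries.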

CBR starts with a given set of cases for training, forms generalizations of the given examples, and subsequently identifies the commonalities between the retrieved case and the target case. When applied to website defacement cases composed of descriptive and nominal data, it can effectively determine the commonality among the crawled hacking cases and quickly search for the nearest related case. Furthermore, CBR can be used to search for the most similar cases and retrieve past solutions from the latest response cases. CBR thus helps security administrators make better decisions. For example, Kim et al. proposed a DSS for incident response based on CBR [19].

CBR has been extensively used in several areas, such as management for product development, medicine, and engineering applications [20–22]. In addition, several CBR approaches are available for cyber incident profiling. For instance, Kim et al. proposed an intelligent system that can measure the similarity between past and new attacks. In their work, the authors demonstrated this capability in uncovering zero-day attacks using string similarity analysis of captured packet-level data [23]. Horsman et al. proposed the CBR-FT framework, a method for collecting and reusing past digital forensic investigation information to highlight likely evidential areas on a suspect operating system. It helps an investigator quickly and precisely decide where to search for evidence [24].

Figure 1: Timeline of the Lazarus group activities from 2009 to 2017. Malware shown on the timeline: Trojan.Dozer (July 2009), Backdoor.Prioxer (June 2010), Trojan.Koredos (March 2011), Backdoor.Prioxer.B (July 2012), Downloader.Castov (October 2012), Trojan.Jokra (March 2013), Infostealer.Castov and Downloader.Castov (May 2013), Trojan.Castov (June 2013), Backdoor.Destover (November 2014), Backdoor.Duuzer, W32.Brambul, and Backdoor.Joanap (October 2015), Trojan.Banswift (February 2016), Downloader.Ratabanka (February 2017), and WannaCry (May 2017). Campaign labels: DDoS attacks against South Korea; attacks on organizations in the US and South Korea; attacks against financial institutions and their customers in South Korea; attacks against banks and local broadcasting organizations in South Korea; the Sony attacks; manufacturing industry in South Korea targeted; South Korean organizations targeted again; SWIFT attacks; banks targeted again; WannaCry rocks the world.

2.2. Traditional Criminal Profiling and Cybercrime Profiling

Profiling is used in various sectors of society to investigate a criminal's mentality. Criminal profiling is a profiling technique for criminal investigation based on the psychological and behavioural patterns of a criminal [25, 26]. The criminal aspects and crime factors can be identified through evidence and insights into psychological and behavioural bias [27]. In criminology, the widely used profiling technique is called the modus operandi (MO). It describes a suspect's behaviour and the evidence elements of a crime; that is, it denotes how a suspect commits crimes. The MO changes based on the offender's criminal conduct and interaction with the surroundings, such as the time, date, and location of the crime. Moreover, it evolves based on how the offender reaches his/her victim [28, 29].

Relying only on traditional criminal profiling techniques and empirical knowledge, it is difficult for a cybercrime investigator to reduce errors in the investigative process and to untangle the complexity of a cybercrime. However, if the investigator is provided with sufficient information and detailed analysis data to understand the unclear motivation and the elusive pattern related to the cybercrime, they can infer the reason(s) for the crime at stake and produce both general and specific outlines of the criminal [26]. The cybercrime network and its characteristics can be important indicators for differentiating between key figures in cybercrime organizations and those of passing interest. In addition, the activity periods and message content patterns of the participants in an illegal community can help the investigator to carefully identify and scrutinize the key figures in the cybercrime network [30, 31]. By automating cybercrime profiling and data-mining analysis through a cross-analysis of various behavioural patterns, we can anticipate potential criminal activities and identify new profiles that pose serious threats to the community. Furthermore, data-mining methods such as entity extraction, clustering/classification techniques, and social network analysis make it possible to efficiently explore large data. Network visualization enables an investigator to intuitively recognize crime patterns [32–34].

In general, the accuracy of CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate [35]. Although the effectiveness of data-driven investigation can decrease owing to dynamic and fast-evolving crime patterns, understanding the hidden correlations and latent behaviour in such data using large-scale data analytic techniques is another promising research direction. Accordingly, many law enforcement agencies have been adopting future crime prediction systems based on statistics about weather, cleanliness, location, demographic distribution, education level, and wealth level. Based on the crime prevention through environmental design (CPTED) theory [36], many pieces of data correlated with crime are collected and analysed to estimate the crime probability. However, while many data-driven approaches to support traditional criminal profiling are available, only a few research efforts have focused on cybercrime profiling.

2.3. Data-Driven Cybercrime Profiling

In addition to the traditional criminal profiling for offline crime investigations, various profiling techniques have been developed; these techniques assume that cybercriminals also show similar behavioural and psychological characteristics. Owing to recent advances in data-mining and machine-learning algorithms, many studies regarding criminal pattern detection, classification, and clustering have emerged. The methods used in these studies include, among others, entity extraction, clustering, association rule mining, deviation detection, classification, and social network analysis. A combination of the traditional method and a newer method enables pattern identification from both structured and unstructured data. For instance, entity extraction is used to understand concealed patterns in data such as texts, images, and audio data. Furthermore, clustering is used to group objects into classes with similar characteristics [37, 38]. In addition, unsupervised methods such as the self-organizing map (SOM) are used to support the results of traditional criminal profiling [39]. In cases where the criminal and the related cases are known, supervised learning is applied [40]. However, although many advances have occurred in big data analytics and machine learning, these approaches are limited in supporting real-time processing, as they require high computing power to handle a large volume of training data. In fact, the large volume of crime data is a considerable challenge for the investigator in terms of gaining an appropriate understanding of a complicated relationship or responding in a timely manner. Despite these limitations, data mining yields valid, useful, and appropriate results. Data preprocessing, such as data cleaning, data integration, and data transformation, is intended to reduce noisy, incomplete, and inconsistent data. It helps to uncover and conceptualize concealed or latent crime patterns. By improving the efficiency of crime data understanding and reducing errors in the results afforded by the data-mining method, the investigator can perform reasoning, timely judgment, and quick problem solving [41].

Figure 2: CBR process used in cybercrime investigation, employing a knowledge base that is reused over similar new cases and retained for later use. Diagram labels: defaced websites, target cases, similarity measurement, case base, retrieve, reuse, proposed solution, revise, confirmed solution, retain, employ.


CBR is also used to provide the reasoning power to search for similar previous cases [25, 42]. However, biased or imperfect data deteriorate the quality of the decision support provided by CBR. Therefore, in many cases, the weights of the selected features are set based on empirical knowledge, which can subsequently be used to enable the detection and analysis of crime patterns from temporal crime activity data. Using clustering and classification techniques, as well as speculative models for searching similar past crime cases, investigators can easily extract useful information from an unstructured textual dataset [43]. Hence, investigators must collect and continuously update comprehensive crime data.

Clustering is the task of finding groups of similar items in the data, and it is generally treated as an unsupervised learning task. Zulfadhilah et al. compared four types of clustering algorithms (K-means, hierarchical clustering, SOM, and the expectation maximization (EM) algorithm) based on their performance. They concluded that the K-means algorithm and the EM algorithm are better than the hierarchical clustering algorithm. In general, partitioning algorithms such as K-means and EM are highly recommended for large datasets [44]. In summary, clustering algorithms can help the investigator detect crime patterns and accelerate crime solving. A weighting scheme for attributes can address the limitations of clustering techniques [45].
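To make the EM idea concrete for nominal case data, the following is a minimal sketch of EM for a mixture of independent categorical features. It is an assumption-laden illustration rather than the configuration used in the study: the function name `em_categorical`, the Laplace smoothing parameter `alpha`, the random initialisation, and the fixed iteration count are all choices made here for brevity.

```python
import math
import random


def em_categorical(cases, k, n_iter=50, seed=0, alpha=1.0):
    """Fit a k-component mixture of independent categorical features with EM.

    cases: list of equal-length tuples of hashable values.
    Returns (weights, labels); labels[i] is the most probable component.
    """
    rng = random.Random(seed)
    n, d = len(cases), len(cases[0])
    vocab = [{c[j] for c in cases} for j in range(d)]

    # Start from random soft assignments (responsibilities).
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and per-feature value tables.
        weights = [max(sum(resp[i][c] for i in range(n)) / n, 1e-12)
                   for c in range(k)]
        tables = []
        for c in range(k):
            comp = []
            for j in range(d):
                counts = {v: alpha for v in vocab[j]}  # Laplace smoothing
                for i in range(n):
                    counts[cases[i][j]] += resp[i][c]
                total = sum(counts.values())
                comp.append({v: cnt / total for v, cnt in counts.items()})
            tables.append(comp)

        # E-step: recompute responsibilities in log space for stability.
        for i in range(n):
            logp = [
                math.log(weights[c])
                + sum(math.log(tables[c][j][cases[i][j]]) for j in range(d))
                for c in range(k)
            ]
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            total = sum(probs)
            resp[i] = [p / total for p in probs]

    labels = [max(range(k), key=lambda c: resp[i][c]) for i in range(n)]
    return weights, labels
```

Each case here is a tuple of nominal values, and the returned labels group cases whose feature values tend to co-occur. Which component index a group receives depends on the random seed, so only the grouping itself is meaningful.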

3. Methodology

In this section, we present the detailed scheme of the decision support methodology for cybercrime investigation, with a focus on website defacement cases. The conceptual framework and its process are illustrated in Figure 3. The scheme proceeds in three steps: data preprocessing, case vector design, and the reasoning engine. First, we provide a brief outline of the dataset and describe the merits of the website defacement data. We also summarize the preprocessing (data parsing and cleaning) applied to the collected data. Next, we design the case vector and choose the significant features to which the reasoning is applied. Finally, the reasoning engine provides various functionalities and is intended for grouping (clustering) cases based on their similarity.

3.1. Preprocessing

As part of the proposed analytical framework, we developed a crawler to automatically collect 212,093 website defacement cases from the zone-h.org site. Many website defacement cases are recorded daily in the archive page of the zone-h.org site. Each case registered in the archive page provides information (i.e., IP address, domain, date, OS, notifier, and web server) in the same format through its mirror page. First, the crawler collects all public information relevant to each case. Thereafter, on accessing the domain site, it saves the data in the raw format of the HTML source. After the raw web resources are crawled, data preprocessing is performed to amend incomplete, improperly formatted, or duplicate data records. More specifically, there are various tag attributes in the HTML source. The encoding and font data are extracted through the <charset> and <font-style> tags of the HTML elements set between the <head> and </head> tags in the HTML source. Also, the image, sound file, and linked site are extracted through the <font-family>, <img>, and <href> tags set between the <body> and </body> tags in the HTML source. The web resources, as original raw data, were parsed and cleaned depending on the relevant case vector (see Figure 4). After cleaning the data, some significant data fields were selectively stored in the system's case database.
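The parsing step can be illustrated with Python's standard `html.parser` module. This is a hedged sketch rather than the authors' crawler: the class `DefacementPageParser`, the function `extract_page_features`, and the returned field names are hypothetical. It also assumes that, in ordinary HTML, the encoding is declared in a `meta` element, fonts appear in a `font` element's `face` attribute or an inline `font-family` style, and linked sites appear in the `href` attribute of `a` elements, which is a looser reading of the tag names given in the text.

```python
import re
from html.parser import HTMLParser


class DefacementPageParser(HTMLParser):
    """Collects encoding, font, image, sound, and link hints from one page."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.fonts = set()
        self.images = []
        self.sounds = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and self.charset is None:
            # Either <meta charset="..."> or the older http-equiv form.
            declared = attr.get("charset")
            if not declared:
                match = re.search(r"charset=([\w-]+)",
                                  attr.get("content", ""), re.I)
                declared = match.group(1) if match else None
            if declared:
                self.charset = declared.strip().lower()
        elif tag == "font" and attr.get("face"):
            self.fonts.add(attr["face"].strip())
        elif tag == "img" and attr.get("src"):
            self.images.append(attr["src"])
        elif tag in ("audio", "embed", "bgsound") and attr.get("src"):
            self.sounds.append(attr["src"])
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        # Inline CSS may also name fonts, e.g. style="font-family: Gulim".
        css = re.search(r"font-family\s*:\s*([^;]+)",
                        attr.get("style", ""), re.I)
        if css:
            self.fonts.update(f.strip(" '\"") for f in css.group(1).split(","))


def extract_page_features(html_text):
    """Return a dict of raw fields parsed from one saved HTML source."""
    parser = DefacementPageParser()
    parser.feed(html_text)
    parser.close()
    return {
        "encoding": parser.charset,
        "fonts": sorted(parser.fonts),
        "images": parser.images,
        "sounds": parser.sounds,
        "links": parser.links,
    }
```

A real pipeline would additionally need to handle pages that declare no encoding at all, as well as the duplicate and malformed records mentioned above; those steps are omitted here.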

The selected data fields were related to the website defacement date, related IP address, target domain, target system OS, and web server version; these aspects have proven to be useful for cyberattack investigations [46]. Specifically, the encoding method and the font contained in the HTML source were necessary to speculate on the attacker's regional information. For example, if the messages remaining in a defaced website are written in ISO/IEC 8859 encoding, we can infer that the hackers' language is German, Spanish, or Swedish. Furthermore, depending on whether all the messages are written in the same encoding method, special characters such as ß, ñ, or å can be used as a clue for guessing the attacker's origin. In general, encodings from Windows-1250 to Windows-1258 are used in the central European languages as well as in Turkish, Baltic languages, and Vietnamese. By contrast, GB encoding is used in Chinese, HKSCS encoding is used in Taiwanese, and EUC-KR or ISO-2022-KR encoding is used in Korean [47]. In addition to the font and encoding information, the text, image, audio, and video found in the messages are also necessary parameters for case identification.
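The encoding-to-region reasoning described above can be written as a small lookup. The sketch below simply mirrors the coarse examples given in the text; the function name `region_hint` and the returned strings are hypothetical, and a declared encoding is at best a weak hint that should be combined with other evidence.

```python
import re
from typing import Optional


def region_hint(charset: Optional[str]) -> Optional[str]:
    """Map a declared page encoding to a coarse language hint, if any.

    The rules only restate the examples in the text; they are not a
    complete or authoritative encoding-to-language table.
    """
    if not charset:
        return None
    label = charset.strip().lower().replace("_", "-")
    if re.fullmatch(r"windows-125[0-8]", label):
        return "Central European, Turkish, Baltic, or Vietnamese"
    if label.startswith("iso-8859"):
        return "German, Spanish, or Swedish"
    if "hkscs" in label:
        return "Taiwanese"
    if label.startswith("gb"):
        return "Chinese"
    if label in ("euc-kr", "iso-2022-kr"):
        return "Korean"
    # Universal encodings such as UTF-8 carry no regional hint.
    return None
```

Encodings that are used worldwide, such as UTF-8, deliberately return no hint, which matches the observation that this feature is informative only when a region-specific encoding is present.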

3.2. Case Vector Design

We designed two types of case vector, one for the similarity measure and one for the clustering processing. The case vector is summarized in Table 1. Features such as the font, web server, thanks-to, and notifier (hackers or hacking groups), as well as the encoding, IP address, domain, attack date, and OS, were extractable from the public archival site zone-h.org. Generally, more diverse features can be significant for investigating relationships and associations among hackers or hacking groups, as well as the scale and the density/intensity of the hacker community. However, such a premise has some shortcomings. The importance, or weight, of each feature may differ depending on the criterion. Also, if all features are retained, machine-learning algorithms such as clustering or classification become difficult to run in practice because of the high computational cost. Moreover, some features have similar meanings and would be processed redundantly. To this end, dimensionality reduction and feature selection were performed in the present study. After a thorough review by security experts, the significant features were selected for the case vector of website defacement cases. The detailed explanation of the dimensionality reduction and the feature selection is as follows.


In the Windows operating system, if a specific font is not designated in the HTML code, for example through the <font-family> property, the characters on a web page may appear broken. In particular, some fonts in the Chinese character cultural sphere depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, this characteristic may be key evidence for uncovering a correlation between the victim and the attacker; however, it is extremely rare in the collected website defacement cases. Therefore, it is not suitable as a case vector for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc., when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described above, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector for cybercrime investigation. Finally, although the thanks-to and notifier features can be used to analyse a hidden network between hackers and hacker groups, such network analysis should be addressed in future research.

As a result, we defined the case vector in two types, i.e., a version for the similarity measure and a version for the clustering processing. The encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used as features in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding is a case vector that provides decisive clues related to the attacker's region information. The IP address and domain give clues related to the victim's location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. The detailed explanation of the key features is provided in Table 1.

Figure 3: Proposed analytical framework for the data-driven website defacement cases. (Components shown: a crawler that gets the metadata and HTML source of a website defacement case through the mirror page/archive page on the zone-h.org site; preprocessing, consisting of data parsing, data cleaning, feature selection, and feature normalization; case vector design; the cases-centric DB; and the reasoning engine. In the reasoning engine, the similarity module matches a new attack case with a former attack case depending on the defined case vector, calculates weights and values, and measures the similarity score depending on the weights and values; the clustering module performs the clustering processing through the EM algorithm and derives several clusters that exhibit similar patterns.)

Figure 4: Sorted dataset through the preprocessing.

The normalization result of the various feature elements stored in the raw form of the HTML source is presented in Figure 5. The encoding was normalized into groups of the ISO series and MS Windows series according to the encoding used in each region or country. The gTLD was normalized according to groups or organizations with similar characteristics. The ccTLD was normalized by continent. Although the compression and normalization of features make analyses such as the clustering processing and the similarity measure simple and clear, they may also cause the loss of information in the original data or make detailed analysis more difficult.
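The normalization described above can be pictured as a set of lookup tables. The following is a minimal Python sketch, assuming only a handful of entries read from Figure 5 and Table 1 (for example, go and gob normalized into gov). The table names, the function name `normalize_case`, and the "Unknown" fallback are illustrative assumptions, and the assignment of individual ccTLDs to continents is inferred because the label-to-list correspondence of the figure is not fully recoverable; this is not the authors' implementation.

```python
# Illustrative lookup tables; only a few entries per feature are included.
ENCODING_GROUPS = {
    "iso-8859-6": "Arabic", "windows-1256": "Arabic",
    "iso-8859-2": "CentralEurope", "windows-1250": "CentralEurope",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
    "iso-8859-5": "Cyrillic", "windows-1251": "Cyrillic",
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
    "iso-8859-1": "WestEurope", "windows-1252": "WestEurope",
}
GTLD_GROUPS = {
    "com": "com", "co": "com",
    "org": "org", "or": "org",
    "gov": "gov", "go": "gov", "gob": "gov",
    "edu": "edu", "ac": "edu",
}
# Assumed continent assignments for a few country codes.
CCTLD_CONTINENTS = {
    "kr": "EastAsia", "cn": "EastAsia", "jp": "EastAsia",
    "br": "SouthAmerica", "ar": "SouthAmerica",
    "fr": "WesternEurope", "uk": "WesternEurope",
}


def normalize_case(encoding, gtld, cctld):
    """Map raw feature values to their normalized groups.

    Values that are missing from the tables fall back to "Unknown"
    (a hypothetical choice made for this sketch).
    """
    return (
        ENCODING_GROUPS.get(encoding.strip().lower(), "Unknown"),
        GTLD_GROUPS.get(gtld.strip().lower().lstrip("."), "Unknown"),
        CCTLD_CONTINENTS.get(cctld.strip().lower().lstrip("."), "Unknown"),
    )
```

A dictionary lookup keeps the mapping explicit and easy for domain experts to review, which fits the expert-driven feature selection described in this section.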

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance between nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, owing to a better understanding of complicated relationships and a more timely response. In the present study, we implemented a reasoning engine consisting of two processing entities, the similarity measure processing and the clustering algorithm processing (see below for further details).

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we proposed a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. If one retrieved case (RC, a new case) is given and there are n cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between RC and the TCs is conducted n times. We defined the extent of similarity between RC and a TC as a numerical value from 0 to 1, where 0 means that RC and TC are unrelated and 1 means that RC and TC are identical. The similarity score S (0 ≤ S ≤ 1) specifies the extent of similarity between RC and TC. The closer the similarity score is to 1, the more analogous RC and TC are to each other. In the event of multiple case vectors, the similarity can be expressed as a weighted sum over the case vectors (equation (1)).

Table 1: Case vector design highlighting two groups of features.

| Case vector | S | C | Description |
|---|---|---|---|
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on the MS Windows and ISO character sets. |
| IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories, such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., the go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, a specific region, or an international organization. The ccTLD normalized by continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | N/A | The date of the attack performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

S: similarity measure; C: clustering processing.


$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \text{distance}\left(RC_{cv}, TC_{cv}\right) \times \text{weight}_{cv} \right], \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be updated periodically to reflect recent attack trends. For the initial setting, however, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since the encoding makes it possible to infer the static location information of the attacker, we defined the encoding as high-quality information. The IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

Figure 5: Normalization of each feature element. (The figure maps raw values to normalized groups: encodings to language regions, namely Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe, e.g., ISO-8859-6 and Windows-1256 to Arabic; gTLDs to com, edu, gov, org, biz, mil, net, coop, and info, e.g., gov, go, and gob to gov; ccTLDs to continental regions, namely Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe; and OSs to Linux-based, MacOS, Unix-based, and Windows-based.)

Input: TCs (Tested_DB) /* Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC} /* RC means one of the retrieved cases */
       W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

TC{Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
for each TC in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end
    IP_RC{Octet A_RC, Octet B_RC, Octet C_RC, Octet D_RC}, IP_TC{Octet A_TC, Octet B_TC, Octet C_TC, Octet D_TC}
    if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) ∧ (Octet C_RC = Octet C_TC) ∧ (Octet D_RC = Octet D_TC) then
        IP_similarity_value ← 1.0
    else if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) ∧ (Octet C_RC = Octet C_TC) then
        IP_similarity_value ← 0.75
    else if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) then
        IP_similarity_value ← 0.5
    else if (Octet A_RC = Octet A_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end
    Domain_RC{ServiceName_RC, gTLD_RC, ccTLD_RC}, Domain_TC{ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) ∧ ((gTLD_RC = gTLD_TC) ∨ (ccTLD_RC = ccTLD_TC)) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) ∧ (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end
    Date_variance ← |Date_RC − Date_TC| /* converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a number of days */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end
    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end
    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity_score between RC and TC
end for

Algorithm 1: Similarity measure module.
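Algorithm 1, combined with the weights in Table 2, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary layout of a case, the function names, and the representation of a domain as a (service name, gTLD, ccTLD) tuple are all hypothetical, and the date thresholds follow the day counts given in Algorithm 1.

```python
from datetime import date

# Weights from Table 2 (encoding, IP address, domain, date, OS).
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(ip_rc, ip_tc):
    """Return 0.25 for every leading octet shared by the two addresses."""
    matched = 0
    for a, b in zip(ip_rc.split("."), ip_tc.split(".")):
        if a != b:
            break
        matched += 1
    return matched * 0.25


def domain_similarity(rc, tc):
    """Compare two (service_name, gtld, cctld) tuples with the Table 2 rules."""
    name, gtld, cctld = (x == y for x, y in zip(rc, tc))
    if name and gtld and cctld:
        return 1.0
    if name and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if name or gtld or cctld:
        return 0.1
    return 0.0


def date_similarity(d_rc, d_tc):
    """Step the value down at gaps of 365, 1095, 1825, and 2555 days."""
    gap = abs((d_rc - d_tc).days)
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return value
    return 0.0


def similarity_score(rc, tc, weights=WEIGHTS):
    """Weighted sum of the per-feature similarity values (equation (1))."""
    values = {
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(values[key] * weights[key] for key in weights)
```

Ranking the cases-centric DB against a new case then amounts to calling `similarity_score(rc, tc)` for every stored case and sorting the results in descending order. Note that in this sketch two empty ccTLD fields compare as equal; a production implementation would need an explicit rule for missing values.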

The distance of some case vectors cannot be estimated directly, as they contain mixed numerical and nominal data (such as the IP address range and domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. An IP address is composed of four octets separated by ".". In the present study, we checked whether the octets of RC and TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are listed in Table 2. The proposed approach is advantageous in that it enables efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; and go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique code that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is built in accordance with the characteristics of the groups, organizations, or corporations that the gTLD is intended for. The service name varies depending on the categories of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a rule for estimating the similarity of the domain, as listed in Table 2.

Furthermore, we defined the attack date similarity. As in offline criminal investigation, if the times at which crimes occurred are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, domain, and attack date were set to values between 0 and 1 according to the degree of variation within each range.
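As a worked illustration of equation (1) with the weights and values of Table 2, consider a hypothetical pair of cases (not taken from the dataset) that share the same encoding, match in the first two IP octets, match in gTLD and ccTLD but not in service name, occurred within one year of each other, and run different OSs:

$$S = (1 \times 0.5) + (0.5 \times 0.2) + (0.3 \times 0.15) + (1 \times 0.1) + (0 \times 0.05) = 0.745$$

Because the encoding carries half of the total weight, a mismatch in encoding alone caps the achievable score at 0.5 regardless of how well the remaining features agree.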

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that can capture the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a parameter but automatically determines a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that each data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to 80,948 data items, out of 212,093, that had the encoding, gTLD, ccTLD, and OS information needed for clustering. The character encoding was normalized into groups of related code units (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them. Therefore, it is not suitable as a feature for the clustering processing of the reasoning engine. Consequently, the characteristics and metadata of the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
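To make the EM-based clustering step more concrete, the following is a minimal pure-Python sketch of EM for a mixture of independent categorical features (encoding, gTLD, ccTLD, and OS). It is an illustration under stated assumptions rather than the tool used in the study: the function name `em_categorical`, the additive smoothing constant `alpha`, and the fixed number of clusters are all assumptions, whereas the study selects the number of clusters automatically by cross-validation.

```python
import math
import random


def em_categorical(data, n_clusters, n_iter=50, alpha=1.0, seed=0):
    """Fit a mixture of independent categorical features with EM.

    data: list of tuples of category labels, one tuple per case.
    Returns (pi, resp): the mixing weights and, for every case, the
    probability of belonging to each cluster.
    """
    rng = random.Random(seed)
    n_features = len(data[0])
    values = [sorted({row[f] for row in data}) for f in range(n_features)]

    # Start from random soft assignments.
    resp = []
    for _ in data:
        raw = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(raw)
        resp.append([r / total for r in raw])

    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and per-cluster value
        # probabilities; alpha keeps every probability strictly positive.
        mass = [sum(r[k] for r in resp) for k in range(n_clusters)]
        pi = [(m + alpha) / (len(data) + alpha * n_clusters) for m in mass]
        theta = []
        for k in range(n_clusters):
            per_feature = []
            for f in range(n_features):
                counts = {v: alpha for v in values[f]}
                for row, r in zip(data, resp):
                    counts[row[f]] += r[k]
                denom = mass[k] + alpha * len(values[f])
                per_feature.append({v: c / denom for v, c in counts.items()})
            theta.append(per_feature)

        # E-step: posterior cluster probabilities for every case,
        # computed in log space for numerical stability.
        resp = []
        for row in data:
            logp = [
                math.log(pi[k])
                + sum(math.log(theta[k][f][row[f]]) for f in range(n_features))
                for k in range(n_clusters)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
    return pi, resp
```

A hard cluster label for case n can be read off as the index of the largest entry in `resp[n]`. Choosing the number of clusters by cross-validation, as described above, would wrap this routine in a loop that compares held-out log-likelihoods for different values of `n_clusters`.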

The donut charts show the different features from outside to inside, with the share of each feature value indicated by a different colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the share of cases that a cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were Clusters 4 and 7, which share the features of using Arabic and Chinese and of attacking an industrial organization whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process contribute to the concretization of the similarity between new and existing cases. If a large number of new cases flow into the database and the clustering process is performed again on the updated dataset, the clustering result may, of course, take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. The assumption that attackers tend to use similar or unique attack methods is not always valid, and it is therefore difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps, depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 2: Value and weight for the similarity score by case vector. All of the values of the similarity score are normalized to the range from 0 to 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 143.248.1.6 and 143.248.1.6) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 143.248.1.6 and 143.248.1.8) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 143.248.1.6 and 143.248.4.4) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 143.248.1.6 and 143.13.2.4) | 0.25 |
| | | | No common octet (e.g., 143.248.1.6 and 163.13.2.5) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of the different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OSs in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage networks and Linux-based and Unix-based OSs. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage networks and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OSs in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OSs in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OSs in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, and government and state agencies, as well as Windows-based OSs, in East Asia. | 1923Turk |
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OSs in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OSs in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OSs in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OSs in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OSs in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). (Legend values: encoding: West Europe, Turkish, Central Europe, Arabic, Cyrillic, and Chinese; gTLD: com, net, org, gov, edu, and mil; ccTLD: Western Europe, East Asia, Southern Europe, South America, Eastern Europe, and Southeast Asia; OS: Windows, Linux, Unix, and MacOS.)


$$R_{k} = \frac{\text{Count}\left(\text{Cases}_{m,k}\right)}{\text{Count}\left(\text{Cases}_{\text{all},k}\right)}, \tag{3}$$

where m denotes the past cases that are within the defined scope for a randomly selected hacker k.

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and subsequently labelled each case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores, computed with the values and weights of the case vector (see Table 2), were sorted in descending order. Cases whose similarity value was 0 were not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (ranging from the top 1 percent to the top 30 percent).

$$N_{\text{Scope}} = \frac{\text{Count}\left(\text{Cases}_{\text{Scope},K}\right)}{\text{Count}\left(\text{Cases}_{\text{all},K}\right)} \times 100. \tag{2}$$
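The scoring step can be sketched as follows. This is one possible reading of equations (2) and (3), stated as an assumption: the top N scope is taken as the first N percent of the non-zero scoring list, and the ratio for a hacker is the share of all of that hacker's past cases that fall inside this scope. The function name and the (hacker name, score) pair layout are hypothetical.

```python
def ratio_in_top_scope(scoring_list, hacker, n_percent):
    """Share of `hacker`'s past cases ranked within the top n_percent.

    scoring_list: list of (hacker_name, similarity_score) pairs, one per
    stored case compared against the retrieved case.
    n_percent: integer percentage defining the top N scope.
    """
    # All past cases of the hacker, including those that scored zero.
    own_total = sum(1 for name, _ in scoring_list if name == hacker)
    if own_total == 0:
        return 0.0

    # Step 4: drop zero scores and sort in descending order of similarity.
    ranked = sorted(
        (case for case in scoring_list if case[1] > 0),
        key=lambda case: case[1],
        reverse=True,
    )

    # The scope holds the first n_percent of the ranked list (at least one case).
    cutoff = max(1, len(ranked) * n_percent // 100)
    own_in_scope = sum(1 for name, _ in ranked[:cutoff] if name == hacker)
    return own_in_scope / own_total
```

Under this reading, a hacker counts as identified at a given (N, R) criterion when the returned ratio, expressed as a percentage, is at least R.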

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As expressed in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent", was based on the result derived from the similarity measurement: attack cases were sorted in descending order of similarity score, and the cases within the top N range formed the scope (see Figure 9). Likewise, the ratio R of a randomly selected hacker is the share of that hacker's past attack cases that fell within the defined N scope (see Figure 9).

Figure 7: The developed testing procedures from Step 1 to Step 4. (Step 1, selection: 100 hackers randomly selected from the database, e.g., 1 TheBuGz, 2 Hmei7, 3 3xp1r3, ..., 98 S3cure, 99 drm1st3r, 100 Lulz53c. Step 2, case labelling: the cases of each hacker. Step 3, case extraction: a retrieved case, i.e., the most recent case, compared with the cases-centric DB. Step 4, scoring: a list with the columns hacker name, date, encoding, IP address, domain, OS, and score, computed with equation (1).)

Figure 10 shows the number of hackers identified for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), for each of the six criterion factors of the ratio R (%). The Y-axis presents the number of hackers identified in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because hackers or hacking groups that attacked only the same or similar targets were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of candidate hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.
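The counts plotted in Figure 10 can be tabulated with a small helper. The sketch below assumes that a hacker is counted as identified at a criterion (N, R) when at least R percent of that hacker's past cases fall within the top N percent of the scoring list; this interpretation, the function name, and the input layout are assumptions made for illustration.

```python
TOP_N = (1, 3, 5, 10, 15, 20, 25, 30)   # top N scope criteria (%)
RATIO_R = (50, 60, 70, 80, 90, 100)     # ratio R criteria (%)


def count_identified(ratios_by_hacker):
    """Count identified hackers for every (N, R) criterion pair.

    ratios_by_hacker[hacker][n] is the percentage of that hacker's past
    cases found within the top n percent of the scoring list.
    Returns {(n, r): number of hackers whose ratio at scope n is >= r}.
    """
    return {
        (n, r): sum(
            1 for ratios in ratios_by_hacker.values() if ratios[n] >= r
        )
        for n in TOP_N
        for r in RATIO_R
    }
```

Because a wider scope can only include more cases and a lower R is easier to satisfy, the counts grow with N and shrink with R, which is the trend described for Figure 10.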

Therefore, an investigator should consider the availability and flexibility of the data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate cases in this way, our similarity mechanism is valuable in shortening the time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Although the data are insufficient to fully evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hackers or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third-largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-Calaveras image (i.e., skull image) used on LG U+'s defaced website had appeared on many European websites. The character encoding of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI", the word written on the KBS homepage, refers to the front line of the Roman troops, hinting that the DS cyberattack was the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages as well as other features from the similarity processes owing to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows screenshots of the defaced websites at that time.

Figure 8: The number of website defacement attack cases in the past of each hacker. (X-axis: hacker; Y-axis: number of cases.)

Figure 9: Scoring step on the top N scope and the ratio R. (Top N scope: 1 to 30 percent; ratio R: 50 to 100 percent.)

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. In terms of colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities of the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, DS and SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases, i.e., as input values, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers. (x-axis: criterion of the top N (%), from Top 1 to Top 30; y-axis: number of identified hackers, 0 to 100; series: ratio of the attack cases (%), 50 to 100.)

Security and Communication Networks 15

The similarity measure described above is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarize the characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belong to Clusters 0 and 8, which were the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7); the hackers of Cluster 7 used the encoding systems pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.
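The retrieval of the most similar past cases described above can be sketched as follows. This is a minimal illustration only: equation (1) of the paper is not reproduced in this section, so the weighted attribute-match score, the attribute names, and the weights below are hypothetical stand-ins rather than the authors' actual measure.

```python
from typing import Dict, List, Tuple

Case = Dict[str, str]

# Hypothetical integer weights per attribute (not taken from the paper).
WEIGHTS: Dict[str, int] = {"encoding": 4, "os": 2, "tld": 2, "year": 2}


def similarity(target: Case, stored: Case, weights: Dict[str, int] = WEIGHTS) -> float:
    """Weighted share of attributes on which the two cases agree (0.0 to 1.0)."""
    total = sum(weights.values())
    matched = sum(
        w
        for attr, w in weights.items()
        if target.get(attr) is not None and target.get(attr) == stored.get(attr)
    )
    return matched / total


def retrieve_top_k(target: Case, case_base: List[Case], k: int = 3) -> List[Tuple[float, Case]]:
    """Score every stored case against the target case and keep the k best."""
    scored = [(similarity(target, case), case) for case in case_base]
    # Sort on the score only; the sort is stable, so ties keep case-base order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Calling `retrieve_top_k(target, case_base, k=3)` would return a ranked list comparable in shape to the "tested cases" columns of Table 4, although the scores depend entirely on the assumed weights.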

Next, to ensure the objectivity of the similarity scores obtained in the DS and SPE case study, we computed the similarity score of randomly selected pairs of cases drawn from the whole case base. Figure 12(a) shows the distribution of the similarity scores of the randomly selected cases. We obtained this distribution by relying on the central limit theorem, which describes the distribution of the averages of random samples drawn from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even though they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. In the DS case, the top similarity score was 0.69, and the measured cases were concentrated around similarity scores (x-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (x-axis) of 0.0 to 0.2.
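The baseline experiment above, in which two cases are drawn at random and scored 10,000 times, can be sketched as follows. The `match_ratio` function is a hypothetical placeholder for the paper's similarity measure, and the attribute names and seed are assumptions made only for this illustration.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple

Case = Dict[str, str]


def match_ratio(a: Case, b: Case, attrs: Sequence[str]) -> float:
    """Placeholder similarity: fraction of attributes with equal, non-missing values."""
    hits = sum(1 for x in attrs if a.get(x) is not None and a.get(x) == b.get(x))
    return hits / len(attrs)


def random_pair_scores(
    cases: List[Case], attrs: Sequence[str], n_pairs: int = 10_000, seed: int = 0
) -> List[float]:
    """Draw n_pairs random pairs of distinct cases and score each pair."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(cases, 2)  # two different positions in the case base
        scores.append(match_ratio(a, b, attrs))
    return scores


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return the mean and population variance of the baseline scores."""
    return statistics.mean(scores), statistics.pvariance(scores)
```

A score for a new case can then be judged against this baseline: a value well above the baseline mean suggests more than chance-level agreement under the chosen measure.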

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. That is, the cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores over the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores obtained from the TCs in the DS and SPE cases.
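One way to compute a per-hacker mean similarity from that hacker's own past cases is sketched below. The text does not specify how the past cases are paired, so averaging over all pairs of a hacker's cases is an assumption here, as are the `notifier` key and the placeholder `match_ratio` measure.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence

Case = Dict[str, str]


def match_ratio(a: Case, b: Case, attrs: Sequence[str]) -> float:
    """Placeholder similarity: fraction of attributes with equal, non-missing values."""
    hits = sum(1 for x in attrs if a.get(x) is not None and a.get(x) == b.get(x))
    return hits / len(attrs)


def mean_similarity_per_hacker(
    cases: List[Case], attrs: Sequence[str], key: str = "notifier"
) -> Dict[str, float]:
    """Average pairwise similarity among each hacker's own past cases."""
    by_hacker: Dict[str, List[Case]] = defaultdict(list)
    for case in cases:
        by_hacker[case[key]].append(case)

    result: Dict[str, float] = {}
    for hacker, own_cases in by_hacker.items():
        pair_scores = [match_ratio(a, b, attrs) for a, b in combinations(own_cases, 2)]
        if pair_scores:  # hackers with a single case have no pair to compare
            result[hacker] = mean(pair_scores)
    return result
```

Averaging the returned values over all hackers gives a single reference level against which the scores of the tested cases can be compared.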

Figure 11: A snippet of website defacement cases comparing examples from the DS and SPE cases: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

Attribute | Retrieved case (case name) | Tested case (notifier) | Tested case (notifier) | Tested case (notifier)
Name | DarkSeoul (DS) | Hmei7 | d3b_X | StifLer
Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9
IP address | 203248195178 | 2038623868 | 2031243766 | 77921083
Domain | gyunggi.onnet21.com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr
Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.690 | 0.675 | 0.665
Cluster | - | 0 | 8 | 4

Attribute | Retrieved case (case name) | Tested case (notifier) | Tested case (notifier) | Tested case (notifier)
Name | Sony Pictures Entertainment (SPE) | Oaddah | MTRiX | EL_MuHaMMeD
Encoding | EUC-KR EUC-CN | GB2312 | GB2312 | GB2312
IP address | 203131222102 | 2031241555 | 20829198 | 2081164534
Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalrom.com | digitalairstrip.net
Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.615 | 0.615 | 0.600
Cluster | - | 7 | 7 | 7

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the attacks by the similar hackers occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. To describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, a type of Sankey diagram developed to represent changes in a network structure over time [53], as shown in Figure 14. It shows the investigation of the top three hackers whose website defacement cases were most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator to clearly understand the flow and core of the crime by laying out multidimensional evidence that is otherwise entangled or hidden and therefore hard to present.
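The aggregation behind such a follow-up view can be sketched in a few lines. The field names (`notifier`, `year`, `cctld`, `gtld`) are hypothetical, and the sketch only counts flows and per-hacker attack rates; it does not draw the alluvial diagram itself.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Case = Dict[str, str]


def flow_counts(
    cases: List[Case], dims: Sequence[str] = ("notifier", "year", "cctld", "gtld")
) -> Counter:
    """Count how many cases follow each path through the chosen dimensions.

    Each count corresponds to the thickness of one flow in an alluvial diagram.
    """
    return Counter(tuple(case.get(d, "Unknown") for d in dims) for case in cases)


def attack_rate(cases: List[Case], notifier: str, field: str) -> Dict[str, float]:
    """Share (in percent) of one hacker's attacks per value of the given field."""
    own = [case for case in cases if case.get("notifier") == notifier]
    if not own:
        return {}
    counts = Counter(case.get(field, "Unknown") for case in own)
    return {value: 100.0 * n / len(own) for value, n in counts.items()}
```

For example, `attack_rate(cases, "some_notifier", "year")` yields a per-year percentage breakdown of the kind reported per hacker in Table 5.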

5. Limitations and Discussion

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, var = 0.0866). (b) Distribution of the similarity values between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. (x-axis: similarity score; y-axis: frequency.)

Figure 13: Distribution of the similarity score for randomly selected 100 hackers. (x-axis: mean value of the similarity score; y-axis: frequency.)

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing the cases are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we aimed to demonstrate the practicability of the proposed methodology as a proof of concept and therefore focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents are included in our study. In particular, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (in percent).

Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE)
Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98
Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91
Net | 3.40 | 3.20 | - | - | 5.46 | 5.74
Gov | 12.16 | 6.51 | - | - | 1.06 | -

Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE)
2002 | - | - | - | - | 10.74 | -
2003 | - | - | - | - | 89.08 | -
2006 | - | - | - | - | - | -
2007 | 0.09 | - | - | - | 0.18 | -
2008 | - | - | - | - | - | -
2009 | 3.15 | - | - | - | - | 99.57
2010 | 0.09 | - | - | - | - | -
2011 | 0.34 | - | - | - | - | -
2012 | 3.40 | - | - | 100.00 | - | -
2013 | 34.86 | 39.17 | 100.00 | - | - | -
2014 | 58.08 | 59.77 | - | - | - | 0.43
2015 | - | 1.07 | - | - | - | -

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). (Alluvial diagram; axes from left to right: Hacker, Year, ccTLD, gTLD, Attack.)

In this study, the similarity of website defacement cases was assessed through the similarity measure and clustering, using CBR as the methodology. The raw data collected from the defaced websites' resources were sanitized through data parsing and data cleaning. Based on this large real-world dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed and the significant features were chosen for use in case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding relevant incident cases and significant data on prime incidents, and data-driven and evidence-driven decision making should be the critical process. In addition, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. The proposed methodology also facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be considered. In other words, the use of additional online or offline information acquired through human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the constituent elements of a crime and narrow the scope of the investigation. Furthermore, the proposed methodology can be extended to represent incident information in a form that is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of information-sharing systems [54].

The web resources gathered from the zone-h.org site contain features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) as well as image and mp3 files. Although these features appear for only a small number of hackers in the web resources, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers' malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


Korea's state-sponsored hacking group; the report also found that the malware of the WannaCry ransomware case was related to the one used by the Lazarus group [3]. In the Operation Blockbuster report released by Novetta in 2016, the malware attributed to the Lazarus group was reported to fall, hypothetically, into two basic classes: the wipers and the DDoS malware [4]. The noticeable features of these attacks underpin our interest in the Lazarus group's attacks related to the DS case in 2013 and the SPE case in 2014.

On March 20, 2013, in the DS case, the attack destroyed approximately 48,700 computerized and networked equipment items, such as PCs, servers, and network devices, of major banks and TV broadcasters in South Korea. South Korea suffered a coordinated strike by simple but very effective and destructive malware called Wiper A, B, and C [5]. In certain Windows OS environments, the wiper scripts attempted to remove any directories after attempting to overwrite each file with a specific string pattern (i.e., “HASTATI,” “PRINCIPES,” or “PRNCPES”) [6, 7]. In another incident initiated by the Lazarus group, SPE was hacked by the self-named Guardians of Peace (GOP) hacker group. Several malware analysis groups reported that the GOP attack was also related to the North Korean cyber army [8]. The malware used in this attack contained strings written using the Romanization of Korean words (i.e., Korean words were spelled using Latin letters following the English pronunciation). Of note, while the Korean language as spoken in North Korea and South Korea is linguistically identical, there are several important differences in terms of vowels and consonants, phonetic notation, and word spacing [9]. In the aforementioned case, the Romanized words captured in the malware included various contemporary North Korean words.

From 2009 to 2017, along with the attacks mentioned above, the Lazarus group launched many other attacks (see Figure 1 for further details).

There have been numerous attempts in industry and academia to perform hacker profiling and to handle attack incidents. These approaches can be categorized into the following three types: human-centric analysis, malware-centric analysis, and case-centric analysis. The human-centric analysis approach focuses on hacker network analysis. Known hacker activities (e.g., message postings and discussions) in hacker communities provide clues for identifying key actors by their reputation. In addition, this approach can classify the tendency of a hacker based on social networking methods [10, 11]. Unlike the human-centric analysis, the malware-centric approach primarily assumes that the same malware and its variants were developed by the same or closely related hacker groups. Among others, features such as API call sequences and control flow can be used to estimate the similarity between newly detected malware and known malware [12–14]. In fact, many previous studies on hacker profiling have primarily focused on using information derived from the analysis of the malware itself. While malware analysis can provide information about a malware's functionality and its similarity to previously known malware families, tracing and analysing hacker information based on malware alone has the limitation that the core information can be circumvented. The last approach is the case-centric analysis, and our methodology falls into this category. Overall, only a few proposals apply traditional investigation methods, such as criminal profiling, to cyber incident investigation; however, many systematic approaches are currently under development. From the viewpoint of cyber intelligence analysis, the case-centric analysis has the advantage of making it possible to understand the purpose of attack campaigns; as with other methods of analysis, it is important to build profiles of attackers. When performed successfully, such characterizations can facilitate estimating and predicting the attacker's next target in advance.

Based on this insight, the present study proposes a CBR-based decision support methodology for cybercrime investigation. In terms of data, website defacement cases that occurred between 1998 and 2015 were retrieved from the public archival site zone-h.org (an archive of defaced websites, http://www.zone-h.org). After crawling the web resources in Hypertext Markup Language (HTML) format, data preprocessing, consisting of data parsing and data cleaning, was performed to amend incomplete, improperly formatted, or duplicate data records. The case vector was designed to intuitively express the defaced website cases collected from the public archival site. The reasoning engine can start its main work only after the data preprocessing and the case vector design are complete. The CBR-based similarity measurement was then performed on the case vectors. The clustering algorithm was performed to group the abstracted crime cases into classes of similar cases. Based on the results concerning the DS and SPE cases, we evaluated the performance of the framework for cybercrime investigation by means of the similarity measure and the clustering algorithm. The results demonstrated that the proposed methodology can be used as a Decision Support System (DSS) to obtain meaningful information about the most similar past cases and related hacker groups.
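The preprocessing step, which turns crawled HTML into structured case vectors, might look roughly like the following sketch. The record layout (`notifier`, `domain`, `date`, `html`), the chosen case-vector fields, and the rule for spotting duplicates are all assumptions made for illustration; the paper's actual parser and case vector are defined elsewhere in the article.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Optional


class MetaCharsetParser(HTMLParser):
    """Collects the first character encoding declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # HTML5 style: <meta charset="...">
            self.charset = attr["charset"].strip().lower()
        elif "charset=" in (attr.get("content") or "").lower():
            # Older style: <meta http-equiv=... content="text/html; charset=...">
            content = attr["content"].lower()
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def to_case_vector(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Parse one crawled record into a flat, comparable case vector."""
    parser = MetaCharsetParser()
    parser.feed(record.get("html", ""))
    domain = record.get("domain", "").strip().lower()
    return {
        "notifier": record.get("notifier", "").strip(),
        "domain": domain,
        "tld": domain.rsplit(".", 1)[-1] if "." in domain else None,
        "date": record.get("date", "").strip(),
        "encoding": parser.charset,
    }


def preprocess(records: Iterable[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Drop incomplete and duplicate records and return the cleaned case vectors."""
    seen = set()
    cases = []
    for record in records:
        case = to_case_vector(record)
        if not case["notifier"] or not case["domain"]:
            continue  # incomplete record
        key = (case["notifier"], case["domain"], case["date"])
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cases.append(case)
    return cases
```

Once the cases are in this flat form, attribute-wise comparison and clustering become straightforward, which is the point of designing the case vector before running the reasoning engine.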

+e main contributions of the present study are sum-marized as follows

(i) We present a CBR-based decision support meth-odology for cybercrime investigation With theproposed cybercrime investigation scheme securityanalysts can find past attack cases that are mostsimilar to a given attack and thus obtain insights touncover the networks of cybercrime related towebsite defacement To deliver high-value in-telligence we adopt the data-driven analytic system

(ii) We demonstrate the clustering processing and vi-sualization +e clustering processing enables aninvestigator to efficiently explore large data andinterpret the results Furthermore the visualizationhelps an investigator to intuitively recognize crimepatterns

(iii) We propose that it is possible to measure the sim-ilarity score and to perform the clustering algorithmby transforming unstructured data (ie web de-facement cases) into calculable structured data

2 Security and Communication Networks

(iv) We report case studies based on the real dataset gathered from the zone-h.org site to demonstrate the various aspects of our proposed algorithm. Finally, to foster further research, our dataset (the dataset for cybercrime investigation focused on the data-driven website defacement analysis, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling) is made publicly available [15, 16].

The rest of the paper is organized as follows. Section 2 provides a summary of the literature related to our work. The detailed methodology is described in Section 3. Section 4 reports the experimental results and analysis based on the case study. The limitations of our work and a discussion on the proposed approach are presented in Section 5. Finally, Section 6 concludes the paper and suggests directions for further research.

2. Related Work

In this section, we primarily highlight previous studies closely related to CBR and review two streams of literature on traditional criminal profiling and cybercrime profiling. We also elaborate on data mining-based cybercrime profiling. The review covers the following: (1) the CBR studies that help better understand our research context; (2) the traditional criminal profiling and cybercrime profiling literature on tracking down an elusive criminal or a concealed clue; (3) the data mining-based cybercrime profiling literature that can support and theoretically reinforce our methodology.

2.1. Case-Based Reasoning. CBR is a method that uses past experiences or cases to solve new problems. Even when the new problems are not exactly identical to the previous cases, CBR can suggest a partial solution to the new problems [17]. CBR can be categorized as a data-mining technique, as it can classify the given samples and predict the result for a new case. As case studies are intuitive and easily understood by humans, CBR has long been used in many fields, including customer technical support, medical case search, and legal case search. The general model of the four-step CBR process [18] is shown in Figure 2.

The four phases are as follows:

(i) Retrieve: given a new website defacement case, relevant cases are retrieved from the knowledge base to solve the case at hand.

(ii) Reuse: solutions from previous website defacement cases are mapped for reuse.

(iii) Revise: on mapping and testing previous solutions to the target case, the solutions are revised to account for the changes in the cases.

(iv) Retain: after the solution has been successfully adapted to the problem, a meaningful experience is stored as a new case in the knowledge base.
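The four phases above can be sketched as a small in-memory case base. This is only an illustrative sketch: the class and function names (`Case`, `CaseBase`, `overlap`) and the toy overlap similarity are assumptions introduced here, not part of the proposed system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Case:
    """A past incident: its feature vector and the recorded conclusion."""
    features: dict
    solution: str


@dataclass
class CaseBase:
    """Hypothetical in-memory knowledge base covering the four CBR phases."""
    similarity: Callable[[dict, dict], float]
    cases: List[Case] = field(default_factory=list)

    def retrieve(self, query: dict) -> Optional[Tuple[Case, float]]:
        """Retrieve: return the most similar stored case and its score."""
        if not self.cases:
            return None
        best = max(self.cases, key=lambda c: self.similarity(query, c.features))
        return best, self.similarity(query, best.features)

    def reuse(self, retrieved: Case) -> str:
        """Reuse: propose the retrieved solution for the new case."""
        return retrieved.solution

    def revise(self, proposed: str, analyst_note: str = "") -> str:
        """Revise: let the analyst adapt the proposal to the new case."""
        return f"{proposed} ({analyst_note})" if analyst_note else proposed

    def retain(self, query: dict, confirmed: str) -> None:
        """Retain: store the confirmed solution as a new case."""
        self.cases.append(Case(query, confirmed))


def overlap(a: dict, b: dict) -> float:
    """Toy similarity: fraction of features with identical values."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys) if keys else 0.0
```

In use, a new defacement case would be passed to `retrieve`, the proposal returned by `reuse` reviewed through `revise`, and the confirmed result stored with `retain`, so that later queries can match against it.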

CBR starts with a given set of cases for training, forms generalizations of the given examples, and subsequently identifies the commonalities between the retrieved case and the target case. When applied to website defacement cases composed of descriptive and nominal data, it can effectively determine the commonality among the crawled hacking cases and quickly search for the nearest related case. Furthermore, CBR can be used to search for the most similar cases and retrieve past solutions from the latest response cases. CBR thus helps security administrators make better decisions. For example, Kim et al. proposed a DSS for incident response based on CBR [19].

CBR has been extensively used in several areas, such as management for product development, medicine, and engineering applications [20–22]. In addition, several CBR approaches are available for cyber incident profiling. For instance, Kim et al. proposed an intelligent system that can measure the similarity between past and new attacks. In their work, the authors demonstrated this capability in uncovering zero-day attacks using string similarity analysis of captured packet-level data [23]. Horsman et al. proposed the CBR-FT framework, a method for collecting and reusing past digital forensic investigation information to highlight likely evidential areas on a suspect operating system. It helps an investigator decide quickly and precisely where to search for evidence [24].

Figure 1: Timeline of the Lazarus group activities from 2009 to 2017.

2.2. Traditional Criminal Profiling and Cybercrime Profiling. Profiling is used in various sectors of society to investigate a criminal's mentality. Criminal profiling is a profiling technique for criminal investigation based on the psychological and behavioural patterns of a criminal [25, 26]. The criminal aspects and crime factors can be identified through the evidence and through insights into psychological and behavioural bias [27]. In the field of criminology, the widely used profiling technique is called the Modus Operandi (MO). It describes a suspect's behaviour and the evidence elements of a crime; that is, it describes how a suspect commits their crimes. The Modus Operandi changes based on the offender's criminal conduct and interaction with the surroundings, such as the time, date, and location of the crime. Moreover, it evolves based on how the offender reaches his/her victim [28, 29].

Based only on traditional criminal profiling techniques and empirical knowledge, it is difficult for a cybercrime investigator to reduce errors in the investigative process and to untangle the complexity of a cybercrime. However, if the investigator is provided with sufficient information and detailed analysis data to understand the unclear motivation and the elusive pattern related to the cybercrime, they can infer the reason(s) for the crime at stake and produce both general and specific outlines of the criminal [26]. The cybercrime network and its characteristics can be important indicators to differentiate between key figures in cybercrime organizations and those of passing interest. In addition, the activity periods and message content patterns of the participants in an illegal community can help the investigator to carefully identify and scrutinize the key figures in the cybercrime network [30, 31]. By automating cybercrime profiling and data-mining methods of analysis through a cross-analysis of various behavioural patterns, we can anticipate potential criminal activities and identify new profiles that pose serious threats to the community. Furthermore, data-mining methods such as entity extraction, clustering/classification techniques, and social network analysis make it possible to efficiently explore large data. Network visualization enables an investigator to intuitively recognize the crime pattern [32–34].

In general, the accuracy of CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate [35]. Although the effectiveness of data-driven investigation can decrease owing to dynamic and fast-evolving crime patterns, understanding the hidden correlations and latent behaviour in such data using large data analytic techniques is another promising research direction. Accordingly, many law enforcement agencies have been adopting future crime prediction systems based on statistics about weather, cleanliness, location, demographic distribution, education level, and wealth level. Based on the crime prevention through environmental design (CPTED) theory [36], many pieces of data correlated with crime are collected and analysed to estimate the crime probability. However, while many data-driven approaches to support traditional criminal profiling are available, only a few research efforts have focused on cybercrime profiling.

2.3. Data-Driven Cybercrime Profiling. In addition to the traditional criminal profiling for offline crime investigations, various profiling techniques have been developed; in these techniques, it is assumed that cybercriminals also show similar behavioural and psychological characteristics. Owing to the recent advances in data-mining and machine-learning algorithms, many studies regarding criminal pattern detection, classification, and clustering have emerged. The methods used in these studies include, among others, entity extraction, clustering, association rule mining, deviation detection, classification, and social network analysis. A combination of the traditional method and a newer method enables pattern identification from both structured and unstructured data. For instance, entity extraction is used to understand concealed patterns in data such as text, images, and audio. Furthermore, clustering is used to group objects into classes with similar characteristics [37, 38]. In addition, unsupervised methods such as the self-organizing map (SOM) are used to support the results of traditional criminal profiling [39]. In cases where the criminal and the related cases are known, supervised learning is applied [40]. However, although many advances have occurred in big data analytics and machine learning, these approaches are limited in supporting real-time processing, as they require high computing power to handle a large volume of training data. In fact, the large volume of crime data is a considerable challenge for the investigator, both in gaining an appropriate understanding of complicated relationships and in responding in a timely manner. Despite these limitations, data mining yields valid, useful, and appropriate results. Data preprocessing, such as data cleaning, data integration, and data transformation, is intended to reduce noisy, incomplete, and inconsistent data. It helps to uncover and conceptualize concealed or latent crime patterns. By improving the efficiency of crime data understanding and reducing errors in the results afforded by the data-mining method, the investigator can perform reasoning, timely judgment, and quick problem solving [41].

Figure 2: CBR process used in cybercrime investigation, employing a knowledge base that is reused over similar new cases and retained for later use.


CBR is also used to provide the reasoning power to search for similar previous cases [25, 42]. However, biased or imperfect collected data deteriorate the quality of the decision support provided by CBR. Therefore, in many cases, the weights of the selected features are set based on empirical knowledge, which can subsequently be used to enable the detection and analysis of crime patterns from temporal crime activity data. Using clustering and classification techniques, as well as speculative models for searching similar crime cases in the past, investigators can easily extract useful information from an unstructured textual dataset [43]. Hence, investigators must collect and continuously update comprehensive crime data.

Clustering is the task of determining groups of similar items in the data. Clustering is a type of unsupervised learning. Zulfadhilah et al. compared four clustering algorithms (K-means, hierarchical clustering, SOM, and the Expectation Maximization (EM) algorithm) based on their performance. They concluded that the K-means algorithm and the EM algorithm are better than the hierarchical clustering algorithm. In general, partitioning algorithms such as the K-means and EM algorithms are highly recommended for large datasets [44]. In summary, a clustering algorithm can help the investigator detect crime patterns and accelerate crime solving. A weighting scheme for attributes can address the limitations of the clustering techniques [45].

3. Methodology

In this section, we present the detailed scheme of the decision support methodology for cybercrime investigation, with a focus on website defacement cases. The conceptual framework and its process are illustrated in Figure 3. The scheme proceeds in three steps: data preprocessing, case vector design, and the reasoning engine. First, we provide a brief outline of the dataset and describe the merits of the website defacement data. We also summarize the preprocessing for data parsing and cleaning with regard to the collected data types. Next, we design the case vector and choose the significant features for the reasoning process. Finally, the reasoning engine, which has various functionalities, groups (clusters) the cases based on their similarity.

3.1. Preprocessing. As part of the proposed analytical framework, we developed a crawler to automatically collect 212,093 website defacement cases from the zone-h.org site. Many website defacement cases are recorded daily in the archive page of the zone-h.org site. Each case registered in the archive page provides information (i.e., IP address, Domain, Date, OS, Notifier, and Web server) in the same format through its mirror page. First, the crawler collects all public information relevant to each case. Thereafter, on accessing the domain site, it saves the data in the raw format of the HTML source. After crawling the web resources as raw data, data preprocessing is performed to amend incomplete, improperly formatted, or duplicate data records. More specifically, the HTML source contains various tag attributes. Encoding and font data are extracted through the <charset> and <font-style> tags of the HTML elements set between the <head> and </head> tags in the HTML source. Also, the image, sound file, and linked site are extracted through the <font-family>, <img>, and <href> tags set between the <body> and </body> tags in the HTML source. The web resources, as original raw data, were parsed and cleaned depending on the relevant case vector (see Figure 4). After cleaning the data, some significant data fields were selectively stored in the system's case database.
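The parsing step can be illustrated with Python's standard `html.parser` module. The crawler's actual parsing rules are not given here, so the sketch below is an assumption: it reads the encoding from the `charset` declaration of a `meta` element, fonts from the `face` attribute of `font` elements, images from `img` elements, and linked sites from `a` elements. The names `DefacementPageParser` and `parse_page` are hypothetical.

```python
import re
from html.parser import HTMLParser


class DefacementPageParser(HTMLParser):
    """Collects encoding, fonts, images, and links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.fonts, self.images, self.links = [], [], []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and self.charset is None:
            if a.get("charset"):
                self.charset = a["charset"].strip().lower()
            else:
                # Older pages declare it inside the content attribute.
                m = re.search(r"charset=([\w-]+)", a.get("content", ""), re.I)
                if m:
                    self.charset = m.group(1).lower()
        elif tag == "font" and a.get("face"):
            self.fonts.append(a["face"])
        elif tag == "img" and a.get("src"):
            self.images.append(a["src"])
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])


def parse_page(html: str) -> dict:
    """Return the fields of one mirrored page that feed the case vector."""
    parser = DefacementPageParser()
    parser.feed(html)
    return {"encoding": parser.charset, "fonts": parser.fonts,
            "images": parser.images, "links": parser.links}
```

Records with a missing encoding (`None`) or duplicated fields could then be filtered out in the cleaning step before the case vector is built.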

The selected data fields were related to the information about the website defacement date, related IP address, target domain, target system OS, and web server version; these aspects have proven to be useful for cyberattack investigations [46]. Specifically, the encoding method and the font contained in the HTML source were necessary to speculate on the attacker's regional information. For example, if the messages remaining in a defaced website are written in an ISO/IEC 8859 encoding, we can subsequently infer that the hackers' language is German, Spanish, or Swedish. Furthermore, depending on whether all the messages are written in the same encoding method, the special characters used, such as ß, ñ, or å, can serve as a clue for guessing the attacker's origin. In general, encodings from Windows-1250 to Windows-1258 are used for the central European languages as well as for Turkish, Baltic languages, and Vietnamese. By contrast, GB encoding is used for Chinese, Big5 (including its HKSCS extension) is used for Traditional Chinese in Taiwan and Hong Kong, and EUC-KR or ISO-2022-KR encoding is used for Korean [47]. In addition to the font and encoding information, the text, image, audio, and video found in the messages are also necessary parameters for case identification.

3.2. Case Vector Design. We designed the case vector in two types, one for the similarity measure and one for the clustering processing. The case vector is summarized in Table 1. Features of various aspects, such as the font, web server, thanks-to, and notifier (hackers or hacking groups), as well as features such as the encoding, IP address, domain, attack date, and OS, were extractable from the public archival site zone-h.org. Generally, more diverse features can be a significant factor for investigating relationships and associations among hackers or hacking groups and the scale and the density/intensity of the hacker community. However, such a premise has some shortcomings. The importance or weight of each feature may differ depending on the criterion. Also, if all features are used, machine-learning algorithms such as clustering or classification are difficult to run in practice because of the high computational cost of the analysis. Moreover, some features are redundant because they carry similar meanings. To this end, dimensionality reduction and feature selection were performed in the present study. After a thorough review by security experts, the significant features were selected for the case vector of website defacement cases. The detailed explanation of the dimensionality reduction and the feature selection is as follows.


In the Windows operating system, if a specific font is not designated in the HTML code, for example through the <font-family> property, the characters on a website page may appear broken. In particular, some fonts in the Chinese character cultural area depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, this characteristic may be key evidence to uncover a correlation between the victim and the attacker; however, it is extremely rare in the collected website defacement cases. Therefore, it is not suitable as a case vector for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc. when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described above, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector for cybercrime investigation. Finally, although the case vector concerning thanks-to and notifier can be used to analyse a hidden network between the hackers and hacker groups, such network analysis is left for future research.

As a result, we defined the case vector in two types, i.e., a version for the similarity measure and a version for the clustering processing. As the features of the case vector, the encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding is a case vector that provides decisive clues related to the attacker's regional information. The IP address and domain give clues related to the victim's location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. The detailed explanation of the key features is provided in Table 1.

Figure 3: Proposed analytical framework for the data-driven website defacement cases. The crawler gets the metadata and HTML source of a website defacement case through the mirror page of the archive page in the zone-h.org site. Preprocessing consists of data parsing and data cleaning. Case vector design consists of feature selection and feature normalization. In the reasoning engine, the similarity module calculates weights and values, matches a new attack case with a former attack case depending on the defined case vector, and measures the similarity score; the clustering module performs the clustering processing through the EM algorithm and derives several clusters that exhibit similar patterns. Both modules work on the cases-centric DB.

Figure 4: Sorted dataset through the preprocessing.

The normalization result of the various feature elements stored in the raw form of the HTML source is presented in Figure 5. The encoding was normalized by mapping the ISO series and MS Windows series to the region or country in which each encoding is used. The gTLD was normalized by grouping organizations with similar characteristics. The ccTLD was normalized by continent. Although the compression and normalization of features make analyses such as the clustering processing and the similarity measure simple and clear, they may also cause a loss of information from the original data or make detailed analysis more difficult.
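The normalization amounts to a lookup from raw values to group labels. The following minimal sketch uses only a subset of the groups listed in Figure 5; the function names (`invert`, `normalize`) and the `"unknown"` fallback are assumptions made for illustration.

```python
# Subset of the encoding groups shown in Figure 5 (group -> raw values).
ENCODING_GROUPS = {
    "Arabic": ["iso-8859-6", "windows-1256"],
    "Central Europe": ["iso-8859-2", "windows-1250"],
    "Chinese": ["gb2312", "gb18030", "gbk"],
    "Korean": ["iso-2022-kr", "euc-kr"],
    "West Europe": ["iso-8859-1", "windows-1252"],
}

# Subset of the gTLD groups shown in Figure 5.
GTLD_GROUPS = {
    "com": ["com", "co", "int"],
    "edu": ["edu", "ac"],
    "gov": ["gov", "go", "gob"],
    "org": ["org", "or"],
}


def invert(groups: dict) -> dict:
    """Turn {group: [members]} into a {member: group} lookup table."""
    return {m: g for g, members in groups.items() for m in members}


ENCODING_TO_REGION = invert(ENCODING_GROUPS)
GTLD_TO_GROUP = invert(GTLD_GROUPS)


def normalize(value: str, table: dict, default: str = "unknown") -> str:
    """Map a raw feature value to its normalized group label."""
    return table.get(value.strip().lower(), default)
```

Values outside the tables (for example, Unicode encodings) fall back to the hypothetical `"unknown"` label here; how such values are handled in practice is a separate design decision.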

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance between nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group the abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, owing to a better understanding of complicated relationships and a more timely response. In the present study, we implemented the reasoning engine as two processing entities: the similarity measure processing and the clustering algorithm processing (see below for further details).

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we propose a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. If one retrieved case (RC, a new case) is given and there are n cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between RC and the TCs is conducted n times. We define the extent of similarity between RC and a TC as a numerical value from 0 to 1, where 0 means that RC and TC are unrelated and 1 means that RC and TC are identical. The similarity score S (0 ≤ S ≤ 1) specifies the extent of similarity between RC and TC: the closer the score is to 1, the more analogous RC and TC are to each other. In the event of multiple case vectors, the similarity can be expressed as a weighted sum over the case vectors, as given in equation (1).

Table 1: Case vector design, highlighting two groups of features.

| Case vector | S | C | Description |
|---|---|---|---|
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on MS Windows and the ISO character set. |
| IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories, such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., the go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, specific region, or an international organization. The ccTLD normalized by continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | N/A | The date of the attack performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

S: similarity measure; C: clustering processing. O: used in the process; N/A: not used.


\[
\text{Similarity score} = \sum_{i=1}^{cv} \left[ \operatorname{distance}\left(RC_{cv}, TC_{cv}\right) \times \text{weight}_{cv} \right],
\qquad cv = \text{case vector (i.e., encoding, IP address, domain, date, and OS)}. \tag{1}
\]

There are various approaches to setting the weights of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be updated periodically to reflect recent attack trends. However, for the initial setting, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since the encoding makes it possible to infer static location information about the attacker, we defined the encoding as high-quality information. The IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

Figure 5: Normalization of each feature element. The raw values of each feature are mapped to the following normalized groups.

Encoding:
  Arabic: ISO-8859-6, Windows-1256
  Baltic: ISO-8859-13, Windows-1257
  Central Europe: ISO-8859-2, Windows-1250
  Chinese: GB2312, GB18030, GBK
  Cyrillic: ISO-8859-5, Windows-1251
  Greek: ISO-8859-7, Windows-1253
  Hebrew: ISO-8859-8, Windows-1255
  Japanese: ISO-2022-JP, EUC-JP, ShiftJIS
  Korean: ISO-2022-KR, EUC-KR
  Southern Europe: Windows-1253
  Taiwanese: Big5, EUC-TW, Eten
  Thailand: ISO-8859-11, Windows-874
  Turkish: ISO-8859-9, Windows-1254, IBM857
  West Europe: ISO-8859-1, Windows-1252

gTLD:
  com: com, co, int
  edu: edu, ac
  gov: gov, go, gob
  org: org, or
  biz: biz
  mil: mil
  net: net
  coop: coop
  info: info

ccTLD:
  Africa: gn, jm, ke, ao, bw, cf, ls, mz, tz, ug, yt, zw, etc.
  Australia: au, pg, nz, cc, ck, fj, gu, ki, nu, sb, vu, wf, etc.
  Central Asia: kz, uz, tm, tj, kg, af, am, tr
  East Asia: kr, cn, jp, tw, hk, kp, sg
  Eastern Europe: ru, by, al, lv, ua, pl, sk, hu, ee, md, ro, mk, etc.
  North America: us, bz, lc, ai, bm, gd, hn, ky, mx, ni, pa, sv, tt, vi, etc.
  Northern Europe: no, dk, lv, lt, se, ax, fi, gl, is
  South America: br, sr, ar, cl, do, ec, fk, gf, py, uy, ve, etc.
  South Asia: in, np, bt, pk, lk, id, mn, mo, my, ph, tl, etc.
  Southeast Asia: la, bu, vn, kh, th
  Southern Europe: gr, mk, sm, ad, va, ba, es, it, pt, rs, hr, si, li, bg, etc.
  West Asia: sa, ae, kw, bh, az, in, ir, jo, lb, om, qa, ye, etc.
  Western Europe: fr, ie, be, gl, lu, dk, ad, im, nl, uk, je, gg, etc.

OS:
  Linux-based OS: Linux, FreeBSD, Avtech, etc.
  MacOS: MacOS, MacOSX
  Unix-based OS: Unix, AIX, Compaq Tru64, etc.
  Windows-based OS: Windows series, Windows server series

Algorithm 1: Similarity measure module.

Input: TCs (Tested_DB)  /* Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC}  /* RC means one of the retrieved cases */
       W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

TC{Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
while RC in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end
    IP_RC{Octet A_RC, Octet B_RC, Octet C_RC, Octet D_RC}, IP_TC{Octet A_TC, Octet B_TC, Octet C_TC, Octet D_TC}
    if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) and (Octet D_RC = Octet D_TC) then
        IP_similarity_value ← 1.0
    else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) then
        IP_similarity_value ← 0.75
    else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) then
        IP_similarity_value ← 0.5
    else if (Octet A_RC = Octet A_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end
    Domain_RC{ServiceName_RC, gTLD_RC, ccTLD_RC}, Domain_TC{ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) and (gTLD_RC = gTLD_TC) and (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) and (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end
    Date_variance ← |Date_RC − Date_TC|  /* It converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a numeric day count */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end
    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end
    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity score between RC and TC
end while
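Algorithm 1 and the weighted sum in equation (1) can be sketched in Python as follows. The thresholds and discrete values are taken from Algorithm 1; the weights in `WEIGHTS` are hypothetical placeholders (the values actually used are those of Table 2), and the `Case` record and function names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    encoding: str
    ip: str            # dotted IPv4 string, e.g. "192.0.2.10"
    domain: str        # full host name as recorded
    service_name: str
    gtld: str
    cctld: str
    attack_date: date
    os: str


# Hypothetical weights (encoding high, IP/domain medium, date/OS low).
WEIGHTS = {"encoding": 0.30, "ip": 0.20, "domain": 0.20, "date": 0.15, "os": 0.15}


def ip_similarity(a: str, b: str) -> float:
    """0.25 for each leading octet shared by both addresses."""
    shared = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        shared += 1
    return 0.25 * shared


def domain_similarity(rc: Case, tc: Case) -> float:
    """Discrete domain score; empty fields compare equal (an assumption)."""
    if rc.domain == tc.domain:
        return 1.0
    same_name = rc.service_name == tc.service_name
    same_gtld = rc.gtld == tc.gtld
    same_cctld = rc.cctld == tc.cctld
    if same_name and same_gtld and same_cctld:
        return 0.8
    if same_gtld and same_cctld:
        return 0.3
    if same_name or same_cctld or same_gtld:
        return 0.1
    return 0.0


def date_similarity(a: date, b: date) -> float:
    """Step function of the gap in days between the two attack dates."""
    gap = abs((a - b).days)
    for limit, score in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return score
    return 0.0


def similarity_score(rc: Case, tc: Case, w=WEIGHTS) -> float:
    """Weighted sum of the per-feature similarity values."""
    parts = {
        "encoding": 1.0 if rc.encoding == tc.encoding else 0.0,
        "ip": ip_similarity(rc.ip, tc.ip),
        "domain": domain_similarity(rc, tc),
        "date": date_similarity(rc.attack_date, tc.attack_date),
        "os": 1.0 if rc.os == tc.os else 0.0,
    }
    return sum(parts[k] * w[k] for k in parts)


def rank_cases(rc: Case, case_db: list) -> list:
    """Compare RC with every TC and sort by descending score."""
    scored = [(similarity_score(rc, tc), tc) for tc in case_db]
    return sorted(scored, key=lambda p: p[0], reverse=True)
```

Because the hypothetical weights sum to 1, the score stays within the range from 0 to 1. `rank_cases` returns the stored cases ordered from most to least similar, which corresponds to the retrieve phase of the CBR cycle.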

The distance for some case vectors cannot be directly estimated, as they mix numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. An IP address is composed of four octets separated by ".". In the present study, we checked whether the octets of RC and TC, from the 1st octet to the 4th octet, were identical. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are given in Table 2. The proposed approach is advantageous in that it enables efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, .com and .co are used for commercial companies or organizations; .org and .or are used for nonprofit organizations; and .go and .gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique code that represents a specific region, such as .kr, .cn, .br, and .uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is chosen to reflect the characteristics of the groups, organizations, or corporations that the gTLD is intended for. Service names are diverse and differ depending on the gTLD category, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as given in Table 2.
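To compare domains in this way, each host name must first be split into its service name, gTLD, and ccTLD. The following sketch rests on two assumptions that are not stated in the text: a final two-letter label is treated as the ccTLD, and the label before it counts as a gTLD only if it appears in the label set of Figure 5. The names `GTLD_LABELS` and `split_domain` are hypothetical.

```python
# gTLD-style labels drawn from the groups in Figure 5.
GTLD_LABELS = {"com", "co", "int", "edu", "ac", "gov", "go", "gob",
               "org", "or", "net", "mil", "biz", "info", "coop"}


def split_domain(host: str) -> dict:
    """Split a host name into service name, gTLD, and ccTLD parts."""
    labels = host.lower().rstrip(".").split(".")
    cctld = gtld = ""
    # Assumption: a trailing two-letter alphabetic label is a country code.
    if len(labels) > 1 and len(labels[-1]) == 2 and labels[-1].isalpha():
        cctld = labels.pop()
    # Assumption: the next label is a gTLD only if it is a known one.
    if len(labels) > 1 and labels[-1] in GTLD_LABELS:
        gtld = labels.pop()
    service = labels[-1] if labels else ""
    return {"service_name": service, "gtld": gtld, "cctld": cctld}
```

For example, `split_domain("www.example.co.kr")` yields the service name `example`, the gTLD `co`, and the ccTLD `kr`; host names that carry only a country code leave the gTLD field empty.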

Furthermore, we defined the attack date similarity. As in an offline criminal investigation, if the times of occurrence are close, the cases can be analysed as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and the existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the attack IP address, domain, and attack date were each set to a value between 0 and 1 according to the degree of similarity within each range.
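The date rule can be sketched in the same way. The band limits (6, 18, 30, and 42 months on either side) are taken from Table 2; the function name `date_similarity` and the conversion of a day gap into months with an average month length are assumptions made for this illustration. The dates used are those listed for the DS case, Hmei7, and StifLer in Table 4.

```python
from datetime import date

# (maximum gap in months, similarity value), following the date rows of Table 2.
DATE_BANDS = [(6, 1.0), (18, 0.75), (30, 0.5), (42, 0.25)]
AVG_DAYS_PER_MONTH = 30.4375  # assumption: 365.25 / 12


def date_similarity(rc_date: date, tc_date: date) -> float:
    """Discrete similarity from the absolute gap between two attack dates."""
    gap_months = abs((rc_date - tc_date).days) / AVG_DAYS_PER_MONTH
    for limit, value in DATE_BANDS:
        if gap_months <= limit:
            return value
    return 0.0


ds = date(2013, 3, 20)
print(date_similarity(ds, date(2014, 2, 6)))   # gap of roughly 10.6 months
print(date_similarity(ds, date(2013, 6, 8)))   # gap of roughly 2.6 months
```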

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarities among the potential features of incidents. Hence, an advanced tool that captures the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a parameter; instead, it automatically determines a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that a data item belongs to each cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to the 80,948 data items, out of 212,093, that have encoding, gTLD, ccTLD, and OS information. The character encoding was normalized into groups of related code sets (ISO-8859, MS Windows character sets, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. As for the service name, even if similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them; therefore, it is not suitable as a similarity measure for the reasoning engine. Consequently, the characteristics and metadata of 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
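To make the clustering step more concrete, the following is a small sketch of EM for a mixture of independent categorical features, which is one common way to cluster nominal attributes such as encoding, gTLD, ccTLD, and OS. It is not the tool used in the study: the function `em_categorical` is hypothetical, the number of clusters `k` is fixed by the caller rather than chosen by cross-validation, and the eight toy rows are invented.

```python
import math
import random
from collections import defaultdict


def em_categorical(rows, k, iterations=30, seed=0):
    """Fit a mixture of independent categorical features with plain EM.

    rows: list of equal-length tuples of nominal values.
    Returns (responsibilities, log-likelihood after each iteration).
    """
    rng = random.Random(seed)
    n = len(rows)
    # Start from random soft assignments: resp[i][c] ~ P(cluster c | row i).
    resp = []
    for _ in range(n):
        w = [rng.random() + 1e-3 for _ in range(k)]
        resp.append([x / sum(w) for x in w])

    log_liks = []
    for _ in range(iterations):
        # M-step: cluster masses and weighted value counts per feature.
        mass = [sum(r[c] for r in resp) for c in range(k)]
        counts = [defaultdict(float) for _ in range(k)]
        for row, r in zip(rows, resp):
            for c in range(k):
                for j, value in enumerate(row):
                    counts[c][(j, value)] += r[c]

        # E-step: posterior cluster probabilities under the new parameters.
        resp, total_ll = [], 0.0
        for row in rows:
            joint = []
            for c in range(k):
                p = mass[c] / n  # mixing weight of cluster c
                for j, value in enumerate(row):
                    p *= counts[c][(j, value)] / mass[c] if mass[c] > 0 else 0.0
                joint.append(p)
            total = sum(joint)
            total_ll += math.log(total)
            resp.append([p / total for p in joint])
        log_liks.append(total_ll)
    return resp, log_liks


# Invented rows: (encoding group, gTLD, ccTLD region, OS).
rows = [("latin", "com", "west-eu", "linux")] * 4 + [("gb", "net", "east-asia", "windows")] * 4
resp, log_liks = em_categorical(rows, k=2)
print([round(r[0], 3) for r in resp])
```

Each iteration re-estimates the per-cluster value frequencies from the current soft assignments and then recomputes the assignments, so the log-likelihood never decreases. A model-selection loop over `k`, such as the cross-validation mentioned above, would wrap this function.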

Each donut chart shows the features from the outside to the inside, with the share of each feature value distinguished by a colour code within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 indicates the share of cases that a cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were Clusters 4 and 7, which share the use of Arabic and Chinese and attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help concretize the similarity between new and existing cases. Naturally, if a large number of new cases flow into the database and the clustering process is performed again on the updated dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. Because the assumption that attackers tend to use similar or unique attack methods is not always valid, it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present


Table 2: Value and the weight for the similarity score by the case vector. All of the values of the similarity score are normalized to between 0 and 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g., 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |
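The weights in Table 2 combine the per-vector similarity values into one score as a weighted sum, in line with the expression shown in Figure 7. The sketch below is illustrative only: the names `WEIGHTS` and `weighted_similarity` are hypothetical, and the component values for the DS case against the Hmei7 case are our own reading of the Table 4 metadata (matching encoding group and OS, only the first IP octet shared, only the gTLD shared, and a date gap of under 18 months).

```python
# Weights per case vector, as listed in Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def weighted_similarity(components: dict) -> float:
    """Weighted sum of per-vector similarity values, each between 0 and 1."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing case vectors: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Assumed component values for the DS case versus the Hmei7 case.
ds_vs_hmei7 = {"encoding": 1.0, "ip": 0.25, "domain": 0.1, "date": 0.75, "os": 1.0}
print(round(weighted_similarity(ds_vs_hmei7), 3))  # 0.69
```

Under these assumed component values, the sum is 0.5 + 0.05 + 0.015 + 0.075 + 0.05 = 0.69, which agrees with the similarity listed for Hmei7 in Table 4. Because the encoding weight is 0.5, a mismatch in the encoding group alone caps the score at 0.5.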

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage networks and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage networks and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and Windows-based OS in East Asia. | 1923Turk |


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology using the ratio of correctly identified hackers. The testing procedure unfolded in the following four steps, which are depicted in detail in Figure 7, where "K" denotes all hackers within the database.

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

[Figure 6 panels: twelve donut charts, Clustering 00 through Clustering 11, each with a 0 to 100 percent scale. Legend: Encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).]

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where "m" denotes the past cases that are within the defined scope for a randomly selected hacker "k".

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores, computed from the value and the weight for the similarity score by the case vector (see Table 2), were sorted in descending order. Cases with a similarity value of 0 were not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$
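The scoring check in Step 4 can be illustrated with a short sketch. It assumes one reading of the procedure: cases with a score of 0 are dropped from the ranked list, the top N scope is the first N percent of that list, and the ratio R for a hacker is the share of all of that hacker's past cases that fall inside the scope. The function names and the toy scores are hypothetical.

```python
import math


def ratio_in_top_scope(scored_cases, hacker, top_percent):
    """Share of a hacker's past cases that fall within the top N percent.

    scored_cases: list of (hacker_name, similarity_score) for every TC.
    """
    # Cases with a similarity value of 0 are not shown on the scoring list.
    positive = [c for c in scored_cases if c[1] > 0]
    ranked = sorted(positive, key=lambda c: c[1], reverse=True)
    cutoff = math.ceil(len(ranked) * top_percent / 100)
    in_scope = sum(1 for name, _ in ranked[:cutoff] if name == hacker)
    total = sum(1 for name, _ in scored_cases if name == hacker)
    return in_scope / total if total else 0.0


def is_identified(scored_cases, hacker, top_percent, min_ratio):
    """A hacker counts as identified when the ratio R reaches the criterion."""
    return ratio_in_top_scope(scored_cases, hacker, top_percent) >= min_ratio


# Invented scores for three hypothetical hackers A, B, and C.
scored = [("A", 0.9), ("B", 0.8), ("A", 0.7), ("B", 0.6), ("A", 0.5),
          ("C", 0.4), ("C", 0.3), ("B", 0.2), ("C", 0.1), ("A", 0.0)]
print(ratio_in_top_scope(scored, "A", top_percent=30))  # 0.5
```

In this toy list, nine cases have a positive score, so the top 30 percent covers the three highest-ranked cases; two of the four cases of hacker A are among them, which gives a ratio of 0.5.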

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by high similarity score) was included in the top N scope. This process checked, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were expressed as ratios according to the defined measurement rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases were thus placed within the range of the top N scope (see Figure 9). Also, in the case of the hacking case ratio of a randomly

[Figure 7 diagram: Step 1, selection (100 hackers randomly selected from the database: 1. TheBuGz, 2. Hmei7, 3. 3xp1r3, ..., 98. S3cure, 99. drm1st3r, 100. Lulz53c); Step 2, case labelling (Case 1 through Case m for each hacker); Step 3, case extraction (a retrieved case, i.e., the most recent case, compared against the cases-centric DB); Step 4, scoring (columns: hacker name, date, encoding, IP address, domain, OS, score; score computed as the sum over case vectors cv of Distance(RC_cv, TC_cv) × Weight_cv).]

Figure 7: The developed testing procedures from step 1 to step 4.


selected hacker, some part of the hacker's past attack cases (i.e., ratio R) fell within the defined N scope (see Figure 9).

Figure 10 shows the number of hackers identified for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of hackers identified in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of hackers identified in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because hackers or hacking groups that attacked only the same or similar targets were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of candidate hackers and their cases and suggest the potential top N percent of candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the

[Figure 9 diagram: Step 4, scoring, for hacker 1 (TheBuGz), showing the top N scope (1 to 30) and the ratio R (50 to 100) over the scoring list with columns hacker name, date, encoding, IP address, domain, OS, and score.]

Figure 9: Scoring step on the top N scope and the ratio R.

[Figure 8 plot: X-axis, hacker (0 to 100); Y-axis, number of cases (0 to 5000).]

Figure 8: The number of website defacement attack cases in the past of each hacker.


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The image of three calaveras (i.e., skulls) used on LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI" was the word written on the KBS homepage; it denotes the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding the colour, skull image, and misspellings, the image used in the SPE case (Figure 11(c)) had characteristics similar to those of the image used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

To evaluate the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach is consistent with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as input cases, we considered a direct comparison between the DS and SPE cases for the similarity score to be inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages, based on the Latin language system, and typically launched attacks against profit organizations located in Western Europe. Notifiers Oaddah, MTRiX, and EL_MuHaMMeD were all classified

[Figure 10 plot: X-axis, criterion of the top N (%): Top 1, Top 3, Top 5, Top 10, Top 15, Top 20, Top 25, Top 30; Y-axis, number of identified hackers (0 to 100); series, ratio of the attack cases (%): 50, 60, 70, 80, 90, 100.]

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers.


as the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole set of cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. To produce the distribution, the calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. In the DS case, the top similarity score was 0.69, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
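The random-pair baseline described above can be sketched as a simple sampling loop. The function `random_pair_baseline`, the fixed seed, and the toy similarity function are assumptions made for illustration; in practice, `similarity` would be the weighted similarity score of equation (1) and `cases` the cases-centric DB.

```python
import random
import statistics


def random_pair_baseline(cases, similarity, n_pairs=10_000, seed=0):
    """Mean and variance of the similarity score over randomly drawn case pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(cases, 2)  # two distinct cases per draw
        scores.append(similarity(a, b))
    return statistics.mean(scores), statistics.pvariance(scores)


# Toy stand-in: ten cases and a similarity of 1 when two cases share a parity.
toy_cases = list(range(10))
mean, var = random_pair_baseline(toy_cases, lambda a, b: 1.0 if a % 2 == b % 2 else 0.0)
print(round(mean, 3), round(var, 3))
```

Comparing a case's top similarity score against the mean of such a baseline indicates whether the score is high relative to what two unrelated cases would typically obtain.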

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score among the hacker's own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

[Figure 11 panels: (a), (b), (c).]

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case, and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case name (notifier) | Retrieved case: DarkSeoul (DS) | Tested case: Hmei7 | Tested case: d3b_X | Tested case: StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21com | http://www.garycheng.com | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case name (notifier) | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case: Oaddah | Tested case: MTRiX | Tested case: EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator understand the flow and core of the crime clearly by laying out multidimensional evidence that is otherwise entangled or hidden.

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may be degraded if the properties describing the case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to

[Figure 12 plots: (a) X-axis, similarity score; Y-axis, frequency (0 to 600); Mean = 0.2930, Var = 0.0866. (b) Panels A and B: X-axis, similarity score; Y-axis, frequency (0 to 40000); the highest similarity score is 0.69 for the DarkSeoul case and 0.615 for the Sony Pictures Entertainment case.]

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and distribution of the similarity value between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system.

[Figure 13 plot: X-axis, mean value of the similarity score (0.00 to 0.75); Y-axis, frequency (0 to 6).]

Figure 13: Distribution of the similarity score for randomly selected 100 hackers.


improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents are included in our study. For instance, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing using the CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning. Also, based on a large real dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to the case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

[Figure 14 alluvial diagrams, with columns hacker, year, ccTLD, gTLD, and attack. (a) Hackers: d3b~x, Hmei7, StifLer; years: 2009, 2012, 2013, 2014; ccTLD: Australia, Brazil, France, Germany, Indonesia, Italy, Korea, Netherlands, New Zealand, Poland, Russia, Thailand, Turkey, Unknown; gTLD: com, gov, net, org, Unknown; attack: No, Yes. (b) Hackers: EL_MuHaMMeD, MTRiX, oaddah; years: 2002, 2003, 2009, 2012; ccTLD: Brazil, Canada, Denmark, France, Greece, Hong Kong, Italy, Unknown; gTLD: com, net, org, Unknown; attack: No, Yes.]

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b).


and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed is an important factor in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be reviewed. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and sources may also help to reason about the constituent elements of a crime and narrow the scope of investigation. Furthermore, the proposed methodology can be extended to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

The web resources gathered from the zone-h.org site contain features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files. Although these features are limited to only a small number of hackers in the web resources, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more precisely using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied in our paper can be downloaded from the following site: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, ‘Shamoon’ cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.

(iv) We report case studies based on the real dataset gathered from the zone-h.org site to demonstrate the various aspects of our proposed algorithm. Finally, to foster further research, our dataset (the dataset for cybercrime investigation focused on the data-driven website defacement analysis: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling) is made publicly available [15, 16].

The rest of the paper is organized as follows. Section 2 provides a summary of the literature related to our work. The detailed methodology is described in Section 3. Section 4 reports the experimental results and analysis based on the case study. The limitations of our work and a discussion on the proposed approach are presented in Section 5. Finally, Section 6 concludes the paper and suggests directions for further research.

2. Related Work

In this section, we primarily highlight previous studies closely related to CBR and review two streams of literature on traditional criminal profiling and cybercrime profiling. We also elaborate on data mining-based cybercrime profiling. Specifically, we cover the following: (1) CBR studies that help to better understand our research context; (2) a review of traditional criminal profiling and cybercrime profiling, which helps in tracking down an elusive criminal or a concealed clue; and (3) the data mining-based cybercrime profiling literature that can support and theoretically reinforce our methodology.

2.1. Case-Based Reasoning. CBR is a method that uses past experiences or cases to solve new problems. Even when the new problems are not exactly identical to the previous cases, CBR can suggest a partial solution to the new problems [17]. CBR can be categorized as a data-mining technique, as it can classify the given samples and predict the result for a new case. As case studies are intuitive and easily understood by humans, CBR has long been used in many fields, including customer technical support, medical case search, and legal case search. The general model of the four-step CBR process [18] is shown in Figure 2.

The four phases are as follows:

(i) Retrieve: given a new website defacement case, relevant cases are retrieved from the knowledge base to solve the case at hand.

(ii) Reuse: solutions from previous website defacement cases are mapped for reuse.

(iii) Revise: on mapping and testing previous solutions to the target case, the solutions are revised to account for the changes between the cases.

(iv) Retain: after the solution has been successfully adapted to the problem, the meaningful experience is stored as a new case in the knowledge base.
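The four phases above can be sketched as a small loop. This is a minimal illustration only, not the authors' implementation: the `CaseBase` class, the `overlap` similarity, the `revise` callback, and the "group-A"/"group-B" labels are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


def overlap(a, b):
    """Hypothetical similarity: share of common features with equal values."""
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys) if keys else 0.0


@dataclass
class CaseBase:
    """Toy knowledge base of (case, solution) pairs."""
    entries: list = field(default_factory=list)

    def retrieve(self, target, k=1):
        # Retrieve: rank stored cases by similarity to the target case.
        ranked = sorted(self.entries, key=lambda e: overlap(target, e[0]), reverse=True)
        return ranked[:k]

    def retain(self, case, solution):
        # Retain: store the confirmed experience as a new case.
        self.entries.append((case, solution))


def solve(case_base, target, revise):
    """Run one retrieve/reuse/revise/retain cycle for a new target case."""
    retrieved = case_base.retrieve(target)
    if not retrieved:
        return None
    _, proposed = retrieved[0]            # Reuse: take the nearest case's solution.
    confirmed = revise(target, proposed)  # Revise: adapt it to the target case.
    case_base.retain(target, confirmed)   # Retain: keep it for later cases.
    return confirmed


kb = CaseBase()
kb.retain({"encoding": "euc-kr", "os": "Linux"}, "group-A")
kb.retain({"encoding": "windows-1254", "os": "Win 2003"}, "group-B")
answer = solve(kb, {"encoding": "euc-kr", "os": "Linux"}, lambda target, s: s)
```

In practice the revise step would be carried out by an investigator or a domain rule rather than the identity function used here.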

CBR starts with a given set of cases for training, forms generalizations of the given examples, and subsequently identifies the commonalities between the retrieved case and the target case. When applied to website defacement cases composed of descriptive and nominal data, it can effectively determine the commonality among the crawled hacking cases and quickly search for the nearest related case. Furthermore, CBR can be used to search for the most similar cases and retrieve past solutions from the latest response cases. CBR thus helps security administrators make better decisions. For example, Kim et al. proposed a DSS for incident response based on CBR [19].

CBR has been extensively used in several areas, such as management for product development, medicine, and engineering applications [20–22]. In addition, several CBR approaches are available for cyber incident profiling. For instance, Kim et al. proposed an intelligent system that can measure the similarity between past and new attacks. In their work, the authors demonstrated this capability in uncovering zero-day attacks using string similarity analysis of captured packet-level data [23]. Horsman et al. proposed the CBR-FT framework, a method for collecting and reusing past digital forensic investigation information to highlight likely evidential areas on a suspect operating system. It helps an investigator decide quickly and precisely where to search for evidence [24].

Figure 1: Timeline of the Lazarus group activities from 2009 to 2017.

2.2. Traditional Criminal Profiling and Cybercrime Profiling. Profiling is used in various sectors of society to investigate a criminal’s mentality. Criminal profiling is a technique for criminal investigation based on the psychological and behavioural patterns of a criminal [25, 26]. The criminal aspects and crime factors can be identified through evidence and insights into psychological and behavioural bias [27]. In criminology, the widely used profiling technique is called the Modus Operandi (MO). It describes a suspect’s behaviour and the evidence elements in a crime; that is, it captures how a suspect commits their crimes. The Modus Operandi changes based on the offender’s criminal conduct and interaction with the surroundings, such as the time, date, and location of the crime. Moreover, it evolves based on how the offender reaches his/her victim [28, 29].

Based only on traditional criminal profiling techniques and empirical knowledge, it is difficult for a cybercrime investigator to reduce errors in the investigative process and to untangle the complexity of a cybercrime. However, if the investigator is provided with sufficient information and detailed analysis data to understand the unclear motivation and the elusive pattern related to the cybercrime, they can infer the reason(s) for the crime at stake and produce both general and specific outlines of the criminal [26]. The cybercrime network and its characteristics can be important indicators for differentiating between key figures in cybercrime organizations and those of passing interest. In addition, the activity periods and message content patterns of the participants in an illegal community can help the investigator to carefully identify and scrutinize the key figure in the cybercrime network [30, 31]. By automating cybercrime profiling and data-mining analysis through a cross-analysis of various behavioural patterns, we can anticipate potential criminal activities and identify new profiles that pose serious threats to the community. Furthermore, data-mining methods such as entity extraction, clustering/classification techniques, and social network analysis make it possible to explore large data efficiently. Network visualization enables an investigator to intuitively recognize the crime pattern [32–34].

In general, the accuracy of CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate [35]. Although the effectiveness of data-driven investigation can decrease owing to dynamic and fast-evolving crime patterns, understanding the hidden correlations and latent behaviour in such data using large-scale data analytic techniques is another promising research direction. Accordingly, many law enforcement agencies have been adopting future crime prediction systems based on statistics about weather, cleanliness, location, demographic distribution, education level, and wealth level. Based on the crime prevention through environmental design (CPTED) theory [36], many pieces of data correlated with crime are collected and analysed to estimate the crime probability. However, while many data-driven approaches to support traditional criminal profiling are available, only a few research efforts have focused on cybercrime profiling.

2.3. Data-Driven Cybercrime Profiling. In addition to the traditional criminal profiling for offline crime investigations, various profiling techniques have been developed; in these techniques, it is assumed that cybercriminals also show similar behavioural and psychological characteristics. Owing to the recent advances in data-mining and machine-learning algorithms, many studies regarding criminal pattern detection, classification, and clustering have emerged. The methods used in these studies include, among others, entity extraction, clustering, association rule mining, deviation detection, classification, and social network analysis. A combination of a traditional method and a newer method enables pattern identification from both structured and unstructured data. For instance, entity extraction is used to understand concealed patterns in data such as texts, images, and audio data. Furthermore, clustering is used to group objects into classes with similar characteristics [37, 38]. In addition, unsupervised methods such as the self-organizing map (SOM) are used to support the results of traditional criminal profiling [39]. In cases where the criminal and the related cases are known, supervised learning is applied [40]. However, although many advances have occurred in big data analytics and machine learning, these approaches are limited in supporting real-time processing, as they require high computing power to handle a large volume of training data. In fact, the large volume of crime data is a considerable challenge for the investigator, both in gaining an appropriate understanding of a complicated relationship and in responding in a timely manner. Despite these limitations, data mining yields valid, useful, and appropriate results. Data preprocessing, such as data cleaning, data integration, and data transformation, aims to reduce noisy, incomplete, and inconsistent data, and it helps to uncover and conceptualize concealed or latent crime patterns. By improving the efficiency of crime data understanding and reducing errors in the results afforded by the data-mining method, the investigator can perform reasoning, timely judgment, and quick problem solving [41].

Figure 2: CBR process used in cybercrime investigation, employing a knowledge base that is reused over similar new cases and retained for later use.

CBR is also used to provide the reasoning power to search for similar previous cases [25, 42]. However, biased or imperfect collected data deteriorate the quality of the decision support provided by CBR. Therefore, in many cases, the weights of the selected features are set based on empirical knowledge, which can subsequently be used to enable the detection and analysis of crime patterns from temporal crime activity data. Using clustering and classification techniques, as well as speculative models for searching for similar past crime cases, investigators can easily extract useful information from an unstructured textual dataset [43]. Hence, investigators must collect and continuously update comprehensive crime data.

Clustering is the task of identifying groups of similar items in the data and is a form of unsupervised learning. Zulfadhilah et al. compared four types of clustering algorithms (K-means, hierarchical clustering, SOM, and the Expectation Maximization (EM) algorithm) based on their performance. They concluded that the K-means algorithm and the EM algorithm perform better than the hierarchical clustering algorithm. In general, partitioning algorithms such as K-means and EM are highly recommended for large datasets [44]. In summary, clustering algorithms can help the investigator detect crime patterns and accelerate crime solving. A weighting scheme for attributes can address the limitations of clustering techniques [45].
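To make the EM idea concrete for nominal data, the sketch below fits a mixture of independent categorical distributions with EM. This is an assumption about one reasonable EM variant for nominal features, not necessarily the variant used in the cited studies or in this paper. The function names, the smoothing constant `alpha`, the farthest-first seeding, and the toy cases are all hypothetical.

```python
import math


def hamming(a, b):
    """Number of positions where two nominal tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_seeds(cases, k):
    """Farthest-first seeding; deterministic for a fixed case order."""
    seeds = [cases[0]]
    while len(seeds) < k:
        seeds.append(max(cases, key=lambda c: min(hamming(c, s) for s in seeds)))
    return seeds


def em_cluster(cases, k, n_iter=25, alpha=0.5):
    """Cluster tuples of nominal values with a categorical mixture model."""
    n, d = len(cases), len(cases[0])
    values = [sorted({c[j] for c in cases}) for j in range(d)]
    seeds = pick_seeds(cases, k)
    pi = [1.0 / k] * k
    # theta[z][j][v]: smoothed probability of value v for feature j in cluster z.
    theta = [[{v: ((1.0 if v == seeds[z][j] else 0.0) + alpha)
                  / (1.0 + alpha * len(values[j])) for v in values[j]}
              for j in range(d)] for z in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each case.
        resp = []
        for c in cases:
            logp = [math.log(pi[z]) + sum(math.log(theta[z][j][c[j]]) for j in range(d))
                    for z in range(k)]
            top = max(logp)
            w = [math.exp(lp - top) for lp in logp]
            total = sum(w)
            resp.append([x / total for x in w])
        # M-step: re-estimate mixing weights and per-feature value probabilities.
        for z in range(k):
            nz = sum(r[z] for r in resp)
            pi[z] = (nz + alpha) / (n + alpha * k)
            for j in range(d):
                for v in values[j]:
                    count = sum(r[z] for r, c in zip(resp, cases) if c[j] == v)
                    theta[z][j][v] = (count + alpha) / (nz + alpha * len(values[j]))
    return [max(range(k), key=lambda z: r[z]) for r in resp]


# Toy cases: (encoding, gTLD group, continent, OS); values are illustrative only.
cases = [
    ("euc-kr", "commercial", "Asia", "Linux"),
    ("euc-kr", "commercial", "Asia", "Linux"),
    ("euc-kr", "organization", "Asia", "Windows"),
    ("windows-1254", "commercial", "Europe", "Windows"),
    ("windows-1254", "commercial", "Europe", "Windows"),
    ("windows-1254", "organization", "Europe", "Linux"),
]
labels = em_cluster(cases, k=2)
```

Unlike K-means, EM yields soft responsibilities per case. The hard labels returned here are only the most probable cluster for each case.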

3. Methodology

In this section, we present the detailed scheme of the decision support methodology for cybercrime investigation, with a focus on website defacement cases. A conceptual framework and its process are illustrated in Figure 3. The scheme proceeds in three steps: data preprocessing, case vector design, and the reasoning engine. First, we provide a brief outline of the dataset and describe the merits of the website defacement data. We also summarize the preprocessing for data parsing and cleaning with regard to the collected data types. Next, we design the case vector and choose the significant features for the reasoning step. Finally, the reasoning engine provides various functionalities and is intended for grouping (clustering) cases based on their similarity.

3.1. Preprocessing. As part of the proposed analytical framework, we developed a crawler to automatically collect 212,093 website defacement cases from the zone-h.org site. Many website defacement cases are recorded daily in the archive page of the zone-h.org site. Each case registered in the archive page provides information (i.e., IP address, Domain, Date, OS, Notifier, and Web server) in the same format through its mirror page. First, the crawler collects all public information relevant to each case. Thereafter, on accessing the domain site, it saves the data in the raw format of the HTML source. After the raw web resources are crawled, data preprocessing is performed to amend incomplete, improperly formatted, or duplicate data records. More specifically, there are various tag attributes in the HTML source. The encoding and font data are extracted through the <charset> and <font-style> tags of the HTML elements set between the <head> and </head> tags in the HTML source. In addition, the image, the sound file, and the linked site are extracted through the <font-family>, <img>, and <href> tags set between the <body> and </body> tags in the HTML source. The web resources, as original raw data, were parsed and cleaned depending on the relevant case vector (see Figure 4). After the data were cleaned, some significant data fields were selectively stored in the system’s case database.
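The extraction step described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions, not the authors' crawler: it assumes the encoding appears in a `<meta>` element, fonts appear in `<font face>` or inline `font-family` styles, and images and links appear as `<img src>` and `<a href>`. The class name and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser


class DefacementPageParser(HTMLParser):
    """Collects the encoding, fonts, images, and links from raw HTML."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.fonts, self.images, self.links = [], [], []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            if a.get("charset"):
                self.charset = a["charset"].lower()
            else:
                # Older pages declare the encoding inside the content attribute.
                m = re.search(r"charset=([\w-]+)", a.get("content") or "", re.I)
                if m:
                    self.charset = m.group(1).lower()
        if tag == "font" and a.get("face"):
            self.fonts.append(a["face"])
        m = re.search(r"font-family\s*:\s*([^;]+)", a.get("style") or "", re.I)
        if m:
            self.fonts.append(m.group(1).strip())
        if tag == "img" and a.get("src"):
            self.images.append(a["src"])
        if tag == "a" and a.get("href"):
            self.links.append(a["href"])


def parse_page(html):
    """Return the fields of interest as a plain dictionary."""
    parser = DefacementPageParser()
    parser.feed(html)
    return {"encoding": parser.charset, "fonts": parser.fonts,
            "images": parser.images, "links": parser.links}


sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=EUC-KR"></head>'
    '<body style="font-family: Gulim, sans-serif">'
    '<img src="flag.png"><a href="http://example.org/">greetz</a>'
    "</body></html>"
)
fields = parse_page(sample)
```

Records with a missing encoding (`None`) or empty lists would then be handled in the cleaning step.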

The selected data fields relate to the website defacement date, related IP address, target domain, target system OS, and web server version; these aspects have proven to be useful for cyberattack investigations [46]. Specifically, the encoding method and the font contained in the HTML source were necessary to speculate on the attacker’s regional information. For example, if messages remaining on a defaced website are written in ISO/IEC 8859 encoding, we can infer that the hacker’s language is German, Spanish, or Swedish. Furthermore, depending on whether all the messages are written in the same encoding method, the special characters used, such as ß, ñ, or å, can serve as a clue for guessing the attacker’s origin. In general, encodings from Windows-1250 to Windows-1258 are used for the central European languages as well as for Turkish, the Baltic languages, and Vietnamese. By contrast, GB encoding is used for Chinese, HKSCS encoding is used for Taiwanese, and EUC-KR or ISO-2022-KR encoding is used for Korean [47]. In addition to the font and encoding information, the text, image, audio, and video found in the messages are also necessary parameters for case identification.
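The encoding-to-region reasoning above can be written as a small lookup. The sketch below only encodes the examples listed in the text. The table name, the regular expressions, and the hint strings are hypothetical, and a real mapping would need many more entries.

```python
import re

# (pattern, hint) pairs follow the examples in the text; the list is not exhaustive.
ENCODING_HINTS = [
    (r"^iso-?8859", "Western European (e.g., German, Spanish, Swedish)"),
    (r"^windows-125[0-8]$", "Central European, Turkish, Baltic, or Vietnamese"),
    (r"^gb", "Chinese"),
    (r"hkscs", "Taiwanese"),
    (r"^(euc-kr|iso-2022-kr)$", "Korean"),
]


def language_hint(charset):
    """Map a declared charset to a coarse language hint, or 'unknown'."""
    name = (charset or "").strip().lower()
    for pattern, hint in ENCODING_HINTS:
        if re.search(pattern, name):
            return hint
    return "unknown"
```

A Unicode encoding such as UTF-8 carries no regional signal, so it falls through to "unknown". The hint is a clue about the attacker's origin rather than evidence of it.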

3.2. Case Vector Design. We designed two types of case vector, one for the similarity measure and one for the clustering processing. The case vector is summarized in Table 1. Features such as the font, web server, thanks-to, and notifier (hackers or hacking groups), as well as features such as the encoding, IP address, domain, attack date, and OS, were extractable from the public archival site zone-h.org. Generally, more diverse features can be a significant factor for investigating relationships and associations among hackers or hacking groups, as well as the scale and the density/intensity of the hacker community. However, such a premise has some shortcomings. The importance or weight of each feature may differ depending on the criterion. In addition, if all features are treated as important, machine-learning algorithms such as clustering or classification are difficult to run in practice because of the high computational cost of the analysis. Moreover, some features carry similar meanings and would be processed redundantly. To this end, dimensionality reduction and feature selection were performed in the present study. After a thorough review by security experts, the significant features were selected for the case vector of website defacement cases. The detailed explanation of the dimensionality reduction and the feature selection is as follows.

In the Windows operating system, if a specific font is not designated in the HTML code, for example through the <font-family> property, the characters on a web page may appear broken. In particular, some of the fonts used in the Chinese character cultural area depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, this characteristic may be key evidence for uncovering a correlation between the victim and the attacker; however, it is extremely rare in the collected website defacement cases. Therefore, it is not suitable as a case vector feature for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc., when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described above, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector feature for cybercrime investigation. Finally, although the thanks-to and notifier features can be used to analyse a hidden network among hackers and hacking groups, such a network analysis should be addressed in future research.

Figure 3: Proposed analytical framework for the data-driven website defacement cases (crawler, preprocessing, case vector design, and a reasoning engine with similarity and clustering modules over a cases-centric DB).

Figure 4: Sorted dataset through the preprocessing.

As a result, we defined the case vector in two versions: one for the similarity measure and one for the clustering processing. The encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used as features in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding is a feature that provides decisive clues about the attacker’s regional information. The IP address and domain give clues about the victim’s location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. A detailed explanation of the key features is provided in Table 1.
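The two versions of the case vector can be pictured as two views of one record. The sketch below is illustrative only: the class name, field names, and sample values are hypothetical, and Table 1 remains the authoritative definition of the features.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DefacementCase:
    """One website defacement case with the selected features."""
    encoding: str
    ip_address: str
    service_name: str
    gtld: str
    cctld: str
    attack_date: date
    os: str

    def similarity_vector(self):
        # Features used by the similarity measure.
        return (self.encoding, self.ip_address, self.service_name,
                self.gtld, self.cctld, self.attack_date, self.os)

    def clustering_vector(self):
        # Reduced feature set used by the clustering processing.
        return (self.encoding, self.gtld, self.cctld, self.os)


case = DefacementCase("euc-kr", "203.0.113.7", "example", "com", "kr",
                      date(2013, 3, 20), "Linux")
```

Keeping both views on one immutable record means the similarity and clustering steps cannot drift apart on how a case is stored.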

The normalization result of various feature elements stored in the raw form of the HTML source is presented in Figure 5. For the encoding, the ISO series and MS Windows series were normalized according to the encoding used in each region or country. The gTLD was normalized by grouping organizations with similar characteristics, and the ccTLD was normalized by continent. Although the compression and normalization of features make analyses such as the clustering processing and the similarity measure simple and clear, they may also cause a loss of information in the original data or make a detailed analysis more difficult.
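The normalization step described above can be sketched as a set of lookup tables. The following is only an illustration under stated assumptions: the tables contain a handful of entries transcribed from Figure 5 and Table 1 rather than the full mapping, and the names `ENCODING_GROUPS`, `GTLD_GROUPS`, `CCTLD_CONTINENTS`, and `normalize_case` are hypothetical, not taken from the authors' implementation.

```python
# Partial lookup tables (a few entries from Figure 5 and Table 1; not the full mapping).
ENCODING_GROUPS = {
    "iso-8859-6": "Arabic", "windows-1256": "Arabic",
    "iso-8859-2": "CentralEurope", "windows-1250": "CentralEurope",
    "iso-8859-5": "Cyrillic", "windows-1251": "Cyrillic",
    "iso-8859-9": "Turkish", "windows-1254": "Turkish",
    "iso-8859-1": "WestEurope", "windows-1252": "WestEurope",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
}
GTLD_GROUPS = {
    "gov": "gov", "go": "gov", "gob": "gov",
    "com": "com", "co": "com",
    "org": "org", "or": "org",
    "edu": "edu", "ac": "edu",
}
CCTLD_CONTINENTS = {
    "kr": "EastAsia", "cn": "EastAsia", "jp": "EastAsia",
    "br": "SouthAmerica", "ar": "SouthAmerica",
    "fr": "WesternEurope", "uk": "WesternEurope", "nl": "WesternEurope",
}


def normalize_case(encoding, gtld, cctld):
    """Map raw feature values to normalized groups; unmapped values become 'Unknown'."""
    def clean(value):
        # Ignore case, surrounding whitespace, and a leading dot (".gov" -> "gov").
        return value.strip().lower().lstrip(".")

    return (
        ENCODING_GROUPS.get(clean(encoding), "Unknown"),
        GTLD_GROUPS.get(clean(gtld), "Unknown"),
        CCTLD_CONTINENTS.get(clean(cctld), "Unknown"),
    )


# Usage: a Windows-1252 page on a Brazilian government domain.
print(normalize_case("Windows-1252", "gob", "br"))
```

Collapsing values in this way is what makes the clustering tractable, and it is also where the information loss mentioned above occurs: after normalization, "go" and "gob" can no longer be told apart.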

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance of nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, both through a better understanding of complicated relationships and through a timelier response. In the present study, we implemented the reasoning engine as two processing entities: the similarity measure processing and the clustering algorithm processing (see below for further details).

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we propose a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. To begin with, if one retrieved case (RC, a new case) is given and there are "n" cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between RC and the TCs is conducted "n" times. We defined the extent of similarity between RC and a TC as a numerical value from "0" to "1," where "0" means that RC and TC are unrelated and "1" means that RC and TC are identical. The similarity score S (0 ≤ S ≤ 1) specifies the extent of similarity between RC and TC: the closer the similarity score is to "1," the more analogous RC and TC are to each other. In the event of multiple case vectors, similarity can be expressed as a weighted sum of case vectors.

Table 1: Case vector design highlighting two groups of features.

| Case vector | S | C | Description |
|---|---|---|---|
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on MS Windows and the ISO character set. |
| IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories, such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, specific region, or an international organization. The ccTLD normalized by the continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | N/A | The attack date performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

S: similarity measure; C: clustering processing.


$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \operatorname{distance}\left(RC_{cv}, TC_{cv}\right) \times \text{weight}_{cv} \right], \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be periodically updated so that they reflect recent attack trends. However, for the initial setting, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since encoding makes it possible to infer static location information about the attacker, we defined encoding as high-quality information. IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to have a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be easily applied using the expert knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and

[Figure 5 graphic: raw feature values mapped to normalized categories. Encoding categories: Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe. gTLD categories: com, edu, gov, org, biz, mil, net, coop, and info. ccTLD categories: Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe. OS categories: Linux-based OS, MacOS, Unix-based OS, and Windows-based OS.]

Figure 5: Normalization of each feature element.


Input: TCs (Tested_DB) /* The Tested_DB indicates the cases-centric DB */
Input: RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC} /* RC means one of the retrieved cases */
Input: W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

for each TC = {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end
    IP_RC = {OctetA_RC, OctetB_RC, OctetC_RC, OctetD_RC}; IP_TC = {OctetA_TC, OctetB_TC, OctetC_TC, OctetD_TC}
    if (OctetA_RC = OctetA_TC) and (OctetB_RC = OctetB_TC) and (OctetC_RC = OctetC_TC) and (OctetD_RC = OctetD_TC) then
        IP_similarity_value ← 1.0
    else if (OctetA_RC = OctetA_TC) and (OctetB_RC = OctetB_TC) and (OctetC_RC = OctetC_TC) then
        IP_similarity_value ← 0.75
    else if (OctetA_RC = OctetA_TC) and (OctetB_RC = OctetB_TC) then
        IP_similarity_value ← 0.5
    else if (OctetA_RC = OctetA_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end
    Domain_RC = {ServiceName_RC, gTLD_RC, ccTLD_RC}; Domain_TC = {ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) and ((gTLD_RC = gTLD_TC) or (ccTLD_RC = ccTLD_TC)) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) and (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end
    Date_variance ← |Date_RC − Date_TC| /* It converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a number of days */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end
    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end
    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity_score between RC and TC
end for

ALGORITHM 1: Similarity measure module.


updating the weight value is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

The distance of some case vectors cannot be directly estimated because they have mixed numerical and nominal data (such as IP address range and domain name). For this reason, to calculate the distance between nominal data, we defined the discrete similarity measure. The similarity of IP addresses was calculated by comparing the octets at the same position in two given IP addresses. An IP address is composed of four octets separated by ".". In the present study, we checked whether the octets of RC and TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are listed in Table 2. The proposed approach is advantageous in that it enables an efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique sign that represents a specific region, such as kr, cn, br, and uk. DNS associates an IP address with a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is built to correspond with the characteristics of the groups, organizations, or corporations for which the gTLD is intended. The service name therefore varies with the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for other case vectors, we set a rule for estimating the similarity of the domain, as depicted in Table 2.

Furthermore, we defined the attack date similarity. As in offline criminal investigation, if the times of crime occurrence are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, domain, and attack date were each set to a value between 0 and 1 according to the degree of similarity within a defined range.
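The rules of Section 3.3.1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the weights are those of Table 2, the date thresholds are those written in Algorithm 1, and the function names, the dictionary layout of a case, and the example cases are assumptions made for illustration. The sketch also assumes that a missing domain component (an empty string) never counts as a match.

```python
from datetime import date

# Weights per case vector, as listed in Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(ip_rc, ip_tc):
    """0.25 per matching leading octet, stopping at the first mismatch."""
    matched = 0
    for a, b in zip(ip_rc.split("."), ip_tc.split(".")):
        if a != b:
            break
        matched += 1
    return matched / 4


def domain_similarity(rc, tc):
    """rc and tc are (service_name, gTLD, ccTLD) tuples; rules follow Table 2."""
    if rc == tc:
        return 1.0
    # Assumption: an empty component is treated as "no match".
    same_name, same_gtld, same_cctld = (bool(a) and a == b for a, b in zip(rc, tc))
    if same_name and (same_gtld or same_cctld):
        return 0.8
    if same_gtld and same_cctld:
        return 0.3
    if same_name or same_gtld or same_cctld:
        return 0.1
    return 0.0


def date_similarity(d_rc, d_tc):
    """Step function over the gap in days, using the thresholds of Algorithm 1."""
    gap = abs((d_rc - d_tc).days)
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return value
    return 0.0


def similarity_score(rc, tc, weights=WEIGHTS):
    """Weighted sum of the per-vector similarity values, as in equation (1)."""
    values = {
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(values[name] * weights[name] for name in weights)


# Hypothetical cases (documentation IP range, made-up domains).
rc = {"encoding": "Windows-1252", "ip": "192.0.2.16",
      "domain": ("example", "com", ""), "date": date(2013, 3, 20), "os": "Windows"}
tc = {"encoding": "Windows-1252", "ip": "192.0.77.5",
      "domain": ("other", "com", ""), "date": date(2014, 2, 6), "os": "Linux"}
print(round(similarity_score(rc, tc), 3))
```

Ranking all TCs for one RC then amounts to sorting them by `similarity_score(rc, tc)` in descending order.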

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that captures the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a parameter but automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that each data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to 80,948 data items having the information of encoding, gTLD, ccTLD, and OS out of 212,093 data items. The character encoding was normalized into groups of related code sets (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if we can find similar combinations of letters or numbers, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use in the clustering processing of the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
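To make the EM step more concrete, the sketch below fits a mixture of independent categorical features (encoding, gTLD, ccTLD, and OS) with a plain EM loop. It is an illustration under several assumptions, not the tool used in the study: the number of clusters is fixed by the caller instead of being selected by cross-validation, the function name `em_categorical` and the sample rows are hypothetical, and no smoothing is applied.

```python
import math
import random


def em_categorical(rows, n_clusters, n_iter=30, seed=0):
    """EM for a mixture of independent categorical features.

    rows: list of equal-length tuples of feature values.
    Returns (responsibilities per row, log-likelihood after each iteration).
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    values = [sorted({row[f] for row in rows}) for f in range(n_features)]
    # Start from random soft assignments; each row's responsibilities sum to 1.
    resp = []
    for _ in rows:
        raw = [rng.random() + 1e-3 for _ in range(n_clusters)]
        resp.append([x / sum(raw) for x in raw])
    history = []
    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster value probabilities.
        mass = [sum(r[k] for r in resp) for k in range(n_clusters)]
        prior = [m / len(rows) for m in mass]
        prob = [[dict.fromkeys(values[f], 0.0) for f in range(n_features)]
                for _ in range(n_clusters)]
        for row, r in zip(rows, resp):
            for k in range(n_clusters):
                for f in range(n_features):
                    prob[k][f][row[f]] += r[k]
        for k in range(n_clusters):
            for f in range(n_features):
                for v in values[f]:
                    prob[k][f][v] /= mass[k] or 1.0  # guard against an empty cluster
        # E-step: posterior probability of each cluster for each row.
        log_lik, resp = 0.0, []
        for row in rows:
            joint = [prior[k] * math.prod(prob[k][f][row[f]] for f in range(n_features))
                     for k in range(n_clusters)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)
    return resp, history


# Hypothetical normalized cases: (encoding, gTLD, ccTLD, OS).
rows = ([("Arabic", "com", "WesternEurope", "Windows")] * 6
        + [("CentralEurope", "org", "EastAsia", "Linux")] * 6)
resp, history = em_categorical(rows, n_clusters=2)
clusters = [max(range(2), key=lambda k: r[k]) for r in resp]
print(clusters)
```

Each case is assigned to the cluster with the highest posterior probability. Choosing the number of clusters by cross-validation, as described above, would wrap this routine in an outer loop that compares held-out log-likelihoods for increasing cluster counts.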

The donut charts show the features from outside to inside, with the share of each feature value marked by a different colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the share of all website defacement cases collected from the zone-h.org site that one cluster contains. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were 4 and 7, which share the use of Arabic and Chinese encodings and attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help make the similarity between new and existing cases more concrete. Of course, if a large number of new cases flow into the database and the clustering process is performed again on the resulting dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. The assumption that attackers tend to use similar or unique attack methods is not always valid, so it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present


Table 2: Value and the weight for the similarity score by the case vector. All of the values of the similarity score are normalized to the range from 0 to 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g., 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked against the profit organization and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked against the organization that manages the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked against the organization that manages the network and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked against the profit organization and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked against the profit organization and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked against the profit organization and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked against the profit organization, the educational institution, the government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps, which are depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked against the profit organization and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked against the profit organization and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked against the profit organization and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked against Windows-based OS in South America and Southeast Asia. Their attack target is mostly the educational institution and the government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked against the profit organization and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

[Figure 6 graphic: twelve donut charts, Clustering 00 through Clustering 11, each with a 0 to 100 percent scale. Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).]

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side).


$$R_k = \frac{\operatorname{Count}\left(\text{Cases}_{m,k}\right)}{\operatorname{Count}\left(\text{Cases}_{\text{all},k}\right)}, \tag{3}$$

where "m" means the past cases which are within the defined scope concerning a randomly selected hacker "k".

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and labelled each case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores, computed with the values and weights for each case vector (see Table 2), were sorted in descending order. Cases whose similarity value was 0 were not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

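$$N\ \text{Scope} = \frac{\operatorname{Count}\left(\text{Cases}_{\text{Scope},K}\right)}{\operatorname{Count}\left(\text{Cases}_{\text{all},K}\right)} \times 100. \tag{2}$$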

First, we randomly picked 100 hackers from the collected dataset (i.e., cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors from the top 1 percent to the top 30 percent and the ratio R of all the past attack cases for each hacker into six criterion factors from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases within the range of the top N scope were then determined (see Figure 9). Also, in the case of the hacking case ratio of a randomly

[Figure 7 graphic. Step 1, selection: 100 hackers randomly selected from the database (1 TheBuGz, 2 Hmei7, 3 3xp1r3, ..., 98 S3cure, 99 drm1st3r, 100 Lulz53c). Step 2, case labelling: Case 1 through Case m for each hacker. Step 3, case extraction: a retrieved case (the most recent case) of each hacker is compared with the cases-centric DB. Step 4, scoring: a list with the columns hacker name, date, encoding, IP address, domain, OS, and score, where the score is the weighted sum of equation (1).]

Figure 7: The developed testing procedures from Step 1 to Step 4.


selected hacker, some part of the past attack cases (i.e., ratio R) concerning a hacker was within the defined N scope (see Figure 9).
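One way to read the top N scope and the ratio R is sketched below. The reading is an interpretation, not the authors' code: a hacker counts as identified when at least R percent of that hacker's past cases appear within the top N percent of the ranked list. The function names, the rounding of the cut-off with `math.ceil`, and the sample ranking are assumptions.

```python
import math


def ratio_in_top_n(ranked_hackers, hacker, top_percent):
    """Share (0 to 1) of a hacker's past cases that fall in the top N percent.

    ranked_hackers: hacker name of every scored case, sorted by descending
    similarity score against the retrieved case.
    """
    total = ranked_hackers.count(hacker)
    if total == 0:
        return 0.0
    # Assumption: the cut-off position is rounded up to a whole case.
    cutoff = math.ceil(len(ranked_hackers) * top_percent / 100)
    return ranked_hackers[:cutoff].count(hacker) / total


def is_identified(ranked_hackers, hacker, top_percent, ratio_r):
    """True if at least ratio_r percent of the hacker's cases are in the top N percent."""
    return ratio_in_top_n(ranked_hackers, hacker, top_percent) * 100 >= ratio_r


# Hypothetical ranking of ten scored cases for a retrieved case of hacker "a".
ranked = ["a", "a", "b", "a", "b", "b", "b", "b", "b", "b"]
print(ratio_in_top_n(ranked, "a", 30))
print(is_identified(ranked, "a", top_percent=30, ratio_r=50))
```

Counting the hackers for which `is_identified` holds, for each pair of N and R, would give one bar of the kind plotted in Figure 10.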

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, even when hacking cases were caused by the same hacker, it is impossible to obtain a high similarity score for all cases of that hacker, because hackers or hacking groups that attacked only the same or similar targets were rare. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, our similarity mechanism is valuable for reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the

[Figure 8 graphic: X-axis, hacker (0 to 100); Y-axis, number of cases (0 to 5000).]

Figure 8: The number of website defacement attack cases in the past of each hacker.

[Figure 9 graphic: the Step 4 scoring list for one hacker (e.g., 1 TheBuGz) with the columns hacker name, date, encoding, IP address, domain, OS, and score, annotated with the top N scope (1% to 30%) and the ratio R (50% to 100%).]

Figure 9: Scoring step on the top N scope and the ratio R.


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI," the word written on the KBS homepage, means the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) took on characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull image are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases, i.e., as input values, we considered a direct comparison between the DS and SPE cases for the similarity score to be inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through a comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the three most similar cases among the results of the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified

[Figure 10 graphic: X-axis, criterion of the top N (%): Top 1, Top 3, Top 5, Top 10, Top 15, Top 20, Top 25, and Top 30; Y-axis, number of identified hackers (0 to 100); series, ratio of the attack cases (%): 50, 60, 70, 80, 90, and 100.]

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers.


as the same cluster (Cluster 7), where the hackers of Cluster 7 used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores obtained in the DS and SPE case study, we computed the similarity score of randomly selected pairs drawn from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained this distribution following the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. To produce it, the similarity score of two randomly selected website defacement cases was calculated repeatedly, 10,000 times in total. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. In the DS case, the top similarity score was 0.69, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
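The random-pair baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `similarity` function (a weighted share of matching features), the feature names, and the weights in `WEIGHTS` are all hypothetical stand-ins for the case vector and similarity measure defined earlier in the paper.

```python
import random
from statistics import mean, pvariance

# Hypothetical integer feature weights; the weights used in the study are not reproduced here.
WEIGHTS = {"encoding": 4, "os": 2, "tld": 2, "web_server": 2}

def similarity(a, b, weights=WEIGHTS):
    """Weighted share of features on which two cases agree (0.0 to 1.0)."""
    total = sum(weights.values())
    matched = sum(
        w for f, w in weights.items()
        if a.get(f) is not None and a.get(f) == b.get(f)
    )
    return matched / total

def random_pair_scores(cases, n_pairs=10_000, seed=0):
    """Similarity scores of n_pairs randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(cases, 2)  # two different cases from the case base
        scores.append(similarity(a, b))
    return scores

def summarize(scores):
    """Mean and population variance of a list of similarity scores."""
    return mean(scores), pvariance(scores)
```

Comparing the top score of a retrieved case against the mean returned by `summarize` gives a rough sense of whether that score stands out from what two arbitrary cases would produce.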

Figure 13 shows the distribution of the similarity score for the randomly selected 100 hackers mentioned in Section 4.1. To obtain the mean similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. The cases used for this score are therefore not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores over the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in DS and SPE.
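The per-hacker mean can be illustrated with the short sketch below. It assumes that the score is averaged over all pairs of a hacker's own past cases, which is one possible reading of the procedure; the function name `mean_self_similarity` and the `similarity` callable passed to it are hypothetical.

```python
from itertools import combinations
from statistics import mean

def mean_self_similarity(cases_by_hacker, similarity):
    """Mean pairwise similarity among each hacker's own past cases.

    cases_by_hacker maps a notifier name to a list of that hacker's cases;
    similarity is any function returning a score for two cases.
    Hackers with fewer than two cases are skipped.
    """
    result = {}
    for hacker, cases in cases_by_hacker.items():
        pairs = list(combinations(cases, 2))
        if pairs:
            result[hacker] = mean(similarity(a, b) for a, b in pairs)
    return result
```

Averaging the values of the returned dictionary then yields a single reference value comparable to the mean reported above.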

Figure 11: A snippet of website defacement cases from the DS and SPE examples: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case, and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Field | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
|---|---|---|---|---|
| Case name / Notifier | DarkSeoul (DS) | Hmei7 | d3b_X | StifLer |
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203248195178 | 2038623868 | 2031243766 | 77921083 |
| Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Field | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
|---|---|---|---|---|
| Case name / Notifier | Sony Pictures Entertainment (SPE) | Oaddah | MTRiX | EL_MuHaMMeD |
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203131222102 | 2031241555 | 20829198 | 2081164534 |
| Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the attacks by the similar hackers occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure represents the degree of attack. This network visualization method can help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is intricately entangled or hidden and therefore hard to present.
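The attack rates used in the follow-up investigation (the per-year and per-domain shares of Table 5) amount to a simple frequency count over one hacker's cases. The sketch below illustrates this; the function name `attack_rate` and the dictionary keys `year` and `tld` are hypothetical and do not come from the dataset schema.

```python
from collections import Counter

def attack_rate(cases, key):
    """Percentage share of a hacker's cases per value of `key` (e.g. year or TLD)."""
    counts = Counter(case[key] for case in cases)
    total = sum(counts.values())
    if total == 0:
        return {}  # no cases recorded for this hacker
    return {value: round(100.0 * n / total, 2) for value, n in counts.items()}
```

Applying the function once per hacker and per feature produces the rows of a table laid out like Table 5, and the same counts can serve as flow widths in an alluvial diagram.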

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing the case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (annotated in the plot: Mean = 0.2930, Var = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A, highest similarity score 0.69) and distribution of the similarity value between the collected website defacement cases and the SPE case (B, highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. (Horizontal axes: similarity score; vertical axes: frequency.)

Figure 13: Distribution of the similarity score for randomly selected 100 hackers. (Horizontal axis: mean value of the similarity score; vertical axis: frequency.)

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. Therefore, we focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents can be included in our study. For instance, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning. Based on this large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed and the significant features were chosen for use in case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and the significant data on prime incidents, and data-driven and evidence-driven decision making should be the critical process. Reducing the amount of data and the time to be analysed is also important for delivering intelligence data of high value.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). (Alluvial diagram columns: Hacker, Year, ccTLD, gTLD, Attack.)

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rarely available for verification (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be considered. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or by different types of signal intelligence (SIGINT) and other sources may also help to reason about the constituent elements of a crime and to narrow the scope of the investigation. Furthermore, the proposed methodology can be extended so that its incident information is compatible and exchangeable with other cyberthreat intelligence systems, for example through the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

The web resources gathered from the zone-h.org site contain further features, such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) and image or mp3 files. Although these features are present for only a small number of hackers, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and for estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on api call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344–345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on cbr and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (cpted): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., "Profiling cyber attackers using case-based reasoning," in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., "Mining criminal networks from unstructured text documents," Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, "Cyber profiling using log analysis and k-means clustering," International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, "Crime pattern detection using data mining," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, "Syria, Egypt crises spur escalation of me cyber attacks," 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, "Character encoding in corpus construction," in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., "Cascading style sheets level 2 revision 1 (CSS 2.1) specification," W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, "Massive sony breach sheds light on murky hacker universe," 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, "Sony pictures malware tied to Seoul, 'Shamoon' cyber-attacks," 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, "Sony hack: signs point to North Korea," 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, "Case-based reasoning: an introduction," Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, "Mapping change in large networks," PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, "STIX/TAXII standards," 2017-2018, https://oasis-open.github.io/cti-documentation.


operating system. It enables an investigator to decide quickly and precisely where to search for evidence [24].

2.2. Traditional Criminal Profiling and Cybercrime Profiling. Profiling is used in various sectors of society to investigate a criminal's mentality. Criminal profiling is a profiling technique for criminal investigation based on the psychological and behavioural patterns of a criminal [25, 26]. The criminal aspects and crime factors can be identified through the evidence and insights of the psychological and behavioural bias [27]. In the field of criminology, the widely used profiling technique is called the Modus Operandi (MO). It is used to describe a suspect's behaviour and the evidence elements in a crime; that is, it describes how a suspect commits their crimes. The Modus Operandi changes based on the offender's criminal conduct and interaction with the surroundings, such as the time, date, and location of the crime. Moreover, it evolves based on how the offender reaches his/her victim [28, 29].

Relying only on traditional criminal profiling techniques and empirical knowledge, it is difficult for a cybercrime investigator to reduce errors in the investigative process and to untangle the complexity of a cybercrime. However, if the investigator is provided with sufficient information and detailed analysis data to understand the unclear motivation and the elusive pattern related to the cybercrime, they can infer the reason(s) for the crime at stake and produce both general and specific outlines of the criminal [26]. The cybercrime network and its characteristics can be important indicators for differentiating between key figures in cybercrime organizations and those of passing interest. In addition, the activity periods and message content patterns of the participants in an illegal community can help the investigator carefully identify and scrutinize the key figure in the cybercrime network [30, 31]. By automating cybercrime profiling and data-mining methods of analysis through a cross-analysis of various behavioural patterns, we can anticipate potential criminal activities and identify new profiles that pose serious threats to the community. Furthermore, data-mining methods such as entity extraction, clustering/classification techniques, and social network analysis make it possible to explore large data efficiently. Network visualization enables an investigator to recognize the crime pattern intuitively [32–34].

In general, the accuracy of CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate [35]. Although the effectiveness of data-driven investigation can decrease owing to dynamic and fast-evolving crime patterns, understanding the hidden correlations and latent behaviour in such data using large data analytic techniques is another promising research direction. Accordingly, many law enforcement agencies have been adopting future crime prediction systems based on statistics about weather, cleanliness, location, demographic distribution, education level, and wealth level. Based on the crime prevention through environmental design (CPTED) theory [36], many pieces of data correlated with crime are collected and analysed to estimate the crime probability. However, while many data-driven approaches to support traditional criminal profiling are available, only a few research efforts have focused on cybercrime profiling.

2.3. Data-Driven Cybercrime Profiling. In addition to the traditional criminal profiling for offline crime investigations, various profiling techniques have been developed; in these techniques, it is assumed that cybercriminals also show similar behavioural and psychological characteristics. Owing to the recent advances in data-mining and machine-learning algorithms, many studies regarding criminal pattern detection, classification, and clustering have emerged. The methods used in these studies include, among others, entity extraction, clustering, association rule mining, deviation detection, classification, and social network analysis. A combination of a traditional method and a newer method enables pattern identification from both structured and unstructured data. For instance, entity extraction is used to understand concealed patterns in data such as texts, images, and audio data. Furthermore, clustering is used to group objects into classes with similar characteristics [37, 38]. In addition, unsupervised methods such as the self-organizing map (SOM) are used to support the results of traditional criminal profiling [39]. In cases where the criminal and the related cases are known, supervised learning is applied [40]. However, although many advances have occurred in big data analytics and machine learning, these approaches are limited in supporting real-time processing, as they require high computing power to handle a large volume of training data. In fact, the large volume of crime data is a considerable challenge for the investigator, both in gaining an appropriate understanding of a complicated relationship and in responding in a timely manner. Despite these limitations, data mining yields valid, useful, and appropriate results. Data preprocessing, such as data cleaning, data integration, and data transformation, is intended to reduce noisy, incomplete, and inconsistent data. It helps to uncover and conceptualize concealed or latent crime patterns. By improving the efficiency of crime data understanding and reducing errors in the results afforded by the data-mining method, the investigator can perform reasoning, timely judgment, and quick problem solving [41].

Figure 2: CBR process used in cybercrime investigation, employing a knowledge base that is reused over similar new cases and retained for later use. (Diagram elements: defaced websites, target cases, similarity measurement, case base, proposed solution, confirmed solution; process steps: retrieve, reuse, revise, retain, employ.)

CBR is also used to provide the reasoning power to search for similar previous cases [25, 42]. However, biased or imperfect collected data deteriorate the quality of the decision support provided by CBR. Therefore, in many cases, the weights of the selected features are set based on empirical knowledge, which can subsequently be used to enable the detection and analysis of crime patterns from temporal crime activity data. Using clustering and classification techniques as well as speculative models for searching for similar past crime cases, investigators can easily extract useful information from an unstructured textual dataset [43]. Hence, investigators must collect and continuously update comprehensive crime data.

Clustering is the task of determining groups of similar items in the data, and it belongs to the unsupervised learning methods. Zulfadhilah et al. compared four types of clustering algorithms (K-means, hierarchical clustering, SOM, and the Expectation Maximization (EM) algorithm) based on their performance. They concluded that the K-means algorithm and the EM algorithm are better than the hierarchical clustering algorithm. In general, partitioning algorithms such as K-means and EM are highly recommended for use with large data [44]. In summary, a clustering algorithm can help the investigator detect crime patterns and accelerate crime solving. A weighting scheme for attributes can address the limitations of the clustering techniques [45].
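As an illustration of the partitioning approach discussed above, the following is a minimal sketch of K-means (Lloyd's algorithm) on numeric vectors. It is a generic textbook version, not the clustering configuration used in this study; the function name `kmeans`, its parameters, and the requirement that cases be encoded as equal-length numeric tuples are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on equal-length numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_labels == labels and new_centroids == centroids:
            break  # converged: neither assignments nor centroids changed
        labels, centroids = new_labels, new_centroids
    return centroids, labels
```

Categorical case features such as encoding or OS would first need a numeric encoding (for example one-hot vectors), and per-feature weights can be applied by scaling the corresponding coordinates before clustering.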

3. Methodology

In this section, we present the detailed scheme of the decision support methodology for cybercrime investigation, with a focus on website defacement cases. The conceptual framework and its process are illustrated in Figure 3. The scheme proceeds in three steps: data preprocessing, case vector design, and the reasoning engine. First, we provide a brief outline of the dataset and describe the merits of the website defacement data. We also summarize the preprocessing used for parsing and cleaning the collected data types. Next, we design the case vector and choose the significant features to improve the reasoning performance. Finally, we describe the reasoning engine, which has various functionalities and is intended for grouping (clustering) cases based on their similarity.

3.1. Preprocessing. As part of the proposed analytical framework, we developed a crawler to automatically collect 212,093 website defacement cases from the zone-h.org site. Many website defacement cases are recorded daily in the archive page of zone-h.org. Each case registered in the archive page provides information (i.e., IP address, domain, date, OS, notifier, and web server) in the same format through its mirror page. First, the crawler collects all public information relevant to each case. Thereafter, on accessing the domain site, it saves the data in the raw format of the HTML source. After the raw web resources are crawled, data preprocessing is performed to amend incomplete, improperly formatted, or duplicate data records. More specifically, the HTML source contains various tag attributes. The encoding and font data are extracted through the <charset> and <font-style> tags of the HTML elements set between the <head> and </head> tags. Likewise, images, sound files, and linked sites are extracted through the <font-family>, <img>, and <href> tags set between the <body> and </body> tags. The web resources, as original raw data, were parsed and cleaned according to the relevant case vector (see Figure 4). After cleaning, the significant data fields were selectively stored in the system's case database.
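The extraction step described above can be illustrated with a minimal sketch. It assumes that the declared charset, font faces, image sources, and link targets are read from standard HTML elements using only the Python standard library; the class name `DefacementPageParser`, the function `parse_case_html`, and the chosen elements are hypothetical and are not the authors' crawler.

```python
from html.parser import HTMLParser


class DefacementPageParser(HTMLParser):
    """Collects the declared charset, font faces, images, and links of one page."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.fonts = []
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta":
            if attrs.get("charset"):
                # Form 1: <meta charset="...">
                self.charset = attrs["charset"].strip().lower()
            else:
                # Form 2: <meta http-equiv="Content-Type" content="...; charset=...">
                content = (attrs.get("content") or "").lower()
                if "charset=" in content:
                    value = content.split("charset=", 1)[1]
                    self.charset = value.split(";")[0].strip()
        elif tag == "font" and attrs.get("face"):
            self.fonts.append(attrs["face"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def parse_case_html(html_source):
    """Return the fields of one raw HTML source as a plain dictionary."""
    parser = DefacementPageParser()
    parser.feed(html_source)
    return {
        "encoding": parser.charset,
        "fonts": parser.fonts,
        "images": parser.images,
        "links": parser.links,
    }
```

A record produced this way would still need the cleaning step (deduplication and repair of malformed fields) before being stored in the case database.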

The selected data fields relate to the website defacement date, the related IP address, the target domain, the target system OS, and the web server version; these aspects have proven useful for cyberattack investigations [46]. In addition, the encoding method and the font contained in the HTML source were needed to speculate on the attacker's regional information. For example, if the messages left on a defaced website are written in an ISO/IEC 8859 encoding, we can infer that the hacker's language is likely German, Spanish, or Swedish. Furthermore, depending on whether all the messages are written in the same encoding, special characters such as ß, ñ, or å can serve as a clue to the attacker's origin. In general, the encodings from Windows-1250 to Windows-1258 are used for the Central European languages as well as for Turkish, the Baltic languages, and Vietnamese. By contrast, GB encoding is used for Chinese, HKSCS encoding for Taiwanese, and EUC-KR or ISO-2022-KR encoding for Korean [47]. In addition to the font and encoding information, the text, images, audio, and video found in the messages are also necessary parameters for case identification.
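As an illustration of how a declared encoding can be turned into a regional clue, the following sketch maps a charset label to a language group. The table is only an assumed excerpt built from commonly documented charset assignments and the group names used in Figure 5; `ENCODING_GROUPS` and `language_group` are hypothetical names, and unknown or overly general labels such as UTF-8 deliberately map to nothing.

```python
# Assumed excerpt of a charset-to-language-group table (not the full mapping).
ENCODING_GROUPS = {
    "iso-8859-6": "Arabic", "windows-1256": "Arabic",
    "iso-8859-2": "CentralEurope", "windows-1250": "CentralEurope",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
    "iso-8859-5": "Cyrillic", "windows-1251": "Cyrillic",
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
    "iso-8859-9": "Turkish", "windows-1254": "Turkish",
    "iso-8859-1": "WestEurope", "windows-1252": "WestEurope",
}


def language_group(charset):
    """Return the language group for a charset label, or None if it gives no clue."""
    if not charset:
        return None
    return ENCODING_GROUPS.get(charset.strip().lower())
```

For instance, `language_group("EUC-KR")` yields the Korean group, whereas a page declaring UTF-8 provides no regional hint under this table.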

3.2. Case Vector Design. We designed two versions of the case vector, one for the similarity measure and one for the clustering processing. The case vector is summarized in Table 1. Features such as the font, web server, thanks-to, and notifier (hackers or hacking groups), as well as the encoding, IP address, domain, attack date, and OS, were extractable from the public archival site zone-h.org. Generally, a more diverse set of features can help in investigating relationships and associations among hackers or hacking groups, as well as the scale and the density/intensity of the hacker community. However, this premise has some shortcomings. The importance, or weight, of each feature may differ depending on the criterion. Also, if all features are retained, machine-learning algorithms such as clustering or classification become difficult to run in practice because of the high computational cost. Moreover, some features carry similar meanings and are therefore redundant. To this end, dimensionality reduction and feature selection were performed in the present study. After a thorough review by security experts, the significant features were selected for the case vector of website defacement cases. The dimensionality reduction and the feature selection are explained in detail as follows.


In the Windows operating system, if a specific font is not designated in the HTML code, for example through the <font-family> property, the characters on a web page may appear broken. In particular, some of the fonts used in the Chinese-character cultural sphere depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, this characteristic may be key evidence for uncovering a correlation between the victim and the attacker; however, it appears extremely rarely in the collected website defacement cases. Therefore, it is not suitable as a case vector for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc., when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described above, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector for cybercrime investigation. Finally, although the thanks-to and notifier features can be used to analyse a hidden network among hackers and hacking groups, such network analysis should be addressed in future research.

As a result, we defined the case vector in two versions: one for the similarity measure and one for the clustering processing. The encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding provides decisive clues related to the attacker's regional information. The IP address and domain give clues related to the victim's location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. The key features are explained in detail in Table 1.

Figure 3: Proposed analytical framework for the data-driven website defacement cases. [Diagram: a crawler gets the metadata and HTML source of each website defacement case through the mirror page and archive page of the zone-h.org site; preprocessing performs data parsing and data cleaning; case vector design performs feature selection and feature normalization; the cases are kept in the cases-centric DB. In the reasoning engine, the similarity module matches a new attack case with a former attack case depending on the defined case vector, calculates weights and values, and measures the similarity score from them; the clustering module performs the clustering processing through the EM algorithm and derives several clusters that exhibit similar patterns.]

Figure 4: Sorted dataset through the preprocessing.

The normalization of the feature elements stored in the raw HTML source is presented in Figure 5. The encodings, consisting of the ISO series and the MS Windows series, were normalized according to the region or country in which each encoding is used. The gTLDs were normalized by grouping organizations with similar characteristics. The ccTLDs were normalized by continent. Although compressing and normalizing the features makes analyses such as the clustering processing and the similarity measure simpler and clearer, it may also cause a loss of information from the original data or make detailed analysis more difficult.
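The domain part of this normalization can be sketched as follows, assuming a domain is split from the right into ccTLD, gTLD, and service name. The alias table follows the groupings named in the text (for example, go and gob are folded into gov), while the region table is only a small assumed excerpt; `split_domain` and `region_of` are hypothetical helper names.

```python
GTLDS = {"com", "edu", "gov", "org", "biz", "mil", "net", "coop", "info"}
# Variants that carry the same meaning are folded into one gTLD group.
GTLD_ALIASES = {"co": "com", "or": "org", "go": "gov", "gob": "gov", "ac": "edu"}
# Assumed excerpt of the ccTLD-to-continent grouping.
CCTLD_REGIONS = {
    "kr": "EastAsia", "cn": "EastAsia", "jp": "EastAsia",
    "br": "SouthAmerica", "ar": "SouthAmerica",
    "fr": "WesternEurope", "nl": "WesternEurope", "uk": "WesternEurope",
    "vn": "SoutheastAsia", "th": "SoutheastAsia",
}


def split_domain(domain):
    """Split a domain into (service name, normalized gTLD, ccTLD).

    Simplification: any two-letter final label is treated as a ccTLD.
    """
    labels = domain.lower().strip(".").split(".")
    cctld = gtld = None
    if len(labels) > 1 and len(labels[-1]) == 2:
        cctld = labels.pop()
    if len(labels) > 1:
        candidate = GTLD_ALIASES.get(labels[-1], labels[-1])
        if candidate in GTLDS:
            gtld = candidate
            labels.pop()
    # The label directly left of the TLD part is taken as the service name.
    return labels[-1], gtld, cctld


def region_of(cctld):
    """Return the continental group of a ccTLD, or None if it is not listed."""
    return CCTLD_REGIONS.get(cctld)
```

Under these assumptions, `split_domain("www.example.go.kr")` gives the service name `example`, the normalized gTLD `gov`, and the ccTLD `kr`, which `region_of` places in East Asia. The original ccTLD is kept because the similarity measure uses it unnormalized.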

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance between nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the new attack case vector and the vectors of the other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, owing to a better understanding of complicated relationships and a more timely response. In the present study, we implemented a reasoning engine consisting of two processing entities: the similarity measure and the clustering algorithm (see below for further details).

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we propose a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. If a retrieved case (RC, a new case) is given and there are n cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between the RC and the TCs is conducted n times. We defined the extent of similarity between an RC and a TC as a numerical value from 0 to 1, where 0 means that the RC and the TC are unrelated and 1 means that they are identical. The similarity score S (0 ≤ S ≤ 1) specifies the extent of similarity between the RC and the TC: the closer the score is to 1, the more analogous the two cases are. When there are multiple case vectors, the similarity can be expressed as a weighted sum over the case vectors, as in equation (1).

Table 1: Case vector design highlighting two groups of features.

| Case vector | Used in S | Used in C | Description |
|---|---|---|---|
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on MS Windows and the ISO character set. |
| IP address | O | NA | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | NA | The service name is individually made with a different name depending on the service categories such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., the go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, a specific region, or an international organization. The ccTLD normalized by the continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | NA | The attack date performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

S: similarity measure; C: clustering processing.


$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \mathrm{distance}\left(RC_{cv}, TC_{cv}\right) \times \mathrm{weight}_{cv} \right], \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as heuristic methods, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be updated periodically so that they reflect recent attack trends. For the initial setting, however, it is difficult to assign an exact numerical value to each weight. In our experiment, we set the impact and the weight of each case vector to high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since the encoding makes it possible to infer static information about the attacker's location, we defined the encoding as high-quality information. The IP address and domain, which enable the identification and specification of the victim, were defined as medium-quality information. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to values from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be applied easily using the knowledge of investigation experts and is easy to understand from a researcher's viewpoint. A quantitative method for setting and updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS, as listed in Table 2.

Figure 5: Normalization of each feature element. [Diagram: the raw values of four features are mapped to normalized groups. Encoding: ISO, MS Windows, and regional character sets are mapped to Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe (e.g., ISO-8859-6 and Windows-1256 to Arabic; ISO-2022-KR and EUC-KR to Korean). gTLD: variants are mapped to com, edu, gov, org, biz, mil, net, coop, and info (e.g., gov, go, and gob to gov; edu and ac to edu; org and or to org). ccTLD: country codes are mapped to Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe. OS: products are mapped to Linux-based OS, MacOS, Unix-based OS, and Windows-based OS.]

Algorithm 1: Similarity measure module.

Input:  TCs (Tested_DB)  /* Tested_DB indicates the cases-centric DB */
        RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC}  /* RC means one of the retrieved cases */
        W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

TC {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
for each TC in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end

    IP_RC = {OctetA_RC, OctetB_RC, OctetC_RC, OctetD_RC}; IP_TC = {OctetA_TC, OctetB_TC, OctetC_TC, OctetD_TC}
    if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) ∧ (OctetD_RC = OctetD_TC) then
        IP_similarity_value ← 1.0
    else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) then
        IP_similarity_value ← 0.75
    else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) then
        IP_similarity_value ← 0.5
    else if (OctetA_RC = OctetA_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end

    Domain_RC = {ServiceName_RC, gTLD_RC, ccTLD_RC}; Domain_TC = {ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) ∧ ((gTLD_RC = gTLD_TC) ∨ (ccTLD_RC = ccTLD_TC)) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) ∧ (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end

    Date_variance ← |Date_RC − Date_TC|  /* a date in year, month, and day format (i.e., yyyy-mm-dd) is converted into a numeric day count */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end

    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end

    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity_score between RC and TC
end for
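The weighted sum of equation (1) can be written as a short sketch. It assumes the weights reported in Table 2 (0.5, 0.2, 0.15, 0.1, and 0.05) and takes the per-feature similarity values as already computed; the dictionary keys and the function name `similarity_score` are illustrative choices rather than part of the original system.

```python
# Weights per case vector as reported in Table 2; they sum to 1.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def similarity_score(values, weights=WEIGHTS):
    """Weighted sum of per-feature similarity values, each expected in [0, 1].

    Features missing from `values` contribute 0 to the score.
    """
    for name, value in values.items():
        if name not in weights:
            raise KeyError(f"unknown case vector: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"similarity value for {name} must lie in [0, 1]")
    return sum(weights[name] * values.get(name, 0.0) for name in weights)
```

As a worked example, a pair of cases with the same encoding (1), three matching leading IP octets (0.75), a matching gTLD and ccTLD only (0.3), dates within one year (1), and different operating systems (0) scores 0.5 + 0.15 + 0.045 + 0.1 + 0 = 0.795.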

The distance for some case vectors cannot be estimated directly because they mix numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. An IP address is composed of four octets separated by ".". In the present study, we checked whether the octets of the RC and the TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are listed in Table 2. The proposed approach is advantageous in that it allows the distance between IP addresses to be calculated efficiently.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www
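The octet comparison can be sketched as below, assuming dotted-quad IPv4 strings and counting only the leading octets that match, in line with the branch order of Algorithm 1. The function name `ip_similarity` and the example addresses in the note are illustrative.

```python
def ip_similarity(ip_rc, ip_tc):
    """Discrete IP similarity: 0.25 for each leading octet shared by both addresses."""
    octets_rc = ip_rc.strip().split(".")
    octets_tc = ip_tc.strip().split(".")
    if len(octets_rc) != 4 or len(octets_tc) != 4:
        raise ValueError("expected dotted-quad IPv4 addresses")
    matched = 0
    for a, b in zip(octets_rc, octets_tc):
        if int(a) != int(b):
            break  # later octets are ignored once a leading octet differs
        matched += 1
    return matched / 4
```

For example, `192.0.2.10` and `192.0.2.99` share three leading octets and score 0.75, whereas `192.0.2.10` and `10.0.2.10` score 0 because the first octet already differs.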

Meanwhile, the similarity between domains is calculated according to their domain properties. A domain is composed of the gTLD, the ccTLD, and the service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations, org and or are used for nonprofit organizations, and go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique sign that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is chosen to correspond with the characteristics of the groups, organizations, or corporations that the gTLD is intended for. Service names are therefore diverse and differ depending on the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as listed in Table 2.
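The domain rule can be sketched as follows, using the values of Table 2 and the branch order of Algorithm 1. The `Domain` tuple and `domain_similarity` are hypothetical names, and treating a missing gTLD or ccTLD as a non-match is an assumption of this sketch rather than something the text specifies.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    service: str
    gtld: Optional[str]
    cctld: Optional[str]


def domain_similarity(rc, tc):
    """Discrete domain similarity between two (service, gTLD, ccTLD) triples."""
    same_service = rc.service == tc.service
    # Assumption: an absent TLD part (None) never counts as a match.
    same_gtld = rc.gtld is not None and rc.gtld == tc.gtld
    same_cctld = rc.cctld is not None and rc.cctld == tc.cctld
    if rc == tc:
        return 1.0  # identical domain
    if same_service and (same_gtld or same_cctld):
        return 0.8
    if same_gtld and same_cctld:
        return 0.3
    if same_service or same_gtld or same_cctld:
        return 0.1
    return 0.0
```

With this ordering, two government sites in the same country with different service names score 0.3, while two domains that share only the service name score 0.1.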

Furthermore, we defined the attack date similarity. As in an offline criminal investigation, if two crimes occur close together in time, we can analyse them as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and the existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, domain, and attack date were each set to a value between 0 and 1 according to the degree of similarity within a graded range.
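A minimal sketch of the date rule is given below, assuming the day-count thresholds of Algorithm 1 (365, 1095, 1825, and 2555 days) and Python `date` objects as input; the function name `date_similarity` is illustrative.

```python
from datetime import date


def date_similarity(date_rc, date_tc):
    """Discrete date similarity based on the absolute gap in days."""
    gap_days = abs((date_rc - date_tc).days)
    if gap_days <= 365:
        return 1.0
    if gap_days <= 1095:
        return 0.75
    if gap_days <= 1825:
        return 0.5
    if gap_days <= 2555:
        return 0.25
    return 0.0
```

Two cases recorded three months apart therefore receive the full value of 1, and the value drops by 0.25 for each further two-year band until it reaches 0 beyond 2555 days.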

3.3.2. Clustering Processing. Merely sorting the data and analysing them visually makes it difficult for an investigator to infer the correlations and similarities among the potential features of incidents. Hence, an advanced tool that captures the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a fixed parameter but automatically determines a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that each data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to 80,948 data items, out of the 212,093 collected, that carried encoding, gTLD, ccTLD, and OS information. The character encoding was normalized by groups of related code units (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. As for the service name, even when similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use in the clustering processing of the reasoning engine. Consequently, the characteristics and metadata of 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
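To make the EM step concrete, the following is a simplified sketch of EM for a mixture of independent categorical features, written with the standard library only. It is not the implementation used in the study: the number of clusters is fixed by the caller instead of being chosen by cross-validation, and the function name `em_categorical`, the smoothing constant `alpha`, and the random initialization are assumptions of this sketch.

```python
import math
import random


def em_categorical(rows, n_clusters, n_iter=50, seed=0, alpha=1e-3):
    """EM for a mixture of independent categorical features.

    rows: list of equal-length tuples, e.g. (encoding, gTLD, ccTLD, OS).
    Returns (labels, history): hard cluster labels and the log-likelihood
    recorded at each iteration.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_features)]
    # Start from random soft assignments (responsibilities) for every row.
    resp = []
    for _ in rows:
        draws = [rng.random() + 1e-6 for _ in range(n_clusters)]
        total = sum(draws)
        resp.append([d / total for d in draws])
    history = []
    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-cluster value frequencies.
        sizes = [sum(r[k] for r in resp) for k in range(n_clusters)]
        mix = [s / len(rows) for s in sizes]
        theta = []
        for k in range(n_clusters):
            per_feature = []
            for j in range(n_features):
                counts = {v: alpha for v in values[j]}
                for row, r in zip(rows, resp):
                    counts[row[j]] += r[k]
                total = sum(counts.values())
                per_feature.append({v: c / total for v, c in counts.items()})
            theta.append(per_feature)
        # E-step: posterior probability of each cluster for each row.
        log_likelihood = 0.0
        for i, row in enumerate(rows):
            joint = [
                mix[k] * math.prod(theta[k][j][row[j]] for j in range(n_features))
                for k in range(n_clusters)
            ]
            total = sum(joint)
            resp[i] = [p / total for p in joint]
            log_likelihood += math.log(total)
        history.append(log_likelihood)
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return labels, history
```

Each row receives the cluster with the highest posterior probability; in a fuller treatment, the number of clusters would be selected by comparing held-out log-likelihoods across candidate values, which this sketch omits.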

The donut charts show the different features from outside to inside, with the share of each feature value distinguished by colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the share of cases that one cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were Clusters 4 and 7, which share the use of Arabic and Chinese and attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help make the similarity between new and existing cases more concrete. Naturally, if a large number of new cases flow into the database and the clustering process is performed again on that dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. The assumption that attackers tend to use similar or unique attack methods is not always valid, and it is therefore difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign, purpose, and target groups can change depending on the situation. Therefore, in the present study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology using the ratio of correctly identified hackers. The developed testing procedure unfolds in the following four steps and is depicted in detail in Figure 7, where K denotes all hackers within the database.

Table 2: Values and weights for the similarity score by case vector. All similarity values are normalized to the range from 0 to 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 143.248.1.6 and 143.248.1.6) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 143.248.1.6 and 143.248.1.8) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 143.248.1.6 and 143.248.4.4) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 143.248.1.6 and 143.13.2.4) | 0.25 |
| | | | No common octet (e.g., 143.248.1.6 and 163.13.2.5) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched, and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage networks and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage networks and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and Windows-based OS in East Asia. | 1923Turk |
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). [Donut charts, one per cluster, Clustering 00 through Clustering 11. Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).]


(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and labelled each case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing this most recent case (i.e., the RC, one of the retrieved cases) with all other cases in the database (i.e., the TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the cases were sorted in descending order of the similarity score, computed from the values and weights of the case vector (see Table 2). Cases with a similarity value of 0 were not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list; that is, for each hacker, we checked what ratio of the hacker's attack cases was included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{Scope},K}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all},K}\right)} \times 100, \tag{2}$$

$$R_k = \frac{\mathrm{Count}\left(\mathrm{Cases}_{m,k}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all},k}\right)}, \tag{3}$$

where m denotes the past cases that are within the defined scope for a randomly selected hacker k.
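The evaluation measure can be sketched as follows, assuming the scoring list of Step 4 is available as a list of hacker labels ordered by descending similarity score. The function name `ratio_in_top_scope`, the rounding of the cutoff with `ceil`, and the toy ranking in the note are assumptions of this sketch; the ratio R of equation (3) is returned as a percentage.

```python
import math


def ratio_in_top_scope(ranked_hackers, hacker, top_percent):
    """Percentage of `hacker`'s cases that fall within the top N percent of the ranking.

    ranked_hackers: hacker label of each case, sorted by descending similarity score.
    """
    total = sum(1 for h in ranked_hackers if h == hacker)
    if total == 0:
        return 0.0
    # Number of cases covered by the top N percent of the scoring list.
    cutoff = math.ceil(len(ranked_hackers) * top_percent / 100)
    in_scope = sum(1 for h in ranked_hackers[:cutoff] if h == hacker)
    return 100.0 * in_scope / total
```

For a toy ranking `["A", "A", "B", "A", "C", "B", "A", "C", "B", "C"]`, the top 30 percent covers the first three cases, which contain two of the four cases of hacker A, so the ratio R for A is 50 percent. A hacker would then count as identified for a given criterion when this ratio reaches the chosen threshold.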

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by similarity score) was included in the top N scope. This process checks, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As expressed in equations (2) and (3), the N scope and the ratio R were defined as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement: attack cases were sorted in descending order of similarity score, and the cases within the top N percent of this ordering formed the top N scope (see Figure 9). Likewise, the hacking case ratio of a randomly selected hacker, i.e., the ratio R, is the share of that hacker's past attack cases that fell within the defined N scope (see Figure 9).

Figure 7: The developed testing procedures from Step 1 to Step 4. [Diagram: Step 1, selection (100 hackers randomly selected from the database, e.g., TheBuGz, Hmei7, 3xp1r3, ..., S3cure, drm1st3r, Lulz53c); Step 2, case labelling (Case 1 through Case m for each hacker); Step 3, case extraction (a retrieved case, i.e., the most recent case, compared against the cases-centric DB); Step 4, scoring (for each hacker, a list with the columns hacker name, date, encoding, IP address, domain, OS, and score, where the score is the weighted sum of equation (1)).]

Figure 10 shows the number of hackers identified for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of hackers identified in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, even when hacking cases were caused by the same hacker, hackers or hacking groups that attacked only the same or similar targets were rare, so it is impossible to obtain a high similarity score for all cases of a hacker. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and cases to be examined and can suggest the potential top N percent of candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and characterized them using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third largest telecommunication company in South Korea, and the English version of the

Figure 8: The number of past website defacement attack cases of each hacker (X-axis: hacker, 0 to 100; Y-axis: number of cases, 0 to 5000).

Figure 9: Scoring step (Step 4) on the top N scope (1~30) and the ratio R (50~100).


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in the defaced LG U+ website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. “HASTATI,” the word written on the KBS homepage, means the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull images, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

To evaluate the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as RCs, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as input cases, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].
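The retrieval step described above can be illustrated with a short sketch. It follows the weighted-sum form shown in Figure 7 (a per-feature comparison multiplied by a feature weight and summed over the case vector), but the exact-match comparison, the weights, and the names `similarity` and `top_k` are assumptions made for illustration; they are not the paper's equation (1) or its actual weights.

```python
from typing import Dict, List, Optional, Tuple

Case = Dict[str, Optional[str]]

# Hypothetical feature weights for illustration only; they sum to 1.0.
WEIGHTS: Dict[str, float] = {"encoding": 0.4, "os": 0.2, "domain": 0.2, "ip": 0.2}


def feature_match(a: Optional[str], b: Optional[str]) -> float:
    """Illustrative per-feature score: 1.0 on a case-insensitive match, else 0.0."""
    if a is None or b is None:
        return 0.0
    return 1.0 if a.lower() == b.lower() else 0.0


def similarity(rc: Case, tc: Case, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-feature scores between a retrieved and a tested case."""
    return sum(w * feature_match(rc.get(name), tc.get(name))
               for name, w in weights.items())


def top_k(rc: Case, tested: Dict[str, Case], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k tested cases with the highest similarity to the retrieved case."""
    scored = [(name, similarity(rc, tc)) for name, tc in tested.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


retrieved = {"encoding": "Windows-1252", "os": "Windows"}
case_db = {
    "case_x": {"encoding": "windows-1252", "os": "Windows"},
    "case_y": {"encoding": "GB2312", "os": "Windows"},
    "case_z": {"encoding": "GB2312", "os": "Linux"},
}
for name, score in top_k(retrieved, case_db, k=2):
    print(name, round(score, 3))
```

A real deployment would replace the exact-match comparison with feature-specific distances (for example, on dates or IP ranges); the sketch only shows how a ranked candidate list follows from a weighted score.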

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarize the characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (X-axis: criterion of the top N (%): Top 1, 3, 5, 10, 15, 20, 25, and 30; Y-axis: number of identified hackers, 0 to 100; series: ratio of the attack cases (%): 50, 60, 70, 80, 90, and 100).


as the same cluster (Cluster 7); the hackers of Cluster 7 used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score with reference to the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. The distribution was produced by repeating the calculation of the similarity score of two randomly selected website defacement cases 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. In the DS case, the top similarity score was 0.69, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
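The random-pair baseline described above can be reproduced in outline as follows. This is a sketch under stated assumptions: `field_overlap` is a hypothetical stand-in for the paper's similarity measure, the three toy cases are invented for illustration, and only the sampling procedure (repeatedly drawing two distinct cases and scoring them) reflects the text.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence

Case = Dict[str, str]


def field_overlap(a: Case, b: Case) -> float:
    """Hypothetical similarity: share of common fields holding equal values."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def random_pair_scores(cases: Sequence[Case],
                       score: Callable[[Case, Case], float],
                       n_pairs: int = 10_000,
                       seed: int = 0) -> List[float]:
    """Score n_pairs randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)  # fixed seed so the baseline is repeatable
    return [score(*rng.sample(cases, 2)) for _ in range(n_pairs)]


toy_cases = [
    {"os": "Windows", "encoding": "GB2312"},
    {"os": "Windows", "encoding": "EUC-KR"},
    {"os": "Linux", "encoding": "GB2312"},
]
scores = random_pair_scores(toy_cases, field_overlap, n_pairs=1000)
print("mean:", statistics.mean(scores))
print("variance:", statistics.pvariance(scores))
```

Comparing a case's top similarity score against the mean of such a baseline is one way to judge whether a score that looks numerically modest is still well above chance.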

Figure 13 shows the distribution of the similarity score for the randomly selected 100 hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in DS and SPE.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Field | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
| Case name / Notifier | DarkSeoul (DS) | Hmei7 | d3b_X | StifLer |
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203248195178 | 2038623868 | 2031243766 | 77921083 |
| Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Field | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
| Case name / Notifier | Sony Pictures Entertainment (SPE) | Oaddah | MTRiX | EL_MuHaMMeD |
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203131222102 | 2031241555 | 20829198 | 2081164534 |
| Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 Jun 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related context. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator to understand the flow and core of the crime clearly by laying out multidimensional evidence that is otherwise entangled or hidden and therefore hard to present.
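The attack rates used in the follow-up investigation (the share of a hacker's cases per year or per top-level domain) amount to a simple frequency count. The sketch below assumes each case is a dictionary with hypothetical keys such as `year` and `gtld`; the function name `attack_rates` and the sample records are illustrative and not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Mapping


def attack_rates(cases: Iterable[Mapping[str, Hashable]],
                 key: str) -> Dict[Hashable, float]:
    """Percentage of cases per distinct value of `key`, rounded to two decimals."""
    counts = Counter(case[key] for case in cases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(100 * n / total, 2) for value, n in counts.items()}


# Hypothetical cases attributed to a single hacker.
hacker_cases = [
    {"year": 2013, "gtld": "com"},
    {"year": 2013, "gtld": "gov"},
    {"year": 2014, "gtld": "com"},
    {"year": 2014, "gtld": "com"},
    {"year": 2014, "gtld": "net"},
]
print(attack_rates(hacker_cases, "year"))
print(attack_rates(hacker_cases, "gtld"))
```

Running the same count per hacker and per field yields one column of a table such as Table 5.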

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing the case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (X-axis: similarity score; Y-axis: frequency; Mean = 0.2930, Var = 0.0866). (b) Distribution of the similarity value between the collected website defacement cases and the DS case (A) and between the collected website defacement cases and the SPE case (B) (X-axis: similarity score; Y-axis: frequency). The highest similarity score was 0.69 on the DarkSeoul case and 0.615 on the Sony Pictures Entertainment case. The similarity was calculated between each studied case and all other cases in our system.

Figure 13: Distribution of the similarity score for randomly selected 100 hackers (X-axis: mean value of the similarity score; Y-axis: frequency).


improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. For example, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using the CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning processes. Also, based on the large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on some prime incidents; data-driven

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%). Hmei7, d3b_X, and StifLer belong to the DS case; Oaddah, MTRiX, and EL_MuHaMMeD belong to the SPE case.

| Domain | Hmei7 | d3b_X | StifLer | Oaddah | MTRiX | EL_MuHaMMeD |
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 | d3b_X | StifLer | Oaddah | MTRiX | EL_MuHaMMeD |
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Each alluvial diagram links the columns Hacker, Year, ccTLD, gTLD, and Attack.


and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time needed for analysis is an important factor in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with various other data sources should be considered. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the constituent elements of a crime and narrow the scope of investigation. Furthermore, the proposed methodology can be extended so that incident information is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers' malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21/.


[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


CBR is also used to provide the reasoning power to search for similar previous cases [25, 42]. However, biased or imperfect collected data deteriorate the quality of the decision support provided by CBR. Therefore, in many cases, the weights of the selected features are set based on empirical knowledge, which can subsequently be used to enable the detection and analysis of crime patterns from temporal crime activity data. Using clustering and classification techniques, as well as speculative models for searching similar past crime cases, investigators can easily extract useful information from unstructured textual datasets [43]. Hence, investigators must collect and continuously update comprehensive crime data.

Clustering is the task of determining similar groups in the data and is a form of unsupervised learning. Zulfadhilah et al. compared four types of clustering algorithms (K-means, hierarchical clustering, SOM, and the Expectation Maximization algorithm (EM clustering)) based on their performances. They concluded that the K-means algorithm and the EM algorithm are better than the hierarchical clustering algorithm. In general, partitioning algorithms such as the K-means and EM algorithms are highly recommended for use with large datasets [44]. In summary, clustering algorithms can help the investigator detect crime patterns and accelerate crime solving. A weighting scheme for attributes can handle the limitations of the clustering techniques [45].

3. Methodology

In this section, we present the detailed scheme of the decision support methodology for cybercrime investigation, with a focus on website defacement cases. A conceptual framework and its process are illustrated in Figure 3. The scheme proceeds in the following three steps: data preprocessing, case vector design, and reasoning engine. First, we provide a brief outline of the dataset and describe the merits of the website defacement data. We also summarize the preprocessing for data parsing and cleaning according to the collected data type. Next, we design the case vector and choose the significant features to apply in the reasoning. Finally, the reasoning engine has various functionalities and is intended for the grouping (clustering) of cases based on their similarity.

3.1. Preprocessing. As part of the proposed analytical framework, we developed a crawler to automatically collect 212,093 website defacement cases from the zone-h.org site. Many website defacement cases are recorded daily in the archive page of the zone-h.org site. Each case registered in the archive page provides information (i.e., IP address, Domain, Date, OS, Notifier, and Web server) in the same format through its mirror page. First, the crawler collects all public information relevant to each case. Thereafter, on accessing the domain site, it saves the data in the raw format of the HTML source. After crawling the web resources as raw data, data preprocessing is performed to amend incomplete, improperly formatted, or duplicate data records. More specifically, there are various tag attributes in the HTML source. Encoding and font data are extracted through the <charset> and <font-style> tags of the HTML elements set between the <head> and </head> tags in the HTML source. Also, the image, sound file, and linked site are extracted through the <font-family>, <img>, and <href> tags set between the <body> and </body> tags in the HTML source. The web resources, as original raw data, were parsed and cleaned depending on the relevant case vector (see Figure 4). After cleaning the data, some significant data fields were selectively stored in the system's case database.
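The parsing step can be illustrated with a minimal sketch based on Python's standard `html.parser` module. This is not the authors' crawler: the class name `DefacementPageParser`, the function `extract_fields`, and the choice of attributes inspected (`meta` charset declarations, `font face`, inline `font-family` styles, `img src`, `a href`) are assumptions made for illustration only.

```python
import re
from html.parser import HTMLParser


class DefacementPageParser(HTMLParser):
    """Collect the declared charset, fonts, images, and links of one page."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.fonts = set()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Either <meta charset="..."> or a Content-Type meta declaration.
            declared = attrs.get("charset")
            if not declared:
                found = re.search(r"charset=([\w-]+)", attrs.get("content") or "", re.I)
                declared = found.group(1) if found else None
            if declared and self.charset is None:
                self.charset = declared.lower()
        elif tag == "font" and attrs.get("face"):
            self.fonts.add(attrs["face"].strip())
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        # Inline CSS on any tag may also name a font family.
        found = re.search(r"font-family\s*:\s*([^;]+)", attrs.get("style") or "", re.I)
        if found:
            self.fonts.add(found.group(1).strip().strip("'\""))


def extract_fields(html_source):
    """Return the fields of interest from raw HTML source (hypothetical layout)."""
    parser = DefacementPageParser()
    parser.feed(html_source)
    return {
        "encoding": parser.charset,
        "fonts": sorted(parser.fonts),
        "images": parser.images,
        "links": parser.links,
    }
```

Records for which `encoding` comes back as `None` would be candidates for the cleaning step, since the encoding is the feature the later analysis relies on most.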

The selected data fields were related to the information about the website defacement date, related IP address, target domain, target system OS, and web server version; these aspects have proven to be useful for cyberattack investigations [46]. Specifically, the encoding method and the font contained in the HTML source were necessary to speculate on the attacker's regional information. For example, if messages remaining in a defaced website are written in ISO/IEC 8859 encoding, we can subsequently infer that the hackers' language is German, Spanish, or Swedish. Furthermore, depending on whether all the messages are written in the same encoding method, the special characters used, such as ß, ñ, or å, can serve as a clue for guessing the attacker's origin. In general, encodings from Windows-1250 to Windows-1258 are used in the central European languages, as well as in Turkish, Baltic languages, and Vietnamese. By contrast, GB encoding is used in Chinese, HKSCS encoding is used in Taiwanese, and EUC-KR or ISO-2022-KR encoding is used in Korean [47]. In addition to the font and encoding information, the text, image, audio, and video found in the messages are also necessary parameters for case identification.

3.2. Case Vector Design. We designed the case vector in two types, one for the similarity measure and one for the clustering processing. The case vector is summarized in Table 1. Features of various aspects, such as the font, web server, thanks-to, and notifier (hackers or hacking groups), as well as features such as the encoding, IP address, domain, attack date, and OS, were extractable from the public archival site zone-h.org. Generally, more diverse features can be a significant factor for investigating relationships and associations among hackers or hacking groups, as well as the scale and the density/intensity of the hacker community. However, such a premise has some shortcomings. The importance, or the weight, of each feature may differ depending on the criterion. Also, if all features are treated as important, machine-learning algorithms such as clustering or classification are difficult to perform in practice because of the high computational cost of the analysis. Moreover, some features have similar meanings and would be processed redundantly. To this end, dimensionality reduction and feature selection were performed in the present study. After a thorough review by security experts, the significant features were selected for the case vector of website defacement cases. The detailed explanation of the dimensionality reduction and the feature selection is as follows.


In the Windows operating system, if a specific font is not designated inside the HTML code, for example through the <font-family> property, the characters on a website page may appear broken. In particular, some of the fonts used in the Chinese-character cultural sphere depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, this characteristic may be key evidence for uncovering a correlation between the victim and the attacker; however, it is extremely rare in the collected website defacement cases. Therefore, it is not suitable as a case vector for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc., when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described before, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector for cybercrime investigation. Finally, although the case vector concerning thanks-to and notifier can be used to analyse a hidden network between hackers and hacker groups, the analysis of such a network should be addressed in future research.

As a result, we defined the case vector by dividing it into two types, i.e., a version for the similarity measure and a version for the clustering processing. As the features of the case vector, the encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding is a case vector that provides decisive clues related to the attacker's regional information. The IP address and domain give clues related to the victim's location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. The detailed explanation of key features is provided in Table 1.

Figure 3: Proposed analytical framework for the data-driven website defacement cases. The crawler gets the metadata and HTML source of a website defacement case through the mirror page and the archive page of the zone-h.org site. Preprocessing (data parsing and data cleaning) and case vector design (feature selection and feature normalization) feed the cases-centric DB. In the reasoning engine, the similarity module calculates weights and values, matches a new attack case with a former attack case depending on the defined case vector, and measures the similarity score depending on the weights and values; the clustering module performs the clustering processing through the EM algorithm and derives several clusters which exhibit similar patterns.

Figure 4: Sorted dataset through the preprocessing.

The normalization result of various feature elements stored in the raw form of the HTML source is presented in Figure 5. Encodings in the ISO series and the MS Windows series are normalized according to the region or country in which each encoding is used. The gTLD is normalized according to groups or organizations with similar characteristics. The ccTLD is normalized according to continent. Although the compression and normalization of features make analyses such as clustering processing and the similarity measure simple and clear, they may also cause loss of information in the original data or make detailed analysis more difficult.
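The normalization described above amounts to a set of lookup tables. The following minimal sketch shows the idea; the groupings follow Figure 5, but the tables are a small illustrative subset, the pairing of ccTLDs with continents is inferred from the figure, and the names `normalize_case`, `ENCODING_GROUPS`, `GTLD_GROUPS`, `CCTLD_CONTINENTS`, and `OS_KEYWORDS` are hypothetical.

```python
# Partial lookup tables (illustrative subset of the groups shown in Figure 5).
ENCODING_GROUPS = {
    "iso-8859-6": "Arabic", "windows-1256": "Arabic",
    "iso-8859-2": "CentralEurope", "windows-1250": "CentralEurope",
    "iso-8859-5": "Cyrillic", "windows-1251": "Cyrillic",
    "iso-8859-9": "Turkish", "windows-1254": "Turkish",
    "iso-8859-1": "WestEurope", "windows-1252": "WestEurope",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
}
GTLD_GROUPS = {
    "com": "com", "co": "com", "org": "org", "or": "org",
    "gov": "gov", "go": "gov", "gob": "gov",
    "edu": "edu", "ac": "edu", "net": "net", "mil": "mil",
}
CCTLD_CONTINENTS = {
    "kr": "EastAsia", "cn": "EastAsia", "jp": "EastAsia",
    "fr": "WesternEurope", "uk": "WesternEurope",
    "br": "SouthAmerica", "ar": "SouthAmerica",
}
# Checked in order; "mac" comes first so that it is tested before "win".
OS_KEYWORDS = [
    ("mac", "MacOS"), ("win", "Windows-basedOS"),
    ("linux", "Linux-basedOS"), ("freebsd", "Linux-basedOS"),
    ("unix", "Unix-basedOS"), ("aix", "Unix-basedOS"), ("tru64", "Unix-basedOS"),
]


def normalize_case(encoding, gtld, cctld, os_name, unknown="Unknown"):
    """Map raw feature values onto the coarse groups used for clustering."""
    os_group = next(
        (group for key, group in OS_KEYWORDS if key in os_name.lower()), unknown
    )
    return {
        "encoding": ENCODING_GROUPS.get(encoding.lower(), unknown),
        "gTLD": GTLD_GROUPS.get(gtld.lower(), unknown),
        "ccTLD": CCTLD_CONTINENTS.get(cctld.lower(), unknown),
        "OS": os_group,
    }
```

Values that fall outside the tables map to `"Unknown"` here, which makes the information loss mentioned above explicit rather than silent.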

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance of nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, owing to a better understanding of complicated relationships and a more timely response. In the present study, we implemented the reasoning engine as two processing entities: the similarity measure processing and the clustering algorithm processing (see below for further details).

Table 1: Case vector design highlighting two groups of features. The columns S and C indicate the process in which each feature is used (S: similarity measure; C: clustering processing).

| Case vector | S | C | Description |
| --- | --- | --- | --- |
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on the MS Windows and ISO character sets. |
| IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., the go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents a country, specific region, or an international organization. The ccTLD normalized by continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | N/A | The date of the attack performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we propose a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. To begin with, if one retrieved case (RC, a new case) is given and there are "n" cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between the RC and the TCs is conducted "n" times. We defined the extent of similarity between RC and TC as a numerical value from "0" to "1," where "0" means that RC and TC are unrelated and "1" means that RC and TC are identical. The similarity score S ($0 \le S \le 1$) specifies the extent of similarity between RC and TC. The closer the similarity score is to "1," the more analogous RC and TC are to each other. In the event of multiple case vectors, similarity can be expressed as a weighted sum over the case vectors:

$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \text{distance}\left(RC_{cv}, TC_{cv}\right) \times \text{weight}_{cv} \right], \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be periodically updated to reflect recent attack trends. However, for the initial setting, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since encoding makes it possible to infer the static location information of the attacker, we defined encoding as high-quality information. IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to have a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts in order to prevent subjective bias. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

Figure 5: Normalization of each feature element. Encodings are normalized into Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe (e.g., ISO-8859-6 and Windows-1256 into Arabic; GB2312, GB18030, and GBK into Chinese; ISO-2022-KR and EUC-KR into Korean). gTLDs are normalized into com, edu, gov, org, biz, mil, net, coop, and info (e.g., gov, go, and gob into gov; edu and ac into edu; org and or into org). ccTLDs are normalized into Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe. OSs are normalized into Linux-based OS, MacOS, Unix-based OS, and Windows-based OS.

Algorithm 1: Similarity measure module.

    Input:  TCs (Tested_DB)        /* Tested_DB indicates the cases-centric DB */
            RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC}
                                   /* RC means one of the retrieved cases */
            W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
    Output: Similarity_score

    for each TC = {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} in TCs do
        if Encoding_RC = Encoding_TC then
            Encoding_similarity_value ← 1.0
        else
            Encoding_similarity_value ← 0.0
        end

        IP_RC = {OctetA_RC, OctetB_RC, OctetC_RC, OctetD_RC}
        IP_TC = {OctetA_TC, OctetB_TC, OctetC_TC, OctetD_TC}
        if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) ∧ (OctetD_RC = OctetD_TC) then
            IP_similarity_value ← 1.0
        else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) then
            IP_similarity_value ← 0.75
        else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) then
            IP_similarity_value ← 0.5
        else if (OctetA_RC = OctetA_TC) then
            IP_similarity_value ← 0.25
        else
            IP_similarity_value ← 0.0
        end

        Domain_RC = {ServiceName_RC, gTLD_RC, ccTLD_RC}
        Domain_TC = {ServiceName_TC, gTLD_TC, ccTLD_TC}
        if an identical domain then
            Domain_similarity_value ← 1.0
        else if (ServiceName_RC = ServiceName_TC) ∧ ((gTLD_RC = gTLD_TC) ∨ (ccTLD_RC = ccTLD_TC)) then
            Domain_similarity_value ← 0.8
        else if (gTLD_RC = gTLD_TC) ∧ (ccTLD_RC = ccTLD_TC) then
            Domain_similarity_value ← 0.3
        else if (ServiceName_RC = ServiceName_TC) then
            Domain_similarity_value ← 0.1
        else if (ccTLD_RC = ccTLD_TC) then
            Domain_similarity_value ← 0.1
        else if (gTLD_RC = gTLD_TC) then
            Domain_similarity_value ← 0.1
        else
            Domain_similarity_value ← 0.0
        end

        Date_variance ← |Date_RC − Date_TC|
                                   /* dates (yyyy-mm-dd) are converted into a number of days */
        if 0 ≤ Date_variance ≤ 365 then
            Date_similarity_value ← 1.0
        else if 365 < Date_variance ≤ 1095 then
            Date_similarity_value ← 0.75
        else if 1095 < Date_variance ≤ 1825 then
            Date_similarity_value ← 0.5
        else if 1825 < Date_variance ≤ 2555 then
            Date_similarity_value ← 0.25
        else if 2555 < Date_variance then
            Date_similarity_value ← 0.0
        end

        if OS_RC = OS_TC then
            OS_similarity_value ← 1.0
        else
            OS_similarity_value ← 0.0
        end

        Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W)
                           + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W)
                           + (OS_similarity_value × OS_W)
        return Similarity_score between RC and TC
    end for
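Algorithm 1, together with the weights and values of Table 2, can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the dictionary layout of a case, the `(service_name, gTLD, ccTLD)` tuple for a domain, and the function names are assumptions, and a missing gTLD or ccTLD (`None`) is treated as not matching.

```python
from datetime import date

# Weights per case vector, taken from Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(ip_a, ip_b):
    """0.25 for every leading octet that matches, stopping at the first mismatch."""
    matched = 0
    for octet_a, octet_b in zip(ip_a.split("."), ip_b.split(".")):
        if octet_a != octet_b:
            break
        matched += 1
    return matched * 0.25


def _same(a, b):
    # A missing component (None) never counts as a match (assumption).
    return a is not None and a == b


def domain_similarity(dom_a, dom_b):
    """Each domain is a (service_name, gTLD, ccTLD) tuple (assumed layout)."""
    if dom_a == dom_b:
        return 1.0
    service = _same(dom_a[0], dom_b[0])
    gtld = _same(dom_a[1], dom_b[1])
    cctld = _same(dom_a[2], dom_b[2])
    if service and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if service or gtld or cctld:
        return 0.1
    return 0.0


def date_similarity(date_a, date_b):
    """Step function over the gap in days, with the thresholds of Algorithm 1."""
    gap = abs((date_a - date_b).days)
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return value
    return 0.0


def similarity_score(rc, tc, weights=WEIGHTS):
    """Weighted sum of the per-feature similarity values, as in equation (1)."""
    values = {
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(values[name] * weights[name] for name in weights)


def rank_cases(rc, tcs):
    """Score RC against every TC, drop zero scores, and sort best first."""
    scored = [(similarity_score(rc, tc), tc) for tc in tcs]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Keeping the weights in a single dictionary makes it straightforward to swap in revised values once experts have reviewed them.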

The distance for some case vectors cannot be directly estimated, as they have mixed numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by measuring the similarity among the same octets of two given IP addresses. The IP address space is composed of a combination of four octets separated by ".". In the present study, we checked whether the octets of RC and TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. We suggest the discrete similarity values between two IP addresses shown in Table 2. The proposed approach is advantageous in that it enables efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique code that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is built in correspondence with the characteristics of the groups, organizations, or corporations that the gTLD is intended for. The service name varies depending on the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as depicted in Table 2.
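The decomposition of a domain into service name, gTLD, and ccTLD can be sketched as below. The paper does not specify its parsing rules, so the heuristics here (a trailing two-letter label is read as the ccTLD, the next label is read as the gTLD if it appears in a known list) and the names `split_domain` and `GTLD_LABELS` are assumptions for illustration.

```python
# gTLD-like labels mentioned in the paper's normalization (Figure 5).
GTLD_LABELS = {
    "com", "co", "int", "info", "org", "or", "coop",
    "gov", "go", "gob", "edu", "ac", "net", "mil", "biz",
}


def split_domain(domain):
    """Return (service_name, gTLD, ccTLD); missing parts are None."""
    labels = domain.lower().strip(".").split(".")
    cctld = gtld = None
    # Heuristic: a trailing two-letter label is a country code.
    if len(labels) > 1 and len(labels[-1]) == 2 and labels[-1].isalpha():
        cctld = labels.pop()
    # Heuristic: the next label is the generic part if it is a known one.
    if len(labels) > 1 and labels[-1] in GTLD_LABELS:
        gtld = labels.pop()
    service_name = labels[-1] if labels else None
    return service_name, gtld, cctld
```

For instance, `split_domain("www.example.co.kr")` yields `("example", "co", "kr")`, which is the tuple shape the domain rule of Table 2 compares component by component.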

Furthermore, we defined the attack date similarity. Similar to offline criminal investigation, if the times at which crimes occurred are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, domain, and attack date were set to values between 0 and 1 according to the degree of variation within each defined range.
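As a worked illustration of equation (1) with the weights and values of Table 2, consider a hypothetical pair of cases (not taken from the dataset) with the same encoding group, IP addresses sharing only the first two octets, domains sharing the gTLD and ccTLD but not the service name, attack dates less than one year apart, and different OSs:

```latex
S = \underbrace{1 \times 0.5}_{\text{encoding}}
  + \underbrace{0.5 \times 0.2}_{\text{IP address}}
  + \underbrace{0.3 \times 0.15}_{\text{domain}}
  + \underbrace{1 \times 0.1}_{\text{date}}
  + \underbrace{0 \times 0.05}_{\text{OS}}
  = 0.5 + 0.1 + 0.045 + 0.1 + 0 = 0.745
```

The encoding term alone contributes two thirds of this score, which reflects its "high" impact in the weighting scheme.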

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that can capture the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a fixed parameter but automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that a data item belongs to each cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to the 80,948 data items, out of the 212,093 collected, that have encoding, gTLD, ccTLD, and OS information. The character encoding was normalized into groups of related code units (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use in the clustering processing of the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
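To make the EM step concrete, the sketch below fits a mixture of independent categorical features (encoding, gTLD, ccTLD, OS) in pure Python. It is a simplified stand-in, not the tool used in the study: the number of clusters `k` is fixed by the caller instead of being chosen by cross-validation, and the function name `em_categorical`, the smoothing constant `alpha`, and the random soft initialization are assumptions.

```python
import math
import random


def em_categorical(rows, k, iters=50, seed=0, alpha=1.0):
    """Cluster tuples of categorical values with EM; returns one label per row."""
    rng = random.Random(seed)
    n_feat = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_feat)]

    # Start from random soft assignments (responsibilities).
    resp = []
    for _ in rows:
        weights = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(weights)
        resp.append([w / total for w in weights])

    for _ in range(iters):
        # M-step: mixing proportions and smoothed value probabilities per cluster.
        mix = [sum(r[c] for r in resp) / len(rows) for c in range(k)]
        theta = []
        for c in range(k):
            per_feature = []
            for j in range(n_feat):
                counts = {v: alpha for v in values[j]}
                for r, row in zip(resp, rows):
                    counts[row[j]] += r[c]
                total = sum(counts.values())
                per_feature.append({v: n / total for v, n in counts.items()})
            theta.append(per_feature)

        # E-step: posterior probability of each cluster for each case.
        for i, row in enumerate(rows):
            logp = [
                math.log(mix[c] + 1e-12)
                + sum(math.log(theta[c][j][row[j]]) for j in range(n_feat))
                for c in range(k)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp[i] = [w / total for w in weights]

    # Hard label: the most probable cluster of each case.
    return [max(range(k), key=lambda c: r[c]) for r in resp]
```

A model-selection loop around this function (for example, comparing held-out log-likelihood for several values of `k`) would be needed to reproduce the automatic choice of the number of clusters described above.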

The donut charts show the different features from outside to inside, with the share of each feature value indicated by a different colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the share of cases that each cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were 4 and 7, which share the use of Arabic and Chinese and attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process contribute to the concretization of the similarity between new and existing cases. Naturally, if a large number of new cases flow into the database and the clustering process is performed again on the dataset, the clustering result may take on a different pattern.

Table 2: Value and weight for the similarity score by case vector. All similarity values are normalized to the range from 0 to 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
| --- | --- | --- | --- | --- |
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 143.248.1.6 and 143.248.1.6) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 143.248.1.6 and 143.248.1.8) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 143.248.1.6 and 143.248.4.4) | 0.5 |
| | | | Only the 1st octet is matched | 0.25 |
| | | | No common octet | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
| --- | --- | --- | --- |
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage networks and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage networks and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side).

4. Application

4.1. Experimental Results and Analysis. Because the assumption that attackers tend to use similar or unique attack methods is not always valid, it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology using the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps, which are depicted in detail in Figure 7, where "K" denotes all hackers within the database.

(i) Step 1: selection. The measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2: case labelling. We retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and subsequently labelled all previous attack cases with each hacker name.

(iii) Step 3: case extraction. We selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4: scoring. The similarity scores, computed from the values and weights by case vector (see Table 2), were sorted in descending order. Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker were in the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N\ \text{Scope} = \frac{\text{Count}\left(\text{Cases}_{\text{Scope},K}\right)}{\text{Count}\left(\text{Cases}_{\text{all},K}\right)} \times 100, \tag{2}$$

$$R_k = \frac{\text{Count}\left(\text{Cases}_{m,k}\right)}{\text{Count}\left(\text{Cases}_{\text{all},k}\right)}, \tag{3}$$

where "m" means the past cases which are within the defined scope concerning a randomly selected hacker "k".
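The scoring step and the two ratios can be sketched as follows. This is an illustrative reading of equations (2) and (3), not the authors' evaluation code: the function names are hypothetical, and the top N percent cutoff is assumed to be taken relative to the total number of cases in the database.

```python
import math


def top_n_scope(ranked_hackers, n_percent, total_cases):
    """Hacker labels of the cases that fall inside the top N percent.

    ranked_hackers: hacker name of every scored case, best score first,
                    with zero-score cases already dropped (Step 4).
    total_cases:    number of cases in the whole database (assumed base of N).
    """
    cutoff = math.ceil(total_cases * n_percent / 100)
    return ranked_hackers[:cutoff]


def ratio_r(ranked_hackers, hacker, n_percent, total_cases, total_by_hacker):
    """Equation (3): share of the hacker's past cases found inside the scope."""
    in_scope = top_n_scope(ranked_hackers, n_percent, total_cases).count(hacker)
    return in_scope / total_by_hacker


def is_identified(ranked_hackers, hacker, n_percent, r_threshold,
                  total_cases, total_by_hacker):
    """True if at least r_threshold of the hacker's cases are in the top N scope."""
    ratio = ratio_r(ranked_hackers, hacker, n_percent, total_cases, total_by_hacker)
    return ratio >= r_threshold
```

Counting the hackers for which `is_identified` holds, over a grid of N scopes and R thresholds, gives the kind of tally the evaluation reports.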

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result stemming from the similarity measurement (i.e., the sorted hacker's past cases with a high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases within the top N scope were then taken (see Figure 9). Also, in the case of the hacking case ratio of a randomly selected hacker, some part of the past attack cases (i.e., ratio R) of a hacker was within the defined N scope (see Figure 9).

Figure 7: The developed testing procedures from step 1 to step 4 (Step 1: selection of 100 random hackers from the database; Step 2: case labelling per hacker; Step 3: extraction of the most recent case as the retrieved case; Step 4: scoring against the cases-centric DB using the weighted sum of equation (1)).

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, including the eight criterion factors (%), and of the ratio R, including the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers in the top N scope. Consequently, even when hacking cases were caused by the same hacker, it is impossible to obtain a high similarity score for all cases of that hacker, because hackers or hacking groups that attacked only the same or similar targets were rare. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of candidate hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate cases in this way, our similarity mechanism is highly valuable for shortening the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third-largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-Calaveras image (i.e., skull image) used on LG U+’s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers’ background is European. “HASTATI,” the word written on the KBS homepage, means the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

Figure 8: The number of website defacement attack cases in the past of each hacker (X-axis: hacker; Y-axis: number of cases).

Figure 9: Scoring step on the top N scope (1~30) and the ratio R (50~100).

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. In terms of colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) took on characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull image are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, DS and SPE, were applied as RCs, and the similarity score for each of them was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases supplied as input values, we considered a direct comparison between the DS and SPE cases for the similarity score to be inappropriate [52].
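As an illustration of how such a score can be computed, the following minimal sketch implements the weighted sum shown in Figure 7, i.e., the sum over the case vector of Distance(RC_cv, TC_cv) × Weight_cv. The per-feature scoring functions, the weights, and the sample values are assumptions chosen for illustration only; the paper's actual definitions are given by equation (1) and Algorithm 1.

```python
def exact_score(a, b):
    """Discrete score for nominal data: 1.0 on an exact match, else 0.0."""
    return 1.0 if a == b else 0.0


def ip_score(a, b):
    """Discrete score for IPv4 strings: fraction of leading octets shared."""
    matched = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        matched += 1
    return matched / 4


# Hypothetical per-feature scorers and weights (not the paper's values).
SCORERS = {"encoding": exact_score, "ip": ip_score, "os": exact_score}
WEIGHTS = {"encoding": 0.5, "ip": 0.3, "os": 0.2}


def similarity(rc, tc):
    """Weighted sum over the case vector of per-feature scores."""
    return sum(SCORERS[cv](rc[cv], tc[cv]) * WEIGHTS[cv] for cv in WEIGHTS)


# Made-up retrieved case (RC) and tested case (TC).
rc = {"encoding": "Windows-1252", "ip": "10.0.0.1", "os": "Windows"}
tc = {"encoding": "Windows-1252", "ip": "10.0.5.9", "os": "Linux"}
score = similarity(rc, tc)
```

With weights that sum to 1, the score stays in the range 0 to 1, which matches the range of the similarity values reported in Table 4.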

The similarity measure mentioned in the previous paragraph is based on the metadata released in an analysis report of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (X-axis: criterion of the top N, from Top 1 to Top 30; Y-axis: number of identified hackers; series: ratio of the attack cases, from 50 to 100).

Next, to ensure the objectivity of the similarity scores obtained in the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole set of cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. To build the distribution, the similarity score of two randomly selected website defacement cases was calculated repeatedly, 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
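The random-pair baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the seed, and the toy similarity function are hypothetical, and any similarity function that takes two cases and returns a score can be plugged in.

```python
import random
from statistics import mean


def random_pair_baseline(cases, similarity, trials=10_000, seed=0):
    """Score `trials` randomly drawn pairs of distinct cases.

    Returns the mean score and the list of individual scores, which can be
    plotted as a histogram such as the one in Figure 12(a).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = []
    for _ in range(trials):
        a, b = rng.sample(cases, 2)  # two different cases per trial
        scores.append(similarity(a, b))
    return mean(scores), scores


# Toy data: cases are integers and two cases are "similar" when both are
# even or both are odd (a stand-in for the real similarity measure).
toy_cases = list(range(100))
baseline_mean, baseline_scores = random_pair_baseline(
    toy_cases, lambda a, b: 1.0 if a % 2 == b % 2 else 0.0
)
```

Comparing the top score of a new case against such a baseline is what allows a value such as 0.69 to be read as high relative to scores that are typically distributed around 0.3.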

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker’s own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case, and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case name (notifier) | DarkSeoul (DS), retrieved case | Hmei7, tested case | d3b_X, tested case | StifLer, tested case |
| --- | --- | --- | --- | --- |
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203248195178 | 2038623868 | 2031243766 | 77921083 |
| Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case name (notifier) | Sony Pictures Entertainment (SPE), retrieved case | Oaddah, tested case | MTRiX, tested case | EL_MuHaMMeD, tested case |
| --- | --- | --- | --- | --- |
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203131222102 | 2031241555 | 20829198 | 2081164534 |
| Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers’ attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. To describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram (Figure 14), a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is intricately entangled or hidden and therefore hard to present.
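The attack rates used in the follow-up investigation can be derived with a simple frequency count. The sketch below is illustrative only: it assumes that each past case of a hacker is available as a dictionary with hypothetical "year" and "gtld" keys, and the sample records are made up.

```python
from collections import Counter


def attack_rates(cases, key):
    """Return the share (in percent) of a hacker's cases per value of `key`."""
    counts = Counter(case[key] for case in cases)
    total = sum(counts.values())
    return {value: round(100 * n / total, 2) for value, n in counts.items()}


# Made-up past cases of a single hacker.
past_cases = [
    {"year": 2013, "gtld": "com"},
    {"year": 2014, "gtld": "com"},
    {"year": 2014, "gtld": "gov"},
    {"year": 2014, "gtld": "com"},
]
by_year = attack_rates(past_cases, "year")
by_gtld = attack_rates(past_cases, "gtld")
```

Running the same count per hacker and per feature yields one column of an attack-rate table of the kind shown in Table 5.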

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources for improving the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, variance = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. X-axis: similarity score; Y-axis: frequency.

Figure 13: Distribution of the similarity score for randomly selected 100 hackers (X-axis: mean value of the similarity score; Y-axis: frequency).

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents can be included in our study. If a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites’ resources were sanitized through data parsing and data cleaning. Based on this large real-world dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for use in the case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and the significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Reducing the amount of data and the time needed for analysis is also an important factor in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker’s attack rate (%).

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
| --- | --- | --- | --- | --- | --- | --- |
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
| --- | --- | --- | --- | --- | --- | --- |
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). The alluvial diagram links the columns hacker, year, ccTLD, gTLD, and attack.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates the analysis and reduces the time required to search for possible cybercrime suspects. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be reviewed. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help in reasoning about the constituent elements of a crime and in narrowing the scope of the investigation. Furthermore, the proposed methodology can be expanded to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are present for only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we plan to classify and systematize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, ‘Shamoon’ cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


In the Windows operating system, if a specific font is not designated in the HTML code, for example through the font-family property, the characters on a website page may appear broken. In particular, some of the fonts in the cultural area of Chinese characters depend on the character encoding (e.g., font-family: Gulim, MingLiU, and STHeiti) [48]. Similar to the encoding feature, although this characteristic may be key evidence for uncovering a correlation between the victim and the attacker, it is extremely rare in the collected website defacement cases. Therefore, it is not suitable as a case vector for cybercrime investigation. Meanwhile, a web server provides HTML, CSS, JavaScript, etc. when a client requests a web page. While the Apache and IIS web servers are primarily used in the Windows environment, the LiteSpeed web server is primarily used in the Linux environment, and the Enterprise web server is primarily used in the UNIX environment. Therefore, the web server is selectively dependent on the OS environment. As with the font feature described before, since the web server feature could not be found in the collected website defacement cases, it was not suitable as a case vector for cybercrime investigation. Finally, although the case vector concerning thanks-to and notifier can be used to analyse a hidden network between hackers and hacker groups, such network analysis should be addressed in future research.

As a result, we defined the case vector in two versions: one for the similarity measure and one for the clustering processing. The encoding, IP address, domain (i.e., service name, gTLD, and ccTLD), attack date, and OS were used as features of the case vector in the similarity measure, whereas only the encoding, gTLD, ccTLD, and OS were used in the clustering processing. The encoding is a case vector that provides decisive clues related to the attacker’s region information. The IP address and domain give clues related to the victim’s location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. A detailed explanation of the key features is provided in Table 1.

Figure 3: Proposed analytical framework for the data-driven website defacement cases. Components: crawler (gets the metadata and HTML source of a website defacement case through the mirror page/archive page of the zone-h.org site); preprocessing (data parsing, data cleaning, feature selection, and feature normalization); case vector design; cases-centric DB; and reasoning engine, consisting of a similarity module (calculates weights and values, matches a new attack case with a former attack case depending on the defined case vector, and measures the similarity score depending on the weights and values) and a clustering module (performs the clustering processing through the EM algorithm and derives several clusters that exhibit similar patterns).

Figure 4: Sorted dataset through the preprocessing.

The normalization result of the various feature elements stored in the raw form of the HTML source is presented in Figure 5. The encoding was normalized into the ISO series and MS Windows series according to the encoding used in each region or country. The gTLD was normalized into groups or organizations with similar characteristics. The ccTLD was normalized by continent. Although the compression and normalization of features make analyses such as clustering processing and similarity measurement simple and clear, they may also cause a loss of information in the original data or make a detailed analysis more difficult.
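The normalization described above can be sketched as a set of lookup tables. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, the tables contain only a handful of entries (the encoding and gTLD groupings follow Figure 5 and Table 1, and the few ccTLD-to-continent entries are an assumed reading of Figure 5), and unknown values are simply mapped to `None`.

```python
# Illustrative lookup tables only; a real system would need complete mappings.
ENCODING_GROUPS = {
    "iso-8859-6": "Arabic", "windows-1256": "Arabic",
    "iso-8859-2": "CentralEurope", "windows-1250": "CentralEurope",
    "iso-8859-5": "Cyrillic", "windows-1251": "Cyrillic",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
}
GTLD_GROUPS = {
    "gov": "gov", "go": "gov", "gob": "gov",
    "com": "com", "co": "com",
    "org": "org", "or": "org",
    "edu": "edu", "ac": "edu",
}
CCTLD_CONTINENTS = {  # assumed assignments, for illustration
    "kr": "EastAsia", "cn": "EastAsia", "jp": "EastAsia",
    "br": "SouthAmerica", "ar": "SouthAmerica",
    "fr": "WesternEurope", "uk": "WesternEurope",
}


def _lookup(table, raw):
    """Normalize case, whitespace, and a leading dot before the lookup."""
    key = raw.strip().lstrip(".").lower()
    return table.get(key)  # None when the value is not covered


def normalize_encoding(raw):
    return _lookup(ENCODING_GROUPS, raw)


def normalize_gtld(raw):
    return _lookup(GTLD_GROUPS, raw)


def normalize_cctld(raw):
    return _lookup(CCTLD_CONTINENTS, raw)
```

Keeping the raw value alongside the normalized one would mitigate the information loss mentioned above, since the similarity measure uses the original ccTLD while the clustering uses the continent.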

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance of nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up the process of solving a crime, owing to a better understanding of complicated relationships and a more timely response. In the present study, we implemented a reasoning engine consisting of two processing entities: the similarity measure processing and the clustering algorithm processing (see below for further details).

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we proposed a similarity algorithm that operates by comparing a retrieved website defacement case with all cases in the cases-centric DB. To begin with, if one retrieved case (RC, a new case) is given and there are "n" cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between RC and TCs is conducted "n" times. We defined the extent of similarity between RC and TC as a numerical value from "0" to "1," where "0" means that RC and TC are unrelated and "1" means that RC and TC are identical. The similarity score (0 < S < 1) specifies the extent of similarity between RC and TC. The closer the similarity score is to "1," the more analogous RC and TC are to each other. In the event of multiple case vectors, similarity can be expressed as a weighted sum over the case vectors.

Table 1: Case vector design highlighting two groups of features.

Case vector | S | C | Description
Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on MS Windows and the ISO character set.
IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other.
Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories such as gTLD or ccTLD.
Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., go, gob, and gobr features were normalized into gov).
Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, specific region, or an international organization. The ccTLD normalized by the continent is used in the clustering process, and the original ccTLD is used in the similarity process.
Date | O | N/A | The attack date performed by the hacker or the hacking group.
OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX).

S: similarity measure; C: clustering processing; O: used; N/A: not used.


$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \text{distance}\left(RC_{cv}, TC_{cv}\right) \times \text{weight}_{cv} \right] \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be periodically updated to reflect recent attack trends. However, for the initial setting, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of the case vector as high, medium, and low according to their importance, so as to concretely categorize the attacker and the victim. Above all, since encoding makes it possible to infer static location information about the attacker, we defined encoding as high-quality information. IP address and domain were defined as medium-quality information. These case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to have a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and

Figure 5: Normalization of each feature element. The figure maps raw values to normalized categories for four features. Encoding: Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe (e.g., ISO-8859-6 and Windows-1256 are normalized to Arabic). gTLD: com, edu, gov, org, biz, mil, net, coop, and info (e.g., gov, go, and gob are normalized to gov). ccTLD: Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe. OS: Linux-based OS, MacOS, Unix-based OS, and Windows-based OS.


Input: TCs (Tested_DB) /* The Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC} /* RC means one of the retrieved cases */
       W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

TC {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
while RC in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end
    IP_RC = {OctetA_RC, OctetB_RC, OctetC_RC, OctetD_RC}, IP_TC = {OctetA_TC, OctetB_TC, OctetC_TC, OctetD_TC}
    if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) ∧ (OctetD_RC = OctetD_TC) then
        IP_similarity_value ← 1.0
    else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) ∧ (OctetC_RC = OctetC_TC) then
        IP_similarity_value ← 0.75
    else if (OctetA_RC = OctetA_TC) ∧ (OctetB_RC = OctetB_TC) then
        IP_similarity_value ← 0.5
    else if (OctetA_RC = OctetA_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end
    Domain_RC = {ServiceName_RC, gTLD_RC, ccTLD_RC}, Domain_TC = {ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) ∧ ((gTLD_RC = gTLD_TC) ∨ (ccTLD_RC = ccTLD_TC)) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) ∧ (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end
    Date_variance ← |Date_RC − Date_TC| /* It converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a numeric day count */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end
    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end
    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity score between RC and TC
end while

Algorithm 1: Similarity measure module.
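Algorithm 1 can be expressed compactly in code. The following Python sketch is one possible reading, not the authors' implementation: the dictionary layout of a case, the function names, and the interpretation of "an identical domain" as a match on all three domain components are assumptions made for illustration, while the weights and similarity values are those listed in Table 2.

```python
from datetime import date

# Weights per case vector, as listed in Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(ip_rc, ip_tc):
    """Add 0.25 per matching leading octet, stopping at the first mismatch."""
    matched = 0
    for octet_rc, octet_tc in zip(ip_rc.split("."), ip_tc.split(".")):
        if octet_rc != octet_tc:
            break
        matched += 1
    return matched * 0.25


def domain_similarity(dom_rc, dom_tc):
    """Discrete domain similarity following the rules in Table 2."""
    name = dom_rc["service_name"] == dom_tc["service_name"]
    gtld = dom_rc["gtld"] == dom_tc["gtld"]
    cctld = dom_rc["cctld"] == dom_tc["cctld"]
    if name and gtld and cctld:  # assumed meaning of "identical domain"
        return 1.0
    if name and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if name or gtld or cctld:
        return 0.1
    return 0.0


def date_similarity(date_rc, date_tc):
    """Bucket the absolute gap in days into the ranges used by Algorithm 1."""
    gap = abs((date_rc - date_tc).days)
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return value
    return 0.0


def similarity_score(rc, tc, weights=WEIGHTS):
    """Weighted sum of the per-vector similarity values (equation (1))."""
    values = {
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(values[key] * weights[key] for key in weights)
```

Ranking the cases-centric DB against a retrieved case then amounts to calling `similarity_score(rc, tc)` for every stored case and sorting the results in descending order.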


updating the weight value is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

The distance of some case vectors cannot be directly estimated, as they contain mixed numerical and nominal data (such as IP address range and domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the octets in the same position of two given IP addresses. An IP address is composed of four octets separated by ".". In the present study, we checked whether the octets of RC and TC, from the 1st octet to the 4th octet, were identical. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values we suggest for two IP addresses are listed in Table 2. The proposed approach is advantageous in that it enables an efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique sign that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is built to correspond with the characteristics of the groups, organizations, or corporations that the gTLD is intended for. The service name varies depending on the categories of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as listed in Table 2.

Furthermore, we defined the attack date similarity. As in an offline criminal investigation, if the times of crime occurrences are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, domain, and attack date were each set to a value between 0 and 1 according to the range into which the difference falls.

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them make it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that captures the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm based on the probability of the individual data attributes. This algorithm does not require the number of clusters as a parameter but automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that a data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to the 80,948 data items, out of 212,093, that had the information of encoding, gTLD, ccTLD, and OS. The character encoding was normalized into groups of related character sets (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from the clustering because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if we can find similar combinations of letters or numbers, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use as the similarity measure of the reasoning engine. Consequently, the characteristics and metadata of 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
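To illustrate the principle of EM clustering over nominal case vectors, the sketch below fits a mixture of independent categorical features in plain Python. It is a simplified stand-in rather than the procedure used in the study: the number of clusters `k` is fixed by the caller instead of being chosen by cross-validation, a small smoothing constant `eps` is added to avoid zero probabilities, and the function name and toy rows are hypothetical.

```python
import random


def em_categorical(rows, k, n_iter=50, seed=0, eps=1e-6):
    """Fit a k-component mixture of independent categorical features with EM.

    rows: list of equal-length tuples of nominal values
          (e.g., encoding, gTLD, ccTLD, OS).
    Returns (responsibilities, hard_labels).
    """
    rng = random.Random(seed)
    n, n_feat = len(rows), len(rows[0])
    levels = [sorted({row[j] for row in rows}) for j in range(n_feat)]

    # Random soft assignment to break the symmetry between components.
    resp = []
    for _ in range(n):
        w = [rng.random() + eps for _ in range(k)]
        total = sum(w)
        resp.append([x / total for x in w])

    for _ in range(n_iter):
        # M-step: mixing proportions and per-feature category probabilities.
        prior = [sum(resp[i][c] for i in range(n)) / n for c in range(k)]
        prob = []
        for c in range(k):
            per_feature = []
            for j in range(n_feat):
                counts = {v: eps for v in levels[j]}
                for i, row in enumerate(rows):
                    counts[row[j]] += resp[i][c]
                total = sum(counts.values())
                per_feature.append({v: cnt / total for v, cnt in counts.items()})
            prob.append(per_feature)

        # E-step: posterior probability of each component for each row.
        for i, row in enumerate(rows):
            joint = []
            for c in range(k):
                p = prior[c]
                for j in range(n_feat):
                    p *= prob[c][j][row[j]]
                joint.append(p)
            total = sum(joint)
            resp[i] = [p / total for p in joint]

    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return resp, labels
```

Each row receives a probability of membership in every cluster, which matches the probabilistic assignment described above; the hard label is simply the most probable cluster.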

The donut charts show the different features from outside to inside, with the share of each feature value distinguished by a different colour within the same circle. Each cluster consists of four circles, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the share of all website defacement cases collected from the zone-h.org site that a cluster contains. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, similar patterns were found among the clusters. The most conspicuously similar clusters were 4 and 7, which share the feature of using Arabic and Chinese, a feature of attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process contribute to the concretization of the similarity between new and existing cases. Of course, if a large number of new cases flow into the database and the clustering process is performed again on the dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. Considering that the assumption that attackers tend to use similar or unique attack methods is not always valid, it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present


Table 2: Value and the weight for the similarity score by the case vector. All of the values of the similarity score are normalized to lie between 0 and 1.

Case vector | Weight | Impact | The similarity measure between a new case and existing cases | Value
Encoding | 0.5 | High | - | 0 or 1
IP address | 0.2 | Medium | If the same (e.g., 143.248.1.6 and 143.248.1.6) | 1
 | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 143.248.1.6 and 143.248.1.8) | 0.75
 | | | If the 1st and 2nd octets are matched (e.g., 143.248.1.6 and 143.248.4.4) | 0.5
 | | | Only the 1st octet is matched (e.g., 143.248.1.6 and 143.13.2.4) | 0.25
 | | | No common octet (e.g., 143.248.1.6 and 163.13.2.5) | 0
Domain | 0.15 | Medium | An identical domain | 1
 | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8
 | | | gTLD and ccTLD are matched | 0.3
 | | | Service name is matched | 0.1
 | | | ccTLD is matched | 0.1
 | | | gTLD is matched | 0.1
 | | | Nonidentical domain | 0
Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1
 | | | Period of about 18 months back and forth (3 years) | 0.75
 | | | Period of about 30 months back and forth (5 years) | 0.5
 | | | Period of about 42 months back and forth (7 years) | 0.25
 | | | Over a period of about 42 months (over 7 years) | 0
OS | 0.05 | Low | - | 0 or 1
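As a worked illustration of equation (1) with the weights in Table 2, consider a hypothetical pair of cases (not drawn from the dataset) in which the encoding matches, the first two IP octets match, only the gTLD and ccTLD of the domain match, the attack dates are less than one year apart, and the OS differs. The per-vector values are then 1, 0.5, 0.3, 1, and 0, respectively, which gives

$$\text{Similarity score} = (1 \times 0.5) + (0.5 \times 0.2) + (0.3 \times 0.15) + (1 \times 0.1) + (0 \times 0.05) = 0.745$$

Because the weights sum to 1, two identical cases obtain the maximum score of 1, and the encoding alone contributes half of the attainable score.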

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

Cluster number | Ratio (%) | Description | Representative hacker (group)
0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li
1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S
2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage the network and nonprofit organizations in Western Europe. | JaMaYcKa
3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk
4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org
5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN
6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps and is depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 3: Continued.

Cluster number | Ratio (%) | Description | Representative hacker (group)
7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx
8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk
9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7
10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive
11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). Legend entries: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, and Chinese); gTLD (com, net, org, gov, edu, and mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, and Southeast Asia); OS (Windows, Linux, Unix, and MacOS).


$$R_k = \frac{\text{Count}\left(\text{Cases}_{m,k}\right)}{\text{Count}\left(\text{Cases}_{\text{all},k}\right)} \tag{3}$$

where "m" denotes the past cases that are within the defined scope for a randomly selected hacker "k."

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and subsequently labelled each previous attack case with the hacker's name

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB)

(iv) Step 4, scoring: the similarity scores, computed from the values and weights of the case vector (see Table 2), were sorted in descending order. Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker were within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent)

$$N_{\text{Scope}} = \frac{\text{Count}\left(\text{Cases}_{\text{Scope},K}\right)}{\text{Count}\left(\text{Cases}_{\text{all},K}\right)} \times 100 \tag{2}$$
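The scoring check in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name is hypothetical, the ranking is assumed to be already sorted by descending similarity score with zero-score cases removed, and the size of the top N scope is rounded up, which the text does not specify.

```python
import math


def ratio_in_top_scope(ranked_hackers, hacker, n_percent):
    """Fraction of `hacker`'s past cases that fall within the top N percent.

    ranked_hackers: hacker names of all scored cases, ordered by
                    descending similarity score.
    n_percent:      size of the top N scope, in percent of all scored cases.
    """
    total = sum(1 for name in ranked_hackers if name == hacker)
    if total == 0:
        return 0.0
    # Number of cases inside the top N scope (rounding up is an assumption).
    cutoff = math.ceil(len(ranked_hackers) * n_percent / 100)
    in_scope = sum(1 for name in ranked_hackers[:cutoff] if name == hacker)
    return in_scope / total
```

Under this reading, a hacker counts as identified for a given criterion when the returned ratio reaches the chosen threshold for R (from 50 percent to 100 percent) within the chosen top N scope (from 1 percent to 30 percent).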

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result stemming from the similarity measurement (i.e., the hacker's past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and therefore the cases were within the range of the top N scope (see Figure 9). Also, in the case of the hacking case ratio of a randomly

Figure 7: The developed testing procedures from Step 1 to Step 4. Panels: Step 1, selection (100 hackers randomly selected from the database); Step 2, case labelling; Step 3, case extraction (a retrieved case, i.e., the most recent case, compared with the cases-centric DB); Step 4, scoring (list with the columns hacker name, date, encoding, IP address, domain, OS, and score).


selected hacker, some part of the past attack cases of that hacker (i.e., ratio R) was within the defined N scope (see Figure 9).

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope among the 100 randomly selected hackers. On the other hand, the lower the ratio R and the wider the N scope, the higher the number of identified hackers in the top N scope. Consequently, because hackers or hacking groups that attacked only the same or similar targets were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. With such a reduction in the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the top three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the

Figure 9: Scoring step on the top N scope and the ratio R. The scoring list of Step 4 is annotated with the top N scope (1 to 30 percent) and the ratio R (50 to 100 percent).

Figure 8: The number of website defacement attack cases in the past for each hacker. X-axis: hacker (0 to 100); Y-axis: number of cases (0 to 5000).


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI," the word written on the KBS homepage, means the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) took on characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach is consistent with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases (i.e., input values), we considered a direct comparison between the DS and SPE cases for the similarity score to be inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in an analysis report of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through a comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the top three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages, based on the Latin language system, and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers. Axes: criterion of the top N, from Top 1 to Top 30 (x), and number of identified hackers (y); series: ratio of the attack cases, from 50 to 100.


as the same cluster (Cluster 7), where the hackers of Cluster 7 used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores from the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole case set. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the average of random samples extracted from a finite population. To build the distribution, the similarity score of two randomly selected website defacement cases was calculated repeatedly, 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and all measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and all measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
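The random-pair baseline described above can be sketched as follows. This is an illustration under stated assumptions: `sample_pair_scores`, `toy_similarity`, and the toy case tuples are hypothetical stand-ins, and the placeholder similarity (fraction of matching fields) is not the weighted measure of equation (1).

```python
import random
import statistics

def sample_pair_scores(cases, similarity, n_pairs=10000, seed=0):
    """Score n_pairs randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(cases, 2)  # two distinct cases, without replacement
        scores.append(similarity(a, b))
    return scores

def toy_similarity(a, b):
    """Placeholder measure: fraction of matching fields in two equal-length tuples."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical cases as (encoding, gTLD, OS) tuples
toy_cases = [
    ("Windows-1252", "com", "Windows"),
    ("Windows-1252", "gov", "Windows"),
    ("GB2312", "com", "Linux"),
    ("ISO-8859-9", "net", "Windows"),
]
scores = sample_pair_scores(toy_cases, toy_similarity, n_pairs=1000)
print(statistics.mean(scores), statistics.pvariance(scores))
```

The mean and variance of such a sample give a reference level against which the top scores of a specific case can be judged.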

Figure 13 shows the distribution of the similarity score for the randomly selected 100 hackers mentioned in Section 4.1. To obtain the mean similarity score for each hacker, we calculated the similarity score from the hacker’s own past cases. The cases used here are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.
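One way to read the per-hacker mean is as the average similarity over all pairs of a hacker's own past cases. The sketch below makes that assumption explicit; the pairwise interpretation, the function names, and the toy data are hypothetical and not taken from the paper.

```python
from itertools import combinations
from statistics import mean

def mean_self_similarity(cases_by_hacker, similarity):
    """Mean pairwise similarity among each hacker's own cases.

    Hackers with fewer than two cases are skipped because no pair exists.
    """
    result = {}
    for hacker, cases in cases_by_hacker.items():
        pairs = list(combinations(cases, 2))
        if pairs:
            result[hacker] = mean(similarity(a, b) for a, b in pairs)
    return result

def field_match_ratio(a, b):
    """Placeholder measure: fraction of matching fields in two equal-length tuples."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical past cases as (encoding group, gTLD) tuples
cases_by_hacker = {
    "hacker_x": [("a", "com"), ("a", "com"), ("a", "net")],
    "hacker_y": [("b", "org")],
}
means = mean_self_similarity(cases_by_hacker, field_match_ratio)
print(means)
```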

Figure 11: A snippet of website defacement cases comparing examples from the DS and SPE cases: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case vector | Retrieved case: DarkSeoul (DS) | Tested case (notifier): Hmei7 | Tested case (notifier): d3b_X | Tested case (notifier): StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case vector | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case (notifier): Oaddah | Tested case (notifier): MTRiX | Tested case (notifier): EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers’ attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009, respectively. Their major targets were com (or co) and org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. To describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram (Figure 14), a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the three hackers whose website defacement cases were most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of a crime by laying out multidimensional evidence that is intricately entangled or hidden and therefore hard to present.
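The flow thickness in such a diagram can be derived by counting cases per combination of axis values. The following sketch shows only this aggregation step, not the rendering; the field names, the `flow_weights` helper, and the example records are assumptions made for illustration.

```python
from collections import Counter

# Axis order follows the diagram description: hacker, year, ccTLD, gTLD
AXES = ("hacker", "year", "cctld", "gtld")

def flow_weights(cases, axes=AXES):
    """Count cases per combination of axis values (one count per flow band)."""
    return Counter(tuple(case[axis] for axis in axes) for case in cases)

# Hypothetical defacement records; not taken from the dataset
cases = [
    {"hacker": "h1", "year": 2013, "cctld": "kr", "gtld": "com"},
    {"hacker": "h1", "year": 2013, "cctld": "kr", "gtld": "com"},
    {"hacker": "h1", "year": 2014, "cctld": "de", "gtld": "gov"},
    {"hacker": "h2", "year": 2014, "cctld": "de", "gtld": "com"},
]
flows = flow_weights(cases)
for path, count in flows.most_common():
    print(path, count)
```

Dividing each count by a hacker's total number of cases would give attack rates comparable to those reported in Table 5.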

5. Limitations and Discussion

The CBR algorithm has the disadvantage that performance may degrade if the properties describing a case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system. Axes: similarity score (x) and frequency (y). Panel (a) is annotated with Mean = 0.2930 and Var = 0.0866; panel (b) marks the highest similarity score of 0.69 for the DarkSeoul case and 0.615 for the Sony Pictures Entertainment case.

Figure 13: Distribution of the similarity score for randomly selected 100 hackers. Axes: mean value of the similarity score (x) and frequency (y).


improve the accuracy and usability of our proposed methodology. However, at the time of writing, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept, and we focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. For example, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites’ resources were sanitized through data parsing and data cleaning. Based on this large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for use in case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data. For some prime incidents, data-driven

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker’s attack rate (%).

| Domain | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Alluvial diagram axes: Hacker, Year, ccTLD, gTLD, and Attack.


and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare for verification (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates analysis and reduces the time required to search for possible cybercrime suspects. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with various other data sources should be considered. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the constituent elements of a crime and narrow the scope of investigation. Furthermore, the proposed methodology can be extended to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we plan to classify and systematize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.


[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


Page 7: CBR-Based Decision Support Methodology for Cybercrime

address and domain, it gives clues related to the victim’s location and position. Furthermore, the attack date gives clues to the relation between the attacker and the victim. The detailed explanation of key features is provided in Table 1.

The normalization result of various feature elements stored in the raw form of the HTML source is presented in Figure 5. For encoding, the ISO series and MS Windows series were normalized according to the encoding used in each region or country. The gTLD was normalized according to groups or organizations with similar characteristics. The ccTLD was normalized by continent. Although compressing and normalizing features makes analyses such as clustering and similarity measurement simple and clear, it may also cause loss of information in the original data or make detailed analysis more difficult.
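Normalization of this kind can be sketched as a lookup from raw values to group labels. The tables below are a small illustrative subset: the mapping of go and gob to gov follows the example given for the gTLD feature in Table 1, while the remaining entries, the group labels, and the `normalize` helper are assumptions for this sketch rather than the authors' full mapping.

```python
# Illustrative subset only; the complete mapping is the one summarized in Figure 5
GTLD_GROUPS = {
    "gov": "gov", "go": "gov", "gob": "gov",
    "edu": "edu", "ac": "edu",
    "com": "com", "co": "com",
    "org": "org", "or": "org",
}

# Assumed language groups for a few encodings
ENCODING_GROUPS = {
    "euc-kr": "Korean", "iso-2022-kr": "Korean",
    "gb2312": "Chinese", "gbk": "Chinese", "gb18030": "Chinese",
}

def normalize(value, table, default="unknown"):
    """Map a raw feature value to its group label, ignoring case and whitespace."""
    return table.get(value.strip().lower(), default)

print(normalize("GOB", GTLD_GROUPS))         # gov
print(normalize("EUC-KR", ENCODING_GROUPS))  # Korean
print(normalize("xyz", GTLD_GROUPS))         # unknown
```

Returning an explicit "unknown" label, instead of raising an error, keeps unmapped values visible so the information loss mentioned above can be audited.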

3.3. Reasoning Engine. In the reasoning process, the reasoning engine first performs a similarity search based on CBR. Discrete similarity scores are defined to calculate the distance of nominal data (e.g., IP address and domain). Algorithm 1 shows how the similarity module operates by comparing a retrieved website defacement case with all cases in the cases-centric DB on a case-by-case basis. Subsequently, the reasoning engine evaluates the similarity score between the given new attack case vector and the vectors of other attack cases. Next, the reasoning engine performs clustering to group abstracted crime cases into classes of similar crime cases. In crime investigation, a cluster of similar crime cases helps to infer crime patterns and speeds up solving a crime, through a better understanding of complicated relationships and a more timely response. In the present study, we implemented the reasoning engine with two processing entities: the similarity measure processing and the clustering algorithm processing (see below for further details).
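As an illustration of a discrete similarity score for nominal data, the IP address rule of Algorithm 1 assigns 1.0, 0.75, 0.5, 0.25, or 0.0 depending on how many leading octets match. A minimal sketch, assuming dotted-decimal IPv4 strings; the function name and the example addresses (taken from documentation ranges) are hypothetical.

```python
def ip_similarity(ip_rc, ip_tc):
    """Discrete IP similarity: 0.25 for each matching leading octet."""
    matched = 0
    for octet_rc, octet_tc in zip(ip_rc.split("."), ip_tc.split(".")):
        if octet_rc != octet_tc:
            break  # only an unbroken prefix of equal octets counts
        matched += 1
    return matched * 0.25

print(ip_similarity("192.0.2.10", "192.0.2.10"))    # 1.0
print(ip_similarity("192.0.2.10", "192.0.2.99"))    # 0.75
print(ip_similarity("192.0.2.10", "198.51.100.1"))  # 0.0
```

A later octet that happens to match after a mismatch contributes nothing, which mirrors the nested conditions of the algorithm.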

3.3.1. Similarity Measure. As the similarity measure based on the CBR algorithm, we proposed a similarity algorithm that compares a retrieved website defacement case with all cases in the cases-centric DB. To begin with, if one retrieved case (RC, a new case) is given and there are “n” cases in the cases-centric DB (TCs, all cases in the cases-centric DB), the comparison between RC and the TCs is conducted “n” times. We defined the extent of similarity between RC and a TC as a numerical value from “0” to “1,” where “0” means that RC and TC are unrelated and “1” means that RC and TC are identical. The similarity score (0 < S < 1) specifies the extent of similarity between RC and TC. The closer the similarity score is to “1,” the more analogous RC and TC are to each other. In the event of multiple case vectors, similarity can be expressed as a weighted sum of case vectors.

Table 1: Case vector design highlighting two groups of features.

| Case vector | Used in process: S | Used in process: C | Description |
|---|---|---|---|
| Encoding | O | O | It is used to represent the different types of language information on the computer. It determines the usable characters and the methods to express them. The feature was normalized based on MS Windows and the ISO character set. |
| IP address | O | N/A | A unique number that allows devices on the network to identify and communicate with each other. |
| Domain: service name | O | N/A | The service name is individually made with a different name depending on the service categories such as gTLD or ccTLD. |
| Domain: gTLD | O | O | The gTLD feature was normalized depending on the element having the same meaning (e.g., the go, gob, and gobr features were normalized into gov). |
| Domain: ccTLD | O | O | The ccTLD is a unique code assigned to the domain name that represents the country, specific region, or an international organization. The ccTLD normalized by the continent is used in the clustering process, and the original ccTLD is used in the similarity process. |
| Date | O | N/A | The attack date performed by the hacker or the hacking group. |
| OS | O | O | A part of a computer system that manages all hardware and software (e.g., Windows, Linux, and UNIX). |

S: similarity measure; C: clustering processing.


$$\text{Similarity score} = \sum_{i=1}^{cv} \left[ \operatorname{distance}\left(RC_{cv}, TC_{cv}\right) \times \mathrm{weight}_{cv} \right], \tag{1}$$

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).
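Equation (1) can be read as a weighted sum of per-feature similarity values. The sketch below assumes each per-feature value already lies between 0 and 1; the numeric weights are hypothetical placeholders chosen to sum to 1 and are not values reported in the paper.

```python
# Hypothetical weights; chosen to sum to 1 so the score stays within [0, 1]
WEIGHTS = {"encoding": 0.4, "ip": 0.2, "domain": 0.2, "date": 0.1, "os": 0.1}

def similarity_score(feature_similarities, weights=WEIGHTS):
    """Weighted sum of per-feature similarity values, following equation (1)."""
    return sum(weights[name] * feature_similarities[name] for name in weights)

# Example: same encoding and OS, three matching IP octets, partial domain match
example = {"encoding": 1.0, "ip": 0.75, "domain": 0.3, "date": 0.0, "os": 1.0}
print(round(similarity_score(example), 3))
```

Keeping the weights in a single dictionary makes it straightforward to revise them when the weighting scheme is updated.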

There are various approaches to setting the weight of the case vector, such as the heuristic method, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be updated periodically so that they reflect recent attack trends. For the initial setting, however, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since encoding makes it possible to infer static location information about the attacker, we defined encoding as high-quality information. IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to a value from 0 to 1. Obviously, these values can be subjective; to prevent this subjective bias, they should be acquired and thoroughly reviewed by several experts. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers’ viewpoint. The quantitative method for setting and

Figure 5: Normalization of each feature element. Panels: Encoding, gTLD, ccTLD, and OS.


Input: TCs (Tested_DB)   /* The Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC}   /* RC means one of the retrieved cases */
       W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score

TC {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
while RC in TCs do
    if Encoding_RC = Encoding_TC then
        Encoding_similarity_value ← 1.0
    else
        Encoding_similarity_value ← 0.0
    end
    IP_RC {Octet A_RC, Octet B_RC, Octet C_RC, Octet D_RC}; IP_TC {Octet A_TC, Octet B_TC, Octet C_TC, Octet D_TC}
    if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) and (Octet D_RC = Octet D_TC) then
        IP_similarity_value ← 1.0
    else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) then
        IP_similarity_value ← 0.75
    else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) then
        IP_similarity_value ← 0.5
    else if (Octet A_RC = Octet A_TC) then
        IP_similarity_value ← 0.25
    else
        IP_similarity_value ← 0.0
    end
    Domain_RC {ServiceName_RC, gTLD_RC, ccTLD_RC}; Domain_TC {ServiceName_TC, gTLD_TC, ccTLD_TC}
    if an identical domain then
        Domain_similarity_value ← 1.0
    else if (ServiceName_RC = ServiceName_TC) and ((gTLD_RC = gTLD_TC) or (ccTLD_RC = ccTLD_TC)) then
        Domain_similarity_value ← 0.8
    else if (gTLD_RC = gTLD_TC) and (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.3
    else if (ServiceName_RC = ServiceName_TC) then
        Domain_similarity_value ← 0.1
    else if (ccTLD_RC = ccTLD_TC) then
        Domain_similarity_value ← 0.1
    else if (gTLD_RC = gTLD_TC) then
        Domain_similarity_value ← 0.1
    else
        Domain_similarity_value ← 0.0
    end
    Date_variance ← |Date_RC − Date_TC|   /* It converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a numeric day count */
    if 0 ≤ Date_variance ≤ 365 then
        Date_similarity_value ← 1.0
    else if 365 < Date_variance ≤ 1095 then
        Date_similarity_value ← 0.75
    else if 1095 < Date_variance ≤ 1825 then
        Date_similarity_value ← 0.5
    else if 1825 < Date_variance ≤ 2555 then
        Date_similarity_value ← 0.25
    else if 2555 < Date_variance then
        Date_similarity_value ← 0.0
    end
    if OS_RC = OS_TC then
        OS_similarity_value ← 1.0
    else
        OS_similarity_value ← 0.0
    end
    Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
    return Similarity score between RC and TC
end while

ALGORITHM 1: Similarity measure module.
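The final weighted-sum step of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are those listed in Table 2, whereas the names `WEIGHTS` and `similarity_score` and the sample per-feature values are hypothetical and are not taken from the original implementation.

```python
# Weights per case vector, as listed in Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def similarity_score(feature_similarity, weights=WEIGHTS):
    """Return the weighted sum of per-feature similarity values (each in [0, 1])."""
    for name, value in feature_similarity.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} similarity must lie in [0, 1]")
    # A feature missing from the input contributes nothing to the score.
    return sum(w * feature_similarity.get(name, 0.0) for name, w in weights.items())


# Hypothetical comparison: same encoding and OS, only the first IP octet shared,
# only the gTLD shared, and attack dates within one year of each other.
example = {"encoding": 1.0, "ip": 0.25, "domain": 0.1, "date": 1.0, "os": 1.0}
score = similarity_score(example)
```

Because the weights sum to 1, two cases that agree on every case vector receive a score of 1, and the encoding alone can contribute at most half of the total.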


updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

The distance between some case vectors cannot be estimated directly, as they contain mixed numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. The IP address space is composed of combinations of four numeric octets separated by ".". In the present study, we checked whether the octets of RC and TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are listed in Table 2. The proposed approach is advantageous in that it enables an efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www
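The octet comparison described above can be sketched as follows. The function name `ip_similarity` is an assumption made for this example; the returned values follow the IP address rows of Table 2, and the addresses in the usage lines are drawn from reserved documentation ranges rather than from the dataset.

```python
def ip_similarity(ip_rc, ip_tc):
    """Discrete similarity based on how many leading octets two IPv4 addresses share."""
    octets_rc = [int(part) for part in ip_rc.split(".")]
    octets_tc = [int(part) for part in ip_tc.split(".")]
    if len(octets_rc) != 4 or len(octets_tc) != 4:
        raise ValueError("expected dotted IPv4 addresses with four octets")
    matched = 0
    for a, b in zip(octets_rc, octets_tc):
        if a != b:
            break  # only consecutive matches from the 1st octet onward count
        matched += 1
    return matched * 0.25  # 0, 0.25, 0.5, 0.75, or 1.0


same_subnet = ip_similarity("192.0.2.10", "192.0.2.99")
unrelated = ip_similarity("192.0.2.10", "198.51.100.10")
```

Counting only leading octets mirrors the nested conditions of Algorithm 1: a match in the 3rd octet is ignored unless the 1st and 2nd octets also match.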

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, the ccTLD, and the service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, .com and .co are used for commercial companies or organizations; .org and .or are used for nonprofit organizations; and .go and .gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique sign that represents a specific region, such as .kr, .cn, .br, and .uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name reflects the characteristics of the group, organization, or corporation that the gTLD designates. The service name varies widely depending on the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as listed in Table 2.
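A minimal sketch of the domain rule is given below, assuming the domain has already been split into its three parts. The `Domain` tuple, the helper `_same`, and the decision to treat a missing gTLD or ccTLD as a non-match are assumptions made for this illustration; only the rule order and the values come from Table 2 and Algorithm 1.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    service_name: str
    gtld: Optional[str]   # e.g. "com", "co", "org"; None when absent
    cctld: Optional[str]  # e.g. "kr", "br"; None when absent


def _same(a, b):
    # Assumption: a missing part never counts as a match.
    return a is not None and a == b


def domain_similarity(rc, tc):
    """Discrete domain similarity following the rule order of Table 2."""
    service = _same(rc.service_name, tc.service_name)
    gtld = _same(rc.gtld, tc.gtld)
    cctld = _same(rc.cctld, tc.cctld)
    if rc == tc:
        return 1.0  # identical domain
    if service and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if service or cctld or gtld:
        return 0.1
    return 0.0


same_suffix = domain_similarity(Domain("alpha", "co", "kr"), Domain("beta", "co", "kr"))
```

The rules are evaluated from the most specific to the least specific, so a pair that satisfies several conditions receives the highest applicable value.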

Furthermore, we defined the attack date similarity. As in an offline criminal investigation, if the times at which crimes occurred are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and the existing cases. As listed in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, the similarity values of the IP address, the domain, and the attack date were each set to a value between 0 and 1 according to the degree of similarity within the corresponding range.
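The date rule can be illustrated with the day-count thresholds used in Algorithm 1 (365, 1095, 1825, and 2555 days). The function name `date_similarity` and the sample dates are hypothetical.

```python
from datetime import date


def date_similarity(date_rc, date_tc):
    """Map the gap in days between two attack dates to a discrete similarity value."""
    gap = abs((date_rc - date_tc).days)
    if gap <= 365:
        return 1.0
    if gap <= 1095:
        return 0.75
    if gap <= 1825:
        return 0.5
    if gap <= 2555:
        return 0.25
    return 0.0


within_a_year = date_similarity(date(2015, 1, 1), date(2015, 12, 31))
two_years_apart = date_similarity(date(2015, 1, 1), date(2017, 1, 1))
```

Taking the absolute difference makes the measure symmetric, so it does not matter which of the two cases occurred first.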

3.3.2. Clustering Processing. Merely sorting the data and analysing them visually makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that captures the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not fix the number of clusters as a parameter but automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that a data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to 80,948 data items, out of 212,093, that carried the encoding, gTLD, ccTLD, and OS information needed for clustering. The character encoding was normalized into groups of related code units (ISO-8859, MS Windows character sets, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them. Therefore, the service name is not suitable as a similarity measure of the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
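To make the clustering step more concrete, the following sketch runs EM on a mixture of independent categorical features (a latent class model). It is not the tool used in this study: the number of clusters is fixed by the caller rather than chosen by cross-validation, and the function name `em_categorical`, the smoothing constant `alpha`, the random initialization, and the toy rows are all assumptions made for illustration.

```python
import math
import random


def em_categorical(rows, k, iters=50, seed=0, alpha=1.0):
    """EM for a mixture of independent categorical features.

    rows: list of equal-length tuples of category labels.
    Returns (mixing weights, per-cluster feature distributions, responsibilities).
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    values = [sorted({r[j] for r in rows}) for j in range(d)]
    # Start from random soft assignments; each row's responsibilities sum to 1.
    resp = []
    for _ in rows:
        raw = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(raw)
        resp.append([x / total for x in raw])
    pi, theta = [], []
    for _ in range(iters):
        # M-step: smoothed mixing weights and per-feature category probabilities.
        sizes = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        pi = [(s + alpha) / (n + alpha * k) for s in sizes]
        theta = []
        for c in range(k):
            dists = []
            for j in range(d):
                counts = {v: alpha for v in values[j]}
                for i, r in enumerate(rows):
                    counts[r[j]] += resp[i][c]
                total = sum(counts.values())
                dists.append({v: cnt / total for v, cnt in counts.items()})
            theta.append(dists)
        # E-step: posterior probability of each cluster for each row (log domain).
        for i, r in enumerate(rows):
            logs = [
                math.log(pi[c]) + sum(math.log(theta[c][j][r[j]]) for j in range(d))
                for c in range(k)
            ]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp[i] = [w / total for w in weights]
    return pi, theta, resp


# Toy rows of (encoding group, gTLD, ccTLD region, OS); values are illustrative only.
toy_rows = [("Arabic", "com", "WesternEurope", "Windows")] * 6 + [
    ("CentralEurope", "org", "SouthAmerica", "Linux")
] * 6
mixing, distributions, responsibilities = em_categorical(toy_rows, k=2)
```

Each row of `responsibilities` gives the probability that the corresponding case belongs to each cluster, which corresponds to the soft membership described in the text.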

The donut charts show the different features from outside to inside, with the share of each feature value marked by a different colour within the same circle. Each cluster consists of four circles, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents how many cases one cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, similar patterns were found among the clusters. The most conspicuously similar clusters were 4 and 7, which shared the features of using Arabic and Chinese and of attacking industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help make the similarity between new and existing cases concrete. Of course, if a large number of new cases flow into the database and the clustering process is performed again on the dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. Because the assumption that attackers tend to use similar or unique attack methods is not always valid, it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance; in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present


Table 2: Values and weights for the similarity score by case vector. All values of the similarity score are normalized to between 0 and 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g., 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage the network and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology using the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps, which are depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with the features encoding, gTLD, ccTLD, and OS and their corresponding shares (legend on the right side).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where "m" means the past cases which are within the defined scope concerning a randomly selected hacker "k".

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each of these cases with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores, computed from the values and weights of the case vector (see Table 2), were sorted in descending order. Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases within the range of the top N scope were thereby determined (see Figure 9). Also, in the case of the hacking case ratio of a randomly

Figure 7: The developed testing procedures from Step 1 to Step 4.


selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).
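The scoring check of Step 4 can be sketched as follows. This is one possible reading of the procedure, not the original evaluation code: the ranking drops cases with a similarity of 0, the top N percent is taken over the remaining ranked list, and the ratio is computed against all past cases of the hacker. The function names `ratio_in_top_n` and `is_identified` and the toy scores are hypothetical.

```python
import math


def ratio_in_top_n(scored_cases, hacker, top_percent):
    """Share of a hacker's past cases that fall within the top N percent of the ranking.

    scored_cases: list of (hacker_name, similarity_score) pairs, one per compared case.
    """
    # Cases with a similarity of 0 are not displayed on the scoring list.
    ranked = sorted((c for c in scored_cases if c[1] > 0), key=lambda c: c[1], reverse=True)
    cutoff = math.ceil(len(ranked) * top_percent / 100)
    in_scope = sum(1 for name, _ in ranked[:cutoff] if name == hacker)
    total = sum(1 for name, _ in scored_cases if name == hacker)
    return in_scope / total if total else 0.0


def is_identified(scored_cases, hacker, top_percent, ratio_threshold):
    """True when at least `ratio_threshold` of the hacker's cases lie in the top N scope."""
    return ratio_in_top_n(scored_cases, hacker, top_percent) >= ratio_threshold


# Toy ranking: hacker "A" has three past cases, the rest belong to "B".
toy_scores = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("B", 0.6), ("B", 0.5),
              ("B", 0.4), ("B", 0.3), ("A", 0.2), ("B", 0.1), ("B", 0.0)]
share = ratio_in_top_n(toy_scores, "A", top_percent=30)
```

In this toy ranking, two of the three cases of hacker "A" fall within the top 30 percent, so "A" counts as identified at a ratio criterion of 50 percent but not at 70 percent.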

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope among the randomly selected 100 hackers. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because hackers or hacking groups that attacked only the same or similar targets were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hackers or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the

Figure 9: Scoring step on the top N scope (1 to 30) and the ratio R (50 to 100).

Figure 8: The number of website defacement attack cases in the past of each hacker (x-axis: hacker; y-axis: number of cases).


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-Calaveras image (i.e., skull image) used in LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI" was the word written on the KBS homepage, meaning the front line of the Roman troops, hinting that the DS cyberattack could be a starting point; rather than a transient attack, it was a persistent one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. With a sufficiently large hacker profiling source, however, such abundant data could further support and enhance the accuracy of inference. Figure 11 shows screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. In terms of colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) resembled the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities of the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach is consistent with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, DS and SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases given as input values, we considered a direct comparison between the DS and SPE cases for the similarity score to be inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through a comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers Oaddah, MTRiX, and EL_MuHaMMeD were all classified

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (x-axis: criterion of the top N (%), from top 1 to top 30; y-axis: number of identified hackers; series: ratio of the attack cases (%), from 50 to 100).


as the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole set of cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the average of random samples extracted from a finite population. To build the distribution, the calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
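The random-pair baseline can be reproduced in outline with the sketch below. The function `random_pair_baseline`, the stand-in scorer `field_overlap`, and the toy cases are assumptions made for this example; in practice the scorer would be the weighted similarity measure of Table 2 applied to the cases-centric DB.

```python
import random
import statistics


def random_pair_baseline(cases, score_fn, trials=10_000, seed=0):
    """Mean and variance of the similarity score over randomly drawn pairs of cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        a, b = rng.sample(cases, 2)  # two distinct cases per trial
        scores.append(score_fn(a, b))
    return statistics.mean(scores), statistics.pvariance(scores)


def field_overlap(a, b):
    # Stand-in scorer: fraction of identical fields (not the weighted measure of Table 2).
    return sum(x == y for x, y in zip(a, b)) / len(a)


toy_cases = [("Windows-1252", "com", "Windows"), ("GB2312", "com", "Windows"),
             ("GB2312", "org", "Linux"), ("ISO-8859-9", "net", "Linux")]
baseline_mean, baseline_var = random_pair_baseline(toy_cases, field_overlap, trials=1000, seed=1)
```

Comparing the top score of a new case against such a baseline indicates whether that score is high relative to what two arbitrary cases would typically obtain.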

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. The cases used for the similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Field | Retrieved case (case name): DarkSeoul (DS) | Tested case (notifier): Hmei7 | Tested case (notifier): d3b_X | Tested case (notifier): StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Field | Retrieved case (case name): Sony Pictures Entertainment (SPE) | Tested case (notifier): Oaddah | Tested case (notifier): MTRiX | Tested case (notifier): EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalromcom | digitalairstrip.net |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator understand the flow and core of the crime clearly by laying out multidimensional evidence that is otherwise entangled or hidden.
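The attack rates used in the follow-up investigation are simple shares per feature value. A minimal sketch, assuming each case is a dictionary with hypothetical keys such as `year` and `gtld`:

```python
from collections import Counter


def attack_share(cases, key):
    """Percentage of a hacker's cases per value of `key`, rounded to two decimals."""
    counts = Counter(case[key] for case in cases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(100 * n / total, 2) for value, n in counts.items()}


# Toy cases for a single hacker; the values are illustrative only.
toy_hacker_cases = [
    {"year": 2013, "gtld": "com"},
    {"year": 2014, "gtld": "com"},
    {"year": 2014, "gtld": "gov"},
    {"year": 2014, "gtld": "com"},
]
by_year = attack_share(toy_hacker_cases, "year")
by_gtld = attack_share(toy_hacker_cases, "gtld")
```

Applying the same function per hacker and per feature (attack year, gTLD, ccTLD) yields the kind of rate table that an alluvial diagram can then visualize.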

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing the case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, variance = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and distribution of the similarity value between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system.

Figure 13: Distribution of the similarity score for randomly selected 100 hackers (x-axis: mean value of the similarity score; y-axis: frequency).


improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we sought to demonstrate the practicability of the proposed methodology as a proof of concept, and we therefore focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents can be included in our study. For example, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning. Based on this large real-world dataset, a data-driven analysis for hacker profiling was carried out. To this end, the case vector was designed, and the significant features were chosen for case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents. Data-driven and evidence-driven decision making should be the critical process. Reducing the amount of data and the time to be analysed is also an important factor in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vector value means the hacker's attack rate (in percent; "-" marks no cases).

Domain | DS case: Hmei7 | DS case: d3b_X | DS case: StifLer | SPE case: Oaddah | SPE case: MTRiX | SPE case: EL_MuHaMMeD
Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98
Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91
Net | 3.40 | 3.20 | - | - | 5.46 | 5.74
Gov | 12.16 | 6.51 | - | - | 1.06 | -

Year | DS case: Hmei7 | DS case: d3b_X | DS case: StifLer | SPE case: Oaddah | SPE case: MTRiX | SPE case: EL_MuHaMMeD
2002 | - | - | - | - | 10.74 | -
2003 | - | - | - | - | 89.08 | -
2006 | - | - | - | - | - | -
2007 | 0.09 | - | - | - | 0.18 | -
2008 | - | - | - | - | - | -
2009 | 3.15 | - | - | - | - | 99.57
2010 | 0.09 | - | - | - | - | -
2011 | 0.34 | - | - | - | - | -
2012 | 3.40 | - | - | 100.00 | - | -
2013 | 34.86 | 39.17 | 100.00 | - | - | -
2014 | 58.08 | 59.77 | - | - | - | 0.43
2015 | - | 1.07 | - | - | - | -

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and the SPE case (b). (Alluvial diagrams; columns from left to right: Hacker, Year, ccTLD, gTLD, and Attack.)

Although the obtained results appear sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates the analysis and reduces the time required to search for possible cybercrime suspects. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be considered. In other words, additional online or offline information acquired through human intelligence (HUMINT), different types of signals intelligence (SIGINT), and other sources may also help to reason about the constituent elements of a crime and narrow the scope of the investigation. Furthermore, the proposed methodology can be extended so that its incident information is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are available for only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we plan to classify and systematize the hackers' intents more precisely using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset used in our paper can be downloaded from the following site: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


\[
\text{Similarity score} = \sum_{i=1}^{cv} \left[ \operatorname{distance}\left(RC_{cv}, TC_{cv}\right) \times \mathrm{weight}_{cv} \right], \tag{1}
\]

where cv denotes the case vector (i.e., encoding, IP address, domain, date, and OS).

There are various approaches to setting the weight of the case vector, such as heuristic methods, logistic regression analysis, and attribute weighting methods. Furthermore, these weight values need to be updated periodically so that they reflect recent attack trends. For the initial setting, however, it is difficult to set an exact numerical value for each weight in accordance with the case vector. In our experiment, we set the impact and the weight of each case vector as high, medium, or low according to its importance, so as to concretely categorize the attacker and the victim. Above all, since encoding makes it possible to infer the static location information of the attacker, we defined encoding as high-quality information. IP address and domain were defined as medium-quality information; these case vectors enable the identification and specification of the victim. Finally, the targeted date and OS were defined as low-quality information. To measure clustering and similarity, all values of the case vector, which mix numbers and letters, were normalized to a value from 0 to 1. Since these values can be subjective, they should be acquired and thoroughly reviewed by several experts to prevent subjective bias. This technique can be easily applied using the knowledge of investigation experts and is easy to understand from the researchers' viewpoint. The quantitative method for setting and updating the weight values is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

Figure 5: Normalization of each feature element. (Diagram mapping raw values to normalized categories for four features: Encoding, gTLD, ccTLD, and OS. Encoding categories: Arabic, Baltic, Central Europe, Chinese, Cyrillic, Greek, Hebrew, Japanese, Korean, Southern Europe, Taiwanese, Thailand, Turkish, and West Europe; raw character sets such as ISO-8859-6 and Windows-1256, ISO-2022-KR and EUC-KR, or GB2312, GB18030, and GBK are grouped together. gTLD categories: com, edu, gov, org, biz, mil, net, coop, and info; raw values such as com, co, and int, org and or, gov, go, and gob, or edu and ac are grouped together. ccTLD categories: Africa, Australia, Central Asia, East Asia, Eastern Europe, North America, Northern Europe, South America, South Asia, Southeast Asia, Southern Europe, West Asia, and Western Europe. OS categories: Linux-based OS, MacOS, Unix-based OS, and Windows-based OS.)

ALGORITHM 1: Similarity measure module.

Input: TCs (Tested_DB) /* The Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ← {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC} /* RC means one of the retrieved cases */
       W (Weight) ← {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score
(1) TC {Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ← TCs
(2) while RC in TCs do
(3)   if Encoding_RC = Encoding_TC then
(4)     Encoding_similarity_value ← 1.0
(5)   else
(6)     Encoding_similarity_value ← 0.0
(7)   end
(8)   IP_RC = {Octet A_RC, Octet B_RC, Octet C_RC, Octet D_RC}, IP_TC = {Octet A_TC, Octet B_TC, Octet C_TC, Octet D_TC}
(9)   if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) and (Octet D_RC = Octet D_TC) then
(10)    IP_similarity_value ← 1.0
(11)  else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) and (Octet C_RC = Octet C_TC) then
(12)    IP_similarity_value ← 0.75
(13)  else if (Octet A_RC = Octet A_TC) and (Octet B_RC = Octet B_TC) then
(14)    IP_similarity_value ← 0.5
(15)  else if (Octet A_RC = Octet A_TC) then
(16)    IP_similarity_value ← 0.25
(17)  else
(18)    IP_similarity_value ← 0.0
(19)  end
(20)  Domain_RC = {ServiceName_RC, gTLD_RC, ccTLD_RC}, Domain_TC = {ServiceName_TC, gTLD_TC, ccTLD_TC}
(21)  if an identical domain then
(22)    Domain_similarity_value ← 1.0
(23)  else if (ServiceName_RC = ServiceName_TC) and ((gTLD_RC = gTLD_TC) or (ccTLD_RC = ccTLD_TC)) then
(24)    Domain_similarity_value ← 0.8
(25)  else if (gTLD_RC = gTLD_TC) and (ccTLD_RC = ccTLD_TC) then
(26)    Domain_similarity_value ← 0.3
(27)  else if (ServiceName_RC = ServiceName_TC) then
(28)    Domain_similarity_value ← 0.1
(29)  else if (ccTLD_RC = ccTLD_TC) then
(30)    Domain_similarity_value ← 0.1
(31)  else if (gTLD_RC = gTLD_TC) then
(32)    Domain_similarity_value ← 0.1
(33)  else
(34)    Domain_similarity_value ← 0.0
(35)  end
(36)  Date_variance ← |Date_RC − Date_TC| /* It converts a date in year, month, and day format (i.e., yyyy-mm-dd) into a numeric day count */
(37)  if 0 ≤ Date_variance ≤ 365 then
(38)    Date_similarity_value ← 1.0
(39)  else if 365 < Date_variance ≤ 1095 then
(40)    Date_similarity_value ← 0.75
(41)  else if 1095 < Date_variance ≤ 1825 then
(42)    Date_similarity_value ← 0.5
(43)  else if 1825 < Date_variance ≤ 2555 then
(44)    Date_similarity_value ← 0.25
(45)  else if 2555 < Date_variance then
(46)    Date_similarity_value ← 0.0
(47)  end
(48)  if OS_RC = OS_TC then
(49)    OS_similarity_value ← 1.0
(50)  else
(51)    OS_similarity_value ← 0.0
(52)  end
(53)  Similarity_score ← (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
(54)  return Similarity_score between RC and TC
(55) end while
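To make the weighted sum in equation (1) concrete, the following Python sketch combines five per-feature similarity values with the weights listed in Table 2 (0.5, 0.2, 0.15, 0.1, and 0.05). The function name `similarity_score`, the dictionary keys, and the example values are assumptions chosen for illustration; the authors' actual implementation is not shown in the paper.

```python
# Weights per case vector, as listed in Table 2 of the paper.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def similarity_score(feature_similarities, weights=WEIGHTS):
    """Weighted sum of per-feature similarity values, each expected in [0, 1]."""
    for name, value in feature_similarities.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"similarity for {name!r} must lie in [0, 1]")
    return sum(weights[name] * feature_similarities[name] for name in weights)


# Hypothetical per-feature similarities between a retrieved case and a tested case.
example = {"encoding": 1.0, "ip": 0.5, "domain": 0.3, "date": 0.75, "os": 0.0}
score = similarity_score(example)
print(round(score, 4))
```

Because the five weights sum to 1, the score stays between 0 and 1, and a matching encoding alone already contributes half of the maximum score.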

The distance for some case vectors cannot be estimated directly because they contain mixed numerical and nominal data (such as IP address range and domain name). For this reason, we defined a discrete similarity measure to calculate the distance between nominal data. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. An IP address is composed of a combination of four numeric octets separated by “.”. In the present study, we compared the octets of RC and TC, from the 1st octet to the 4th octet, to check whether they were identical. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values that we suggest for two IP addresses are shown in Table 2. The proposed approach is advantageous in that it enables efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www
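The octet-by-octet comparison can be sketched in Python as follows. The sketch follows the else-if chain of Algorithm 1, so only leading octets that match count (a prefix match). The function name `ip_similarity` and the example addresses, which come from documentation and private address ranges, are assumptions for illustration.

```python
def ip_similarity(ip_rc: str, ip_tc: str) -> float:
    """Return 0.25 for each leading octet shared by two dotted IP addresses."""
    octets_rc = [int(part) for part in ip_rc.split(".")]
    octets_tc = [int(part) for part in ip_tc.split(".")]
    if len(octets_rc) != 4 or len(octets_tc) != 4:
        raise ValueError("expected dotted addresses with four octets")
    matched = 0
    for a, b in zip(octets_rc, octets_tc):
        if a != b:
            break  # stop at the first differing octet (prefix comparison)
        matched += 1
    return matched * 0.25


print(ip_similarity("192.0.2.10", "192.0.2.99"))    # first three octets match
print(ip_similarity("192.0.2.10", "192.168.2.10"))  # only the first octet counts
```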

Meanwhile, the similarity between domains is calculated according to their domain properties. A domain is composed of the gTLD, the ccTLD, and the service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; and go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique code that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name reflects the characteristics of the group, organization, or corporation that the gTLD is intended for. Service names vary widely depending on the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as shown in Table 2.
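The domain rule of Table 2 can be written as a short Python function. This is a sketch under stated assumptions: each domain is already split into a service name, a gTLD, and a ccTLD; a missing part is represented by `None` and never counts as a match; and the function name `domain_similarity` and the example domains are hypothetical.

```python
def _same(a, b):
    """Two parts match only if both are present and equal."""
    return a is not None and a == b


def domain_similarity(rc: dict, tc: dict) -> float:
    """Discrete domain similarity following the rule listed in Table 2."""
    if rc == tc:
        return 1.0  # identical domain
    same_service = _same(rc["service"], tc["service"])
    same_gtld = _same(rc["gtld"], tc["gtld"])
    same_cctld = _same(rc["cctld"], tc["cctld"])
    if same_service and (same_gtld or same_cctld):
        return 0.8
    if same_gtld and same_cctld:
        return 0.3
    if same_service or same_gtld or same_cctld:
        return 0.1
    return 0.0


# Hypothetical domains: example.co.kr versus example.co.jp.
a = {"service": "example", "gtld": "co", "cctld": "kr"}
b = {"service": "example", "gtld": "co", "cctld": "jp"}
print(domain_similarity(a, b))
```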

Furthermore, we defined the attack date similarity. As in offline criminal investigations, if the times at which crimes occurred are close, the cases can be analysed as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the gap between the dates on which the two cases occurred. In summary, the similarity values of the attack IP address, domain, and attack date were each set to a value between 0 and 1 according to the range into which the difference between the two cases falls.
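The date rule can be sketched in Python using the day thresholds of Algorithm 1 (365, 1095, 1825, and 2555 days). The function name `date_similarity` and the example dates are assumptions for illustration only.

```python
from datetime import date


def date_similarity(date_rc: date, date_tc: date) -> float:
    """Map the absolute gap in days between two attack dates to a similarity."""
    gap_days = abs((date_rc - date_tc).days)
    # (upper limit in days, similarity value), as in Algorithm 1.
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap_days <= limit:
            return value
    return 0.0


# Hypothetical attack dates two years apart.
print(date_similarity(date(2010, 1, 1), date(2012, 1, 1)))
```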

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that can capture the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters to be fixed as a parameter; instead, it automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that each data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to the 80,948 data items, out of 212,093, that had encoding, gTLD, ccTLD, and OS information. The character encoding was normalized into groups of related code units (the ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. As for the service name, even if similar combinations of letters or numbers can be found, it is not easy to find commonality or relevance between them; therefore, it is not suitable as a similarity measure for the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
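The general idea of EM clustering over categorical case features can be illustrated with the minimal pure-Python sketch below. It is not the authors' setup: the number of clusters `k` is fixed by hand rather than chosen by cross-validation, each feature is modelled as an independent categorical distribution per cluster, and the function name `em_categorical`, the smoothing constant `alpha`, and the toy rows are all assumptions made for this example.

```python
import math
import random


def em_categorical(rows, k, n_iter=30, seed=0, alpha=0.1):
    """Fit a mixture of independent categorical features with EM.

    rows: list of equal-length tuples of categorical values.
    Returns (cluster weights, per-row cluster responsibilities).
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_features)]

    # Start from random soft assignments of rows to clusters.
    resp = []
    for _ in rows:
        draw = [rng.random() + 0.01 for _ in range(k)]
        resp.append([d / sum(draw) for d in draw])

    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # M-step: re-estimate cluster weights and smoothed value probabilities.
        weights = [sum(r[c] for r in resp) / len(rows) for c in range(k)]
        probs = []
        for c in range(k):
            per_feature = []
            for j in range(n_features):
                counts = {v: alpha for v in values[j]}
                for row, r in zip(rows, resp):
                    counts[row[j]] += r[c]
                total = sum(counts.values())
                per_feature.append({v: n / total for v, n in counts.items()})
            probs.append(per_feature)
        # E-step: posterior probability of each cluster for each row.
        resp = []
        for row in rows:
            joint = [
                weights[c]
                * math.prod(probs[c][j][row[j]] for j in range(n_features))
                for c in range(k)
            ]
            total = sum(joint)
            resp.append([p / total for p in joint])
    return weights, resp


# Hypothetical normalized cases: (encoding, gTLD, OS).
rows = [("Arabic", "com", "Windows")] * 6 + [("Korean", "gov", "Linux")] * 6
weights, resp = em_categorical(rows, k=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels)
```

On this toy input the two groups of identical rows end up in different clusters; with real data, the cluster count would have to be selected separately, for instance by comparing held-out likelihoods.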

The donut charts show the different features from the outside to the inside, with the share of each feature value distinguished by a different colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents the proportion of all website defacement cases collected from the zone-h.org site that a cluster contains. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, some clusters exhibit similar patterns. The most conspicuously similar clusters were Clusters 4 and 7, which shared the use of Arabic and Chinese and attacks against industrial organizations headquartered in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help make the similarity between new and existing cases concrete. Of course, when a large number of new cases flow into the database and the clustering process is run again on the updated dataset, the clustering result may take on a different pattern.

4. Application

41 Experimental Results and Analysis Considering that theassumption that the attackers tend to use similar or uniqueattack methods is not always valid and it is difficult toevaluate the accuracy of the similarity mechanism As timeprogresses attackersrsquo hacking skills advance and in additionthe attack plan campaign purpose and target groups canchange depending on the situation +erefore in the present

10 Security and Communication Networks

Table 2 Value and the weight for the similarity score by the case vector All of the values of the similarity score are normalized to 0 or 1

Case vector Weight Impact +e similarity measure between a new case andexisting cases Value

Encoding 05 High mdash 0 or 1

IP address 02 Medium

If the same (eg 14324816 and 14324816) 1If the 1st 2nd and 3rd octet are matched (eg

14324816 and 14324818) 075

If the 1st and 2nd octet are matched (eg 14324816and 14324844) 05

Only the 1st octet is matched (eg 14324816 and1431324) 025

No common octet (eg 14324816 and 1631325) 0

Domain 015 Medium

An identical domain 1Service name is matched and one of the gTLD and

ccTLD is matched 08

gTLD and ccTLD is matched 03Service name is matched 01

ccTLD is matched 01gTLD is matched 01

Nonidentical domain 0

Date 01 Low

Period of about 6 months back and forth (1 year) 1Period of about 18 months back and forth (3 years) 075Period of about 30 months back and forth (5 years) 05Period of about 42 months back and forth (7 years) 025Over period of about 42 months (over 7 years) 0

OS 005 Low mdash 0 or 1

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage networks and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage networks and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedures unfolded in the following four steps and are depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

[Figure 6 (plots): twelve panels, Clustering 00 through Clustering 11, each with a scale from 0 to 100. Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).]

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where "m" denotes the past cases that are within the defined scope concerning a randomly selected hacker "k."

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and subsequently labelled each of these cases with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores were sorted in descending order according to the value and the weight for the similarity score by the case vector (see Table 2). Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result (i.e., the sorted hacker's past cases with a high similarity score) stemming from the similarity measurement was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and therefore the cases were within the range of the top N scope (see Figure 9). Also, regarding the hacking case ratio of a randomly selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).

[Figure 7 (diagram): Step 1, selection: 100 hackers randomly selected from the database (1 TheBuGz, 2 Hmei7, 3 3xp1r3, ..., 98 S3cure, 99 drm1st3r, 100 Lulz53c). Step 2, case labelling: the past cases (Case 1 through Case m) of each hacker. Step 3, case extraction: a retrieved case (the most recent case) is compared with the cases-centric DB. Step 4, scoring: each record lists hacker name, date, encoding, IP address, domain, OS, and score, where the score is the sum over the case vectors cv of Distance(RC_cv, TC_cv) × Weight_cv.]

Figure 7: The developed testing procedures from step 1 to step 4.
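The scoring and evaluation of Step 4 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the `(hacker, score)` tuple layout, the rounding of the scope size with `ceil`, and the choice to take N percent of the list after zero-score cases are dropped are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (hacker name, similarity score against the query case)


def top_n_scope(scored: List[Scored], n_percent: float) -> List[Scored]:
    """Return the cases inside the top N percent of the scoring list.

    Zero-score cases are dropped first, as in Step 4; taking N percent of
    the remaining list (rounded up) is an assumption of this sketch.
    """
    ranked = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
    size = math.ceil(len(ranked) * n_percent / 100)
    return ranked[:size]


def ratio_r(scored: List[Scored], hacker: str, n_percent: float) -> float:
    """Ratio R: share of the hacker's past cases that fall inside the top N scope."""
    total = sum(1 for name, _ in scored if name == hacker)
    if total == 0:
        return 0.0
    in_scope = sum(1 for name, _ in top_n_scope(scored, n_percent) if name == hacker)
    return in_scope / total


def identified(scored_by_hacker: Dict[str, List[Scored]],
               n_percent: float, r_percent: float) -> int:
    """Count hackers whose ratio R reaches r_percent within the top N scope."""
    return sum(
        1 for hacker, scored in scored_by_hacker.items()
        if 100 * ratio_r(scored, hacker, n_percent) >= r_percent
    )


# Toy scoring list for a query case of hypothetical hacker "A".
example = [("A", 0.9), ("B", 0.8), ("A", 0.7), ("C", 0.6), ("B", 0.5),
           ("C", 0.4), ("C", 0.3), ("B", 0.2), ("C", 0.1), ("A", 0.0)]
# Two of A's three past cases fall in the top 30 percent of the nonzero list.
print(ratio_r(example, "A", 30))
```

Sweeping `n_percent` over the eight top N criteria and `r_percent` over the six ratio criteria with `identified` would produce counts of the kind plotted in Figure 10.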

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope among the 100 randomly selected hackers. On the other hand, the lower the ratio R and the wider the N scope, the higher the number of identified hackers in the top N scope. Consequently, because hackers or hacking groups that attacked only the same or similar objects were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even if the hacking cases were caused by the same hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. With such a reduction in the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI" was the word written on the KBS homepage, meaning the forefront line of the Roman troops, hinting that the DS cyberattack could be a starting point; rather than a transient attack, it was a persistent one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

[Figure 8 (plot): x-axis, hacker (0 to 100); y-axis, number of cases (0 to 5000).]

Figure 8: The number of past website defacement attack cases of each hacker.

[Figure 9 (diagram): the Step 4 scoring list for one hacker (1 TheBuGz), with records listing hacker name, date, encoding, IP address, domain, OS, and score; the top N scope spans 1 to 30 and the ratio R spans 50 to 100.]

Figure 9: Scoring step on the top N scope and the ratio R.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull images, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the collected existing cases in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). Two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases, i.e., as input values, we considered that a direct comparison between the DS and SPE cases for the similarity score was not appropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). The notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages, based on the Latin language system, and typically launched attacks against profit organizations located in Western Europe. The notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), where the hackers of Cluster 7 used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

[Figure 10 (plot): x-axis, criterion of the top N (%): top 1, 3, 5, 10, 15, 20, 25, and 30; y-axis, number of identified hackers (0 to 100); legend, ratio of the attack cases (%): 50, 60, 70, 80, 90, and 100.]

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers.

Next, to ensure the objectivity of the similarity score in the case study of the DS and SPE, we computed the similarity score of randomly selected pairs from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the average of random samples extracted from a finite population. To produce the distribution, the calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and all measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and all measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
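The random-pair baseline described above can be sketched in a few lines of Python. The function name, its parameters, the fixed seed, and the use of the population variance are assumptions for illustration; `sim` stands for any pairwise similarity function, such as one built from the weights in Table 2.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple, TypeVar

C = TypeVar("C")


def random_pair_baseline(cases: Sequence[C], sim: Callable[[C, C], float],
                         trials: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Mean and variance of the similarity between randomly drawn pairs of cases."""
    rng = random.Random(seed)        # fixed seed so the sketch is repeatable
    scores = []
    for _ in range(trials):
        a, b = rng.sample(cases, 2)  # two distinct cases per trial
        scores.append(sim(a, b))
    return statistics.mean(scores), statistics.pvariance(scores)


# Toy usage: the "cases" are plain numbers and the similarity is 1 - |a - b|.
toy_cases = [0.0, 0.25, 0.5, 0.75, 1.0]
mean, var = random_pair_baseline(toy_cases, lambda a, b: 1 - abs(a - b), trials=1000)
print(mean, var)
```

A score for a real case that sits well above the baseline mean (about 0.29 in Figure 12(a)) can then be read as higher than chance, which is the argument made for the DS and SPE scores.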

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by the hacker in the cases-centric DB. The mean value of the similarity scores over the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above the mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in DS and SPE.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case name / notifier | Retrieved case: DarkSeoul (DS) | Tested case: Hmei7 | Tested case: d3b_X | Tested case: StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21.com | http://www.garycheng.com | healthajk.gov.pk | yapikimyasallari.com.tr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case name / notifier | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case: Oaddah | Tested case: MTRiX | Tested case: EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN, GB2312 | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalrom.com | digitalairstrip.net |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers mentioned above in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009. Their major targets were com (or co) and org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure represents the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is otherwise entangled or hidden.
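The attack rates reported in Table 5 and the flows drawn in Figure 14 both reduce to counting cases per hacker. A minimal sketch follows, assuming each case is available as a `(hacker, year, ccTLD, gTLD)` tuple; that layout, the function names, and the toy records are hypothetical and not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Record = Tuple[str, int, str, str]  # (hacker, year, ccTLD, gTLD): assumed layout


def attack_rates(records: List[Record], field: int) -> Dict[str, Dict[Hashable, float]]:
    """Percentage of each hacker's attacks per value of one field.

    field selects the tuple position: 1 = year, 2 = ccTLD, 3 = gTLD.
    """
    counts: Dict[str, Counter] = {}
    for rec in records:
        counts.setdefault(rec[0], Counter())[rec[field]] += 1
    return {
        hacker: {value: round(100 * n / sum(c.values()), 2) for value, n in c.items()}
        for hacker, c in counts.items()
    }


def alluvial_flows(records: List[Record]) -> Counter:
    """Count identical (hacker, year, ccTLD, gTLD) paths; the count is the flow thickness."""
    return Counter(records)


# Toy records for a hypothetical hacker "X".
records = [
    ("X", 2013, "Korea", "com"),
    ("X", 2013, "Korea", "com"),
    ("X", 2014, "Turkey", "gov"),
    ("X", 2014, "Turkey", "com"),
]
print(attack_rates(records, 1))  # share of X's attacks per year
print(attack_rates(records, 3))  # share of X's attacks per gTLD
print(alluvial_flows(records).most_common(1))
```

The flow counts are the input that an alluvial or Sankey plotting tool would need; the plotting itself is left out of the sketch.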

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may be degraded if the properties describing the case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

[Figure 12 (plots): (a) x-axis, similarity score (0.00 to 1.00); y-axis, frequency (0 to 600); annotation: Mean = 0.2930, Var = 0.0866. (b) Panels A and B: x-axis, similarity score (0.00 to 1.00); y-axis, frequency (0 to 40000); annotations: the highest similarity score is 0.69 on the DarkSeoul case and 0.615 on the Sony Pictures Entertainment case; Mean = 0.114, Var = 0.1500; Mean = 0.063, Var = 0.0370.]

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and distribution of the similarity value between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system.

[Figure 13 (plot): x-axis, mean value of the similarity score (0.00 to 0.75); y-axis, frequency (0 to 6).]

Figure 13: Distribution of the similarity score for randomly selected 100 hackers.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. Therefore, we focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. Therefore, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using the CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning processes. Also, based on a large real dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to the case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

| Case vector | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| Domain: Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Domain: Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Domain: Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Domain: Gov | 12.16 | 6.51 | - | - | 1.06 | - |
| Year: 2002 | - | - | - | - | 10.74 | - |
| Year: 2003 | - | - | - | - | 89.08 | - |
| Year: 2006 | - | - | - | - | - | - |
| Year: 2007 | 0.09 | - | - | - | 0.18 | - |
| Year: 2008 | - | - | - | - | - | - |
| Year: 2009 | 3.15 | - | - | - | - | 99.57 |
| Year: 2010 | 0.09 | - | - | - | - | - |
| Year: 2011 | 0.34 | - | - | - | - | - |
| Year: 2012 | 3.40 | - | - | 100.00 | - | - |
| Year: 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| Year: 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| Year: 2015 | - | 1.07 | - | - | - | - |

[Figure 14 (alluvial diagrams): the columns are Hacker, Year, ccTLD, gTLD, and Attack (No/Yes). (a) DS case: hackers d3b~x, Hmei7, and StifLer; years 2009, 2012, 2013, and 2014; ccTLD Australia, Brazil, France, Germany, Indonesia, Italy, Korea, Netherlands, New Zealand, Poland, Russia, Thailand, Turkey, and Unknown; gTLD com, gov, net, org, and Unknown. (b) SPE case: hackers EL_MuHaMMeD, MTRiX, and oaddah; years 2002, 2003, 2009, and 2012; ccTLD Brazil, Canada, Denmark, France, Greece, Hong Kong, Italy, and Unknown; gTLD com, net, org, and Unknown.]

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b).

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with various other data sources should be reviewed. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and sources may also help to reason about the composition requirements of a crime and narrow the scope of investigation. Furthermore, the proposed methodology can be extended to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research, we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on api call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on cbr and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (cpted): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S Kapetanakis A Filippoupolitis G Loukas et al ldquoProfilingcyber attackers using case-based reasoningrdquo in Proceedings ofthe 19th UK Workshop on Case-Based Reasoning (UKCBR2014) Cambridge UK December 2014

[43] R Al-Zaidy B C Fung A M Youssef et al ldquoMining criminalnetworks from unstructured text documentsrdquo Digital In-vestigation vol 8 no 3-4 pp 147ndash160 2012

[44] M Zulfadhilah Y Prayudi and I Riadi ldquoCyber profilingusing log analysis and k-means clusteringrdquo InternationalJournal of Advanced Computer Science and Applicationsvol 7 no 7 pp 430ndash435 2016

[45] S V Nath ldquoCrime pattern detection using data miningrdquo inProceedings of the 2006 IEEEWICACM International Con-ference on Web Intelligence and Intelligent Agent TechnologyWorkshops pp 41ndash44 Hong Kong China December 2006

[46] ITPnet ldquoSyria Egypt crises spur escalation of me cyber at-tacksrdquo 2013 httpwwwitpnet594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack

[47] A McEnery and R Xiao ldquoCharacter encoding in corpusconstructionrdquo in Developing Linguistic Corpora A Guide toGood Practice Oxbow Books Ltd Oxford UK 2005

[48] B Bos T Ccedilelik I Hickson et al ldquoCascading style sheets level2 revision 1 (CSS 21) specificationrdquo W3C Working Draft2005 httpwwww3orgTRCSS21


[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


Input: TCs (Tested_DB) /* The Tested_DB indicates the cases-centric DB */
       RC (Retrieved_Case) ⟵ {Encoding_RC, IP_RC, Domain_RC, Date_RC, OS_RC} /* RC means one of the retrieved cases */
       W (Weight) ⟵ {Encoding_W, IP_W, Domain_W, Date_W, OS_W}
Output: Similarity_score
(1) TC{Encoding_TC, IP_TC, Domain_TC, Date_TC, OS_TC} ⟵ TCs
(2) While RC in TCs do
(3)   if Encoding_RC = Encoding_TC then
(4)     Encoding_similarity_value ⟵ 1.0
(5)   else
(6)     Encoding_similarity_value ⟵ 0.0
(7)   end
(8)   IP_RC{Octet A_RC, Octet B_RC, Octet C_RC, Octet D_RC}, IP_TC{Octet A_TC, Octet B_TC, Octet C_TC, Octet D_TC}
(9)   if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) ∧ (Octet C_RC = Octet C_TC) ∧ (Octet D_RC = Octet D_TC) then
(10)    IP_similarity_value ⟵ 1.0
(11)  else if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) ∧ (Octet C_RC = Octet C_TC) then
(12)    IP_similarity_value ⟵ 0.75
(13)  else if (Octet A_RC = Octet A_TC) ∧ (Octet B_RC = Octet B_TC) then
(14)    IP_similarity_value ⟵ 0.5
(15)  else if (Octet A_RC = Octet A_TC) then
(16)    IP_similarity_value ⟵ 0.25
(17)  else
(18)    IP_similarity_value ⟵ 0.0
(19)  end
(20)  Domain_RC{ServiceName_RC, gTLD_RC, ccTLD_RC}, Domain_TC{ServiceName_TC, gTLD_TC, ccTLD_TC}
(21)  if an identical domain then
(22)    Domain_similarity_value ⟵ 1.0
(23)  else if (ServiceName_RC = ServiceName_TC) ∧ ((gTLD_RC = gTLD_TC) ∨ (ccTLD_RC = ccTLD_TC)) then
(24)    Domain_similarity_value ⟵ 0.8
(25)  else if (gTLD_RC = gTLD_TC) ∧ (ccTLD_RC = ccTLD_TC) then
(26)    Domain_similarity_value ⟵ 0.3
(27)  else if (ServiceName_RC = ServiceName_TC) then
(28)    Domain_similarity_value ⟵ 0.1
(29)  else if (ccTLD_RC = ccTLD_TC) then
(30)    Domain_similarity_value ⟵ 0.1
(31)  else if (gTLD_RC = gTLD_TC) then
(32)    Domain_similarity_value ⟵ 0.1
(33)  else
(34)    Domain_similarity_value ⟵ 0.0
(35)  end
(36)  Date_variance ⟵ |Date_RC − Date_TC| /* It converts a date format of year, month, and day (i.e., yyyy-mm-dd) into a numeric day count */
(37)  if 0 ≤ Date_variance ≤ 365 then
(38)    Date_similarity_value ⟵ 1.0
(39)  else if 365 < Date_variance ≤ 1095 then
(40)    Date_similarity_value ⟵ 0.75
(41)  else if 1095 < Date_variance ≤ 1825 then
(42)    Date_similarity_value ⟵ 0.5
(43)  else if 1825 < Date_variance ≤ 2555 then
(44)    Date_similarity_value ⟵ 0.25
(45)  else if 2555 < Date_variance then
(46)    Date_similarity_value ⟵ 0.0
(47)  end
(48)  if OS_RC = OS_TC then
(49)    OS_similarity_value ⟵ 1.0
(50)  else
(51)    OS_similarity_value ⟵ 0.0
(52)  end
(53)  Similarity_score ⟵ (Encoding_similarity_value × Encoding_W) + (IP_similarity_value × IP_W) + (Domain_similarity_value × Domain_W) + (Date_similarity_value × Date_W) + (OS_similarity_value × OS_W)
(54)  return Similarity score between RC and TC
(55) end while

ALGORITHM 1: Similarity measure module.
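To make the weighted scoring of Algorithm 1 concrete, the following is a minimal Python sketch rather than the implementation used in the study. The weights and step values follow Table 2 and Algorithm 1; the dictionary layout of a case, the function names, the pre-split domain tuples, and the rule that a missing gTLD or ccTLD never counts as a match are assumptions made for illustration. The two example cases are hypothetical.

```python
from datetime import date

# Weights per case vector, taken from Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(ip_rc, ip_tc):
    """Score 0.25 for each leading octet shared by two dotted IPv4 strings."""
    matched = 0
    for octet_rc, octet_tc in zip(ip_rc.split("."), ip_tc.split(".")):
        if octet_rc != octet_tc:
            break
        matched += 1
    return 0.25 * matched


def domain_similarity(dom_rc, dom_tc):
    """Compare (service_name, gtld, cctld) tuples; None marks a missing part."""
    if dom_rc == dom_tc:
        return 1.0
    # Assumption: a part that is missing (None) never counts as a match.
    name, gtld, cctld = (a is not None and a == b for a, b in zip(dom_rc, dom_tc))
    if name and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if name or gtld or cctld:
        return 0.1
    return 0.0


def date_similarity(date_rc, date_tc):
    """Map the absolute gap in days to the stepped values of Algorithm 1."""
    gap = abs((date_rc - date_tc).days)
    for limit, value in ((365, 1.0), (1095, 0.75), (1825, 0.5), (2555, 0.25)):
        if gap <= limit:
            return value
    return 0.0


def similarity_score(rc, tc, weights=WEIGHTS):
    """Weighted sum of the five per-vector similarity values."""
    values = {
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(values[key] * weights[key] for key in weights)


# Hypothetical cases for illustration only (not taken from the dataset).
rc = {"encoding": "windows-1252", "ip": "198.51.100.7",
      "domain": ("example", "com", None), "date": date(2013, 3, 20), "os": "Windows"}
tc = {"encoding": "windows-1252", "ip": "198.51.44.9",
      "domain": ("sample", "com", None), "date": date(2014, 2, 6), "os": "Windows"}
print(round(similarity_score(rc, tc), 3))  # 0.765
```

In this example the encoding and OS match, two leading octets match, only the gTLD matches, and the dates are 323 days apart, so the score is 0.5 + 0.2 × 0.5 + 0.15 × 0.1 + 0.1 × 1.0 + 0.05 = 0.765.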


updating the weight value is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).

The distance between some case vectors cannot be estimated directly because they mix numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the octets at the same position in two given IP addresses. An IP address is composed of four numeric octets separated by “.”. In the present study, we checked whether the octets of RC and TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. We suggest the discrete similarity values between two IP addresses shown in Table 2. The proposed approach is advantageous in that it enables an efficient distance calculation between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, com and co are used for commercial companies or organizations; org and or are used for nonprofit organizations; and go and gov are used for government and state agencies. The ccTLD refers to a country code top-level domain in the domain rule and is a unique sign that represents a specific region, such as kr, cn, br, and uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is built according to the characteristics of the groups, organizations, or corporations that the gTLD is intended for. Service names vary widely depending on the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a dedicated rule for estimating the similarity of the domain, as depicted in Table 2.
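As an illustration of the decomposition described above, the sketch below splits a host name into service name, gTLD, and ccTLD. It is a simplified heuristic and not the parser used in the study: it assumes that a trailing two-letter label is a ccTLD and that the gTLD is one of a small, hand-picked set (`GTLD_CATEGORY`, a hypothetical lookup table based on the categories named in the text). A production parser would need a maintained suffix list.

```python
from urllib.parse import urlparse

# Hypothetical lookup of second-level/generic labels to organization categories.
GTLD_CATEGORY = {
    "com": "commercial", "co": "commercial",
    "org": "nonprofit", "or": "nonprofit",
    "gov": "government", "go": "government",
    "edu": "educational", "mil": "military", "net": "network",
}


def split_domain(host_or_url):
    """Return (service_name, gtld, cctld); missing parts are None."""
    if "://" in host_or_url:
        host = urlparse(host_or_url).hostname
    else:
        host = host_or_url
    labels = host.lower().strip(".").split(".")

    cctld = None
    # Assumption: a trailing two-letter alphabetic label is a country code.
    if len(labels) > 1 and len(labels[-1]) == 2 and labels[-1].isalpha():
        cctld = labels.pop()

    gtld = None
    if len(labels) > 1 and labels[-1] in GTLD_CATEGORY:
        gtld = labels.pop()

    # The label immediately left of the TLD part is taken as the service name.
    service_name = labels[-1] if labels else None
    return service_name, gtld, cctld


# Hypothetical host names for illustration.
print(split_domain("shop.example.co.kr"))      # ('example', 'co', 'kr')
print(split_domain("http://www.example.org"))  # ('example', 'org', None)
```

The resulting tuples can then be compared component by component according to the domain rule in Table 2.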

Furthermore, we defined the attack date similarity. As in offline criminal investigation, if the times at which crimes occurred are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals’ patterns. The similarity value depends on the time difference between a new case and existing cases. As shown in Table 2, the similarity value is assigned according to the date gap between two cases that occurred on different dates. In summary, according to the degree of similarity within each range, the similarity values of the attack IP address, domain, and attack date were set to values between 0 and 1.

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that can capture the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, based on the probability of the individual data attributes. This algorithm does not restrict the number of clusters in the parameters but automatically generates a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that each data item belongs to a cluster by maximizing the correlation and dependence among the objects. For clustering, we applied the EM algorithm to the 80,948 data items, out of 212,093, that had encoding, gTLD, ccTLD, and OS information. The character encoding was normalized into groups of related code units (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if we can find similar combinations of letters or numbers, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use as a similarity measure in the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
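The idea of EM clustering over nominal attributes can be sketched as a mixture of independent categorical distributions. The code below is an illustrative, pure-Python sketch and not the tool used in the study: the number of clusters is fixed by the caller (selecting it by cross-validation is omitted), Laplace smoothing is added to avoid zero probabilities, and the function name `em_categorical` and the toy rows are hypothetical.

```python
import math
import random


def em_categorical(rows, n_clusters, n_iter=50, seed=0):
    """Fit a mixture of independent categorical distributions with EM.

    rows: list of equal-length tuples of nominal values
    (e.g. encoding, gTLD, ccTLD, OS). Returns (pi, theta, resp).
    """
    rng = random.Random(seed)
    n_feat = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_feat)]

    # Start from random soft assignments (responsibilities).
    resp = []
    for _ in rows:
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])

    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-cluster value probabilities.
        mass = [sum(r[c] for r in resp) for c in range(n_clusters)]
        pi = [m / len(rows) for m in mass]
        theta = []
        for c in range(n_clusters):
            per_feature = []
            for j in range(n_feat):
                counts = {v: 1.0 for v in values[j]}  # Laplace smoothing
                for row, r in zip(rows, resp):
                    counts[row[j]] += r[c]
                total = sum(counts.values())
                per_feature.append({v: k / total for v, k in counts.items()})
            theta.append(per_feature)

        # E-step: posterior probability of each cluster for each row.
        resp = []
        for row in rows:
            joint = [
                pi[c] * math.prod(theta[c][j][row[j]] for j in range(n_feat))
                for c in range(n_clusters)
            ]
            total = sum(joint)
            resp.append([p / total for p in joint])
    return pi, theta, resp


# Toy data: two clearly different attack profiles (hypothetical values).
rows = [("windows-1252", "com", "kr", "Windows")] * 20 + \
       [("gb2312", "org", "cn", "Linux")] * 20
pi, theta, resp = em_categorical(rows, n_clusters=2)
labels = [max(range(2), key=r.__getitem__) for r in resp]
print(labels[0] != labels[-1])
```

Each row of `resp` holds the probability that a case belongs to each cluster, which corresponds to the cluster membership probability described in the text.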

The donut charts show the different features from outside to inside (in order), with the share of each feature value marked by a different colour within the same circle. Each cluster consists of four circles, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents how many cases a cluster contains among all website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, similar patterns were found among the clusters. The most conspicuously similar clusters were 4 and 7, which share the use of Arabic and Chinese and attacks against industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all website defacement cases collected from the zone-h.org site. The results of the clustering process help make the similarity between new and existing cases concrete. Of course, when a large number of new cases flow into the database and the clustering process is performed again on the updated dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. Because the assumption that attackers tend to use similar or unique attack methods is not always valid, it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers’ hacking skills advance and, in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedure unfolded in four steps, which are depicted in detail in Figure 7. In equation (2), “K” denotes all hackers within the database.

Table 2: Value and the weight for the similarity score by the case vector. All of the values of the similarity score are normalized to values between 0 and 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g., 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g., 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g., 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage the network and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atackorg |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). Legend values: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where “m” means the past cases that are within the defined scope concerning a randomly selected hacker “k”.

(i) Step 1 (selection): the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2 (case labelling): we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each previous attack case with the hacker name.

(iii) Step 3 (case extraction): we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4 (scoring): the similarity scores, computed from the values and weights by case vector (see Table 2), were sorted in descending order. Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker’s name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Figure 7: The developed testing procedures from step 1 to step 4.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker’s past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were expressed as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., “top N percent,” was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases within the top N percent of this list formed the top N scope (see Figure 9). Likewise, the ratio R of a randomly selected hacker is the share of that hacker’s past attack cases that fell within the defined N scope (see Figure 9).
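The relationship between the top N scope and the ratio R can be illustrated with a short sketch. It assumes the scoring list is available as a list of hacker labels already sorted by descending similarity score; the function names and the toy ranking are hypothetical, and rounding the scope size up with `math.ceil` is an assumption, since the text does not state how fractional cut-offs are handled.

```python
import math


def ratio_in_top_scope(ranked_hackers, hacker, top_percent):
    """Ratio R (in percent) of a hacker's past cases inside the top-N-percent scope.

    ranked_hackers: one hacker label per case, sorted by descending similarity.
    """
    total = ranked_hackers.count(hacker)
    if total == 0:
        return 0.0
    # Assumption: the scope size is rounded up to a whole number of cases.
    cutoff = math.ceil(len(ranked_hackers) * top_percent / 100)
    in_scope = ranked_hackers[:cutoff].count(hacker)
    return 100.0 * in_scope / total


def is_identified(ranked_hackers, hacker, top_percent, min_ratio):
    """A hacker counts as identified if at least min_ratio percent of the
    hacker's past cases fall within the top-N-percent scope."""
    return ratio_in_top_scope(ranked_hackers, hacker, top_percent) >= min_ratio


# Toy scoring list of 10 cases, most similar first (hypothetical labels).
ranked = ["A", "A", "B", "A", "C", "B", "A", "C", "C", "B"]
print(ratio_in_top_scope(ranked, "A", 30))  # 50.0
print(ratio_in_top_scope(ranked, "A", 50))  # 75.0
print(is_identified(ranked, "A", 50, 70))   # True
```

Counting, for each combination of top N percent and ratio R, how many of the sampled hackers satisfy `is_identified` yields a grid of the kind summarized in Figure 10.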

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because a hacker or hacking group that attacks only the same or similar targets is rare, it is impossible to obtain a high similarity score for all cases of a hacker, even if the hacking cases were caused by the same hacker. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used on LG U+’s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers’ background is European. “HASTATI” was the word written on the KBS homepage, meaning the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows screenshots of the defaced websites at that time.

Figure 8: The number of website defacement attack cases in the past of each hacker (x-axis: hacker; y-axis: number of cases).

Figure 9: Scoring step on the top N scope and the ratio R.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, the skull image, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach is consistent with the CBR process used in cybercrime investigation (see Figure 2). Two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases given as input values, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in an analysis report of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through a comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages, based on the Latin language system, and typically launched attacks against profit organizations located in Western Europe. Notifiers Oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (x-axis: criterion of the top N (%), from top 1 to top 30; y-axis: number of identified hackers; series: ratio of the attack cases (%), from 50 to 100).

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole set of cases. Figure 12(a) shows the distribution of the similarity scores of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. The distribution was obtained by repeating the calculation of the similarity score of two randomly selected website defacement cases 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
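The random-pair baseline described above can be sketched as a simple sampling loop. This is an illustrative sketch only: `toy_score` is a hypothetical stand-in that uses just the encoding and OS weights from Table 2, the case tuples are made up, and any scoring function taking two cases could be passed in its place.

```python
import random
from statistics import mean


def random_pair_scores(cases, score_fn, n_pairs=10_000, seed=0):
    """Score n_pairs randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        case_a, case_b = rng.sample(cases, 2)
        scores.append(score_fn(case_a, case_b))
    return scores


def toy_score(case_a, case_b):
    """Stand-in score: exact matches on encoding (0.5) and OS (0.05) only."""
    encoding_match = 1.0 if case_a[0] == case_b[0] else 0.0
    os_match = 1.0 if case_a[1] == case_b[1] else 0.0
    return 0.5 * encoding_match + 0.05 * os_match


# Hypothetical cases as (encoding, OS) tuples.
cases = [
    ("windows-1252", "Windows"), ("windows-1252", "Linux"),
    ("gb2312", "Windows"), ("iso-8859-9", "Linux"),
    ("gb2312", "Linux"), ("windows-1252", "Windows"),
]
baseline = random_pair_scores(cases, toy_score)
print(len(baseline), round(mean(baseline), 3))
```

Comparing the score of a case under investigation with the mean and spread of such a baseline indicates whether that score is high relative to arbitrary pairs of cases.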

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To determine the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker’s own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case vector | Retrieved case: DarkSeoul (DS) | Tested case (notifier): Hmei7 | Tested case (notifier): d3b_X | Tested case (notifier): StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case vector | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case (notifier): Oaddah | Tested case (notifier): MTRiX | Tested case (notifier): EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the top three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is intricately entangled or hidden and therefore hard to present.
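The per-hacker attack rates used in the follow-up investigation (shares of a hacker's cases by year or by gTLD) could be tabulated as in the sketch below. The record layout, the function name `attack_rates`, and the toy records are assumptions made for illustration only and are not taken from the dataset.

```python
from collections import Counter, defaultdict

def attack_rates(cases, field):
    """Return {hacker: {value: percent of that hacker's cases}} for one field."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case["hacker"]][case[field]] += 1
    rates = {}
    for hacker, counter in counts.items():
        total = sum(counter.values())
        rates[hacker] = {value: round(100.0 * n / total, 2)
                         for value, n in counter.items()}
    return rates

# Hypothetical records, one per defacement case
cases = [
    {"hacker": "A", "year": 2013, "gtld": "com"},
    {"hacker": "A", "year": 2014, "gtld": "com"},
    {"hacker": "A", "year": 2014, "gtld": "gov"},
    {"hacker": "A", "year": 2014, "gtld": "com"},
    {"hacker": "B", "year": 2003, "gtld": "org"},
]
by_year = attack_rates(cases, "year")
by_gtld = attack_rates(cases, "gtld")
```

The resulting nested dictionaries map directly onto the rows (year or domain) and columns (hacker) of a table such as Table 5.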

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

[Figure 12 plots: x-axis, similarity score; y-axis, frequency. Panel (a) annotation: Mean = 0.2930, Var = 0.0866. Panel (b) annotations: the highest similarity score is 0.69 on the DarkSeoul case and 0.615 on the Sony Pictures Entertainment case; the two subpanels also report Mean = 0.114, Var = 0.1500 and Mean = 0.063, Var = 0.0370.]

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and distribution of the similarity value between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system.

[Figure 13 plot: x-axis, mean value of the similarity score; y-axis, frequency.]

Figure 13: Distribution of the similarity score for 100 randomly selected hackers.

For that reason, we aimed to demonstrate the practicability of the proposed methodology as a proof of concept. Therefore, we focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. For instance, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing using CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning processes. Also, based on a large real-world dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

| Case vector | DS case: Hmei7 | DS case: d3b_X | DS case: StifLer | SPE case: Oaddah | SPE case: MTRiX | SPE case: EL_MuHaMMeD |
|---|---|---|---|---|---|---|
| Domain: Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Domain: Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Domain: Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Domain: Gov | 12.16 | 6.51 | - | - | 1.06 | - |
| Year: 2002 | - | - | - | - | 10.74 | - |
| Year: 2003 | - | - | - | - | 89.08 | - |
| Year: 2006 | - | - | - | - | - | - |
| Year: 2007 | 0.09 | - | - | - | 0.18 | - |
| Year: 2008 | - | - | - | - | - | - |
| Year: 2009 | 3.15 | - | - | - | - | 99.57 |
| Year: 2010 | 0.09 | - | - | - | - | - |
| Year: 2011 | 0.34 | - | - | - | - | - |
| Year: 2012 | 3.40 | - | - | 100.00 | - | - |
| Year: 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| Year: 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| Year: 2015 | - | 1.07 | - | - | - | - |

[Figure 14 panels: alluvial diagrams with the axes Hacker, Year, ccTLD, gTLD, and Attack (No/Yes). Panel (a): hackers d3b~x, Hmei7, and StifLer; years 2009, 2012, 2013, and 2014; ccTLD Australia, Brazil, France, Germany, Indonesia, Italy, Korea, Netherlands, New Zealand, Poland, Russia, Thailand, Turkey, and Unknown; gTLD com, gov, net, org, and Unknown. Panel (b): hackers EL_MuHaMMeD, MTRiX, and oaddah; years 2002, 2003, 2009, and 2012; ccTLD Brazil, Canada, Denmark, France, Greece, Hong Kong, Italy, and Unknown; gTLD com, net, org, and Unknown.]

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b).

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5 (Limitations and Discussion), a cross-data analysis with various other data sources should be considered. Said differently, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and sources may also help to reason about the composition requirements of a crime and to narrow the scope of the investigation. Furthermore, the proposed methodology can be expanded to exchange incident information compatibly with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research, we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights to trace hackers' behavioural patterns and to estimate their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on api call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on cbr and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (cpted): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., "Profiling cyber attackers using case-based reasoning," in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., "Mining criminal networks from unstructured text documents," Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, "Cyber profiling using log analysis and k-means clustering," International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, "Crime pattern detection using data mining," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, "Syria, Egypt crises spur escalation of me cyber attacks," 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, "Character encoding in corpus construction," in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., "Cascading style sheets level 2 revision 1 (CSS 2.1) specification," W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.


[49] W. Stuckey, "Massive sony breach sheds light on murky hacker universe," 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, "Sony pictures malware tied to Seoul, 'Shamoon' cyber-attacks," 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, "Sony hack signs point to North Korea," 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, "Case-based reasoning: an introduction," Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, "Mapping change in large networks," PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, "STIX/TAXII standards," 2017-2018, https://oasis-open.github.io/cti-documentation.



updating the weight value is an issue worth addressing in further research. In the present study, we set the weight values for the case vector, including the encoding, IP address, domain, attack date, and OS (see Table 2).
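As a minimal sketch of how such weights could be combined, the snippet below computes a weighted sum of the five per-feature similarity values using the weights listed in Table 2. The function name `weighted_similarity`, the dictionary keys, and the example component values are illustrative assumptions; a plain weighted sum is assumed as the aggregation rule.

```python
# Weights per case vector, as listed in Table 2.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}

def weighted_similarity(components):
    """Combine per-feature similarity values (each in [0, 1]) into one score.

    Assumption: the overall score is a plain weighted sum; features that
    are missing from `components` contribute 0.
    """
    for name, value in components.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown case vector: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} similarity must lie in [0, 1]")
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

# Hypothetical component values for one pair of cases.
example = {"encoding": 1.0, "ip": 0.25, "domain": 0.1, "date": 0.75, "os": 1.0}
score = round(weighted_similarity(example), 3)
```

With these assumed component values (same encoding and OS, only the first IP octet shared, only the gTLD shared, and a date gap within 18 months), the sketch yields 0.69, which coincides with the 0.690 listed for Hmei7 in Table 4. This is only a consistency check under the stated assumptions.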

The distance between some case vectors cannot be estimated directly because they contain mixed numerical and nominal data (such as the IP address range and the domain name). For this reason, to calculate the distance between nominal data, we defined a discrete similarity measure. The similarity of IP addresses was calculated by comparing the corresponding octets of two given IP addresses. An IP address is composed of four numeric octets separated by ".". In the present study, we checked whether the octets of the RC and the TC were identical, from the 1st octet to the 4th octet. Subsequently, a similarity value was assigned to the IP address vector. The discrete similarity values between two IP addresses are listed in Table 2. The proposed approach is advantageous in that it enables efficient calculation of the distance between IP addresses.

(i) IP address of RC: zzz.yyy.xxx.www

(ii) IP address of TC: zzz.yyy.xxx.www
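The octet comparison above can be sketched as follows. The sketch assumes that matching proceeds from the first octet and stops at the first mismatch, so each matching leading octet contributes 0.25 in line with the values in Table 2; `ip_similarity` is a hypothetical name, and the addresses in the usage note are documentation-range examples rather than data from the paper.

```python
def ip_similarity(ip_a, ip_b):
    """Discrete IP similarity: 0.25 per matching leading octet."""
    octets_a = ip_a.strip().split(".")
    octets_b = ip_b.strip().split(".")
    if len(octets_a) != 4 or len(octets_b) != 4:
        raise ValueError("expected dotted-quad IPv4 addresses")
    matched = 0
    for a, b in zip(octets_a, octets_b):
        if int(a) != int(b):
            break  # stop at the first differing octet
        matched += 1
    return matched * 0.25
```

For example, `ip_similarity("192.0.2.10", "192.0.2.99")` gives 0.75 (three leading octets match), while `ip_similarity("192.0.2.10", "10.0.2.10")` gives 0.0 because the first octet already differs.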

Meanwhile, the similarity between domains is calculated according to their domain properties. The domain is composed of the gTLD, ccTLD, and service name. The gTLD refers to a generic top-level domain in the domain rule. For instance, .com and .co are used for commercial companies or organizations, .org and .or are used for nonprofit organizations, and .go and .gov are used for government and state agencies. In addition, the ccTLD refers to a country code top-level domain in the domain rule and is a unique label that represents a specific region, such as .kr, .cn, .br, and .uk. DNS maps an IP address to a unique domain name, which is easy to remember because it consists of a combination of letters and numbers. Within the domain name, the service name is chosen according to the characteristics of the groups, organizations, or corporations that the gTLD is intended for. Service names are diverse and vary with the category of the gTLD, such as educational institutions, commercial enterprises, military organizations, nonprofit organizations, and government and state agencies. Unlike for the other case vectors, we set a rule for estimating the similarity of the domain, as depicted in Table 2.
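One possible reading of the domain rule in Table 2 is sketched below. It assumes that each domain has already been split into a service name, a gTLD, and a ccTLD, and that a component missing from either domain never counts as a match; the `Domain` tuple and `domain_similarity` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional

class Domain(NamedTuple):
    service: Optional[str]
    gtld: Optional[str]
    cctld: Optional[str]

def _norm(d):
    """Lower-case each present component; keep missing ones as None."""
    return tuple(part.lower() if part else None for part in d)

def domain_similarity(x, y):
    a, b = _norm(x), _norm(y)
    if a == b:
        return 1.0  # identical domain
    # A component matches only when present in both domains and equal.
    service, gtld, cctld = (p is not None and p == q for p, q in zip(a, b))
    if service and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if service or gtld or cctld:
        return 0.1
    return 0.0
```

Under this reading, two unrelated `.com` sites without a ccTLD score 0.1 (only the gTLD matches), whereas two different agencies under the same gTLD and ccTLD score 0.3.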

Furthermore, we defined the attack date similarity. As in offline criminal investigations, if the times at which crimes occurred are close, we can analyse the cases as similar crimes through a cross-analysis of the target area and the criminals' patterns. The similarity value depends on the time difference between a new case and the existing cases. As shown in Table 2, the similarity value is assigned according to the gap between the dates on which the two cases occurred. In summary, the similarity values of the attack IP address, domain, and attack date were each set to a value between 0 and 1 according to the degree of similarity within a defined range.
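The date rule can be sketched as a lookup over the gap bands in Table 2 (6, 18, 30, and 42 months). The average month length used to convert days to months and the inclusive band boundaries are assumptions of this sketch, as is the name `date_similarity`.

```python
from datetime import date

# (maximum gap in months, similarity value), following Table 2.
DATE_BANDS = [(6, 1.0), (18, 0.75), (30, 0.5), (42, 0.25)]

def date_similarity(d1, d2):
    """Map the gap between two attack dates to a discrete similarity value."""
    gap_days = abs((d1 - d2).days)
    gap_months = gap_days / 30.4375  # assumption: average month length
    for limit, value in DATE_BANDS:
        if gap_months <= limit:  # assumption: boundaries are inclusive
            return value
    return 0.0  # more than about 42 months apart
```

For instance, the dates 20 Mar 2013 and 6 Feb 2014 are 323 days (about 10.6 months) apart, so `date_similarity(date(2013, 3, 20), date(2014, 2, 6))` falls into the 18-month band and returns 0.75.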

3.3.2. Clustering Processing. Merely sorting the data and visually analysing them makes it difficult for an investigator to infer the correlations and similarity among the potential features of incidents. Hence, an advanced tool that can capture the complex underlying structures and data properties is required. Accordingly, in the present study, we conducted the clustering process using the EM algorithm, which is based on the probability of the individual data attributes. This algorithm does not require the number of clusters to be fixed as a parameter but automatically determines a number of valid clusters by cross-validation. Thereafter, the algorithm determines the probability that a data item belongs to a cluster by maximizing the correlation and dependence among the objects. We applied the EM algorithm to 80,948 data items that had encoding, gTLD, ccTLD, and OS information, out of 212,093 data items, for clustering. The character encoding was normalized into groups of related code sets (ISO-8859, MS Windows character set, GB, and EUC series). We excluded Unicode from clustering because it is too general and accounts for the majority of the collected encoding data. In the case of the service name, even if we can find similar combinations of letters or numbers, it is not easy to find commonality or relevance between them. Therefore, it is not suitable for use as the similarity measure of the reasoning engine. Consequently, characteristics and metadata concerning the 12 clusters were obtained (see Table 3). These clustering results are also visualized and stored in the database (see Figure 6).
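To illustrate the general idea of EM clustering on nominal attributes, the sketch below fits a mixture of independent categorical features. It is not the authors' implementation: the number of clusters `k` is fixed by the caller (the cross-validated choice of the cluster count is not shown), the independence of features within a cluster is an assumption, and `em_categorical` and the toy rows are hypothetical.

```python
import math
import random

def em_categorical(rows, k, iters=50, seed=0, alpha=1e-2):
    """EM for a mixture of independent categorical features.

    rows: equal-length tuples of hashable values.
    Returns (weights, params, responsibilities).
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    values = [sorted({r[j] for r in rows}) for j in range(d)]
    # Start from random soft assignments of rows to clusters.
    resp = []
    for _ in range(n):
        raw = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(raw)
        resp.append([x / total for x in raw])
    for _ in range(iters):
        # M-step: mixing weights and smoothed per-cluster value frequencies.
        weights = [sum(resp[i][c] for i in range(n)) / n for c in range(k)]
        params = []
        for c in range(k):
            feats = []
            for j in range(d):
                counts = {v: alpha for v in values[j]}
                for i in range(n):
                    counts[rows[i][j]] += resp[i][c]
                z = sum(counts.values())
                feats.append({v: cnt / z for v, cnt in counts.items()})
            params.append(feats)
        # E-step: posterior cluster probabilities for every row.
        for i in range(n):
            logp = [math.log(weights[c] + 1e-300)
                    + sum(math.log(params[c][j][rows[i][j]]) for j in range(d))
                    for c in range(k)]
            top = max(logp)
            exps = [math.exp(x - top) for x in logp]
            total = sum(exps)
            resp[i] = [e / total for e in exps]
    return weights, params, resp

# Hypothetical rows: (encoding group, gTLD, OS)
rows = [("windows", "com", "windows")] * 20 + [("gb", "org", "linux")] * 20
weights, params, resp = em_categorical(rows, k=2)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
```

Each row is assigned to the cluster with the highest posterior probability; on the toy data the two groups of identical rows end up in different clusters.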

The donut charts show the different features from outside to inside (in order), with the share of each feature value marked by a different colour within the same ring. Each cluster consists of four rings, which represent, from the outside to the inside, the encoding, gTLD, ccTLD, and OS. The percentage in Table 3 represents how many cases one cluster contains among all the website defacement cases collected from the zone-h.org site. The representative hacker is a notable hacker or hacking group among the members of each cluster. As shown in Figure 6, similar patterns were found among the clusters. The most conspicuously similar clusters were Clusters 4 and 7, which shared the features of using Arabic and Chinese and of attacking industrial organizations whose headquarters are located in Western Europe. The cases in Clusters 4 and 7 accounted for 41.29 percent of all the website defacement cases collected from the zone-h.org site. The results of the clustering process contribute to the concretization of the similarity between the new and existing cases. Naturally, if a large number of new cases flow into the database and the clustering process is performed again on the dataset, the clustering result may take on a different pattern.

4. Application

4.1. Experimental Results and Analysis. The assumption that attackers tend to use similar or unique attack methods is not always valid, so it is difficult to evaluate the accuracy of the similarity mechanism. As time progresses, attackers' hacking skills advance, and, in addition, the attack plan, campaign purpose, and target groups can change depending on the situation. Therefore, in the present study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedure unfolded in the following four steps and is depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 2: Value and the weight for the similarity score by the case vector. All of the values of the similarity score are normalized to between 0 and 1.

| Case vector | Weight | Impact | The similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g., 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octet are matched (e.g., 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octet are matched (e.g., 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g., 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g., 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked organizations that manage the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked organizations that manage the network and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked profit organizations and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked profit organizations and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked profit organizations, educational institutions, government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). [Panels: Clustering 00 through Clustering 11, each on a 0 to 100 scale. Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Window, Linux, Unix, MacOS).]


(i) Step 1: selection. The measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2: case labelling. We retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each of these cases with the hacker’s name.

(iii) Step 3: case extraction. We selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4: scoring. The similarity scores, computed from the value and the weight of each case vector (see Table 2), were sorted in descending order. Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent):

$$N_{\mathrm{scope}} = \frac{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{scope},K}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all},K}\right)} \times 100, \tag{2}$$

$$R_{k} = \frac{\mathrm{Count}\left(\mathrm{Cases}_{m,k}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all},k}\right)}, \tag{3}$$

where “m” means the past cases that are within the defined scope concerning a randomly selected hacker “k”.
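Steps 3 and 4 and the ratio of equation (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the similarity function is passed in as a parameter, `field_match` is a toy stand-in for the weighted measure of equation (1), the retrieved case is assumed to be excluded from the stored cases, and the top N scope is assumed to be N percent of all stored cases. All function and field names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]  # e.g. {"hacker": ..., "encoding": ..., "os": ...}


def field_match(rc: Case, tc: Case) -> float:
    """Toy stand-in for the weighted similarity: share of matching fields."""
    fields = ("encoding", "os")
    return sum(rc[f] == tc[f] for f in fields) / len(fields)


def rank_cases(rc: Case, tcs: List[Case],
               similarity: Callable[[Case, Case], float]) -> List[Tuple[float, Case]]:
    """Score every stored case against the retrieved case, drop zero scores,
    and sort the rest in descending order of similarity."""
    scored = [(similarity(rc, tc), tc) for tc in tcs]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def ratio_in_top_n(ranked: List[Tuple[float, Case]], tcs: List[Case],
                   hacker: str, n_percent: int) -> float:
    """Share of a hacker's stored cases that appear in the top N percent scope."""
    scope_size = len(tcs) * n_percent // 100
    own_total = sum(1 for tc in tcs if tc["hacker"] == hacker)
    in_scope = sum(1 for _, tc in ranked[:scope_size] if tc["hacker"] == hacker)
    return in_scope / own_total if own_total else 0.0
```

Passing the similarity function as an argument keeps the ranking logic independent of how the case vector and its weights are defined.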

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker’s name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker’s past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., “top N percent,” was based on the result derived from the similarity measurement: attack cases were sorted in order of high similarity score, and the cases were therefore within the range of the top N scope (see Figure 9). Also, regarding the hacking case ratio of a randomly selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).

Figure 7: The developed testing procedures from Step 1 to Step 4. [Flow diagram: Step 1 (selection) randomly selects 100 hackers from the database (1 TheBuGz, 2 Hmei7, 3 3xp1r3, ..., 98 S3cure, 99 drm1st3r, 100 Lulz53c); Step 2 (case labelling) labels Case 1 to Case m of each hacker; Step 3 (case extraction) takes the most recent case of each hacker as the retrieved case and compares it with the cases-centric DB (columns: hacker name, date, encoding, IP address, domain, OS, score); Step 4 (scoring) sums Distance(RC_cv, TC_cv) × Weight_cv over the case vectors cv.]

Figure 10 shows the number of hackers identified for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of hackers identified in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because hackers or hacking groups that attack only the same or similar targets are rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.
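The grid behind Figure 10 can be tabulated as follows. This sketch assumes one particular reading of “identified”: a hacker counts as identified for a pair (R, N) when at least R percent of that hacker’s past cases fall within the top N percent scope. The paper does not state this rule explicitly, and the names `N_SCOPES`, `R_LEVELS`, and `count_identified` are hypothetical.

```python
from typing import Dict

N_SCOPES = [1, 3, 5, 10, 15, 20, 25, 30]  # top N percent criteria
R_LEVELS = [50, 60, 70, 80, 90, 100]      # ratio R criteria in percent


def count_identified(ratios: Dict[str, Dict[int, float]]) -> Dict[int, Dict[int, int]]:
    """ratios[hacker][n] is the share (0..1) of that hacker's past cases
    inside the top n percent. Returns counts[r][n], the number of hackers
    whose share reaches at least r percent (assumed identification rule)."""
    counts: Dict[int, Dict[int, int]] = {}
    for r in R_LEVELS:
        counts[r] = {}
        for n in N_SCOPES:
            counts[r][n] = sum(
                1 for per_n in ratios.values() if per_n[n] * 100 >= r
            )
    return counts
```

With this rule the counts can only grow as N widens or R is lowered, which matches the trend described for Figure 10.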

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. With such a reduction in the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

Figure 8: The number of website defacement attack cases in the past of each hacker. [Plot: X-axis, hacker (0 to 100); Y-axis, number of cases (0 to 5000).]

Figure 9: Scoring step on the top N scope and the ratio R. [Diagram: in Step 4 (scoring), the sorted cases of a hacker (e.g., 1 TheBuGz; columns: hacker name, date, encoding, IP address, domain, OS, score) are checked against the top N scope (1 to 30 percent) and the ratio R (50 to 100 percent).]

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-calaveras image (i.e., the skull image) used on LG U+’s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers’ background is European. “HASTATI,” the word written on the KBS homepage, means the front line of the Roman troops, hinting that the DS cyberattack could be a starting point: rather than a transient attack, it was a persistent one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) took on characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, DS and SPE, were applied as RCs, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases (i.e., input values), we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages, based on the Latin language system, and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified in the same cluster (Cluster 7), whose hackers used the encoding systems pertinent to the Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers. [Plot: X-axis, criterion of the top N (%): top 1, 3, 5, 10, 15, 20, 25, 30; Y-axis, number of identified hackers (0 to 100); series, ratio of the attack cases (%): 50, 60, 70, 80, 90, 100.]

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs of cases drawn from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the means of random samples drawn from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
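The random-pair baseline described above can be approximated with the Python standard library. The sketch below is illustrative only: the similarity function is supplied by the caller, the use of the population variance is an assumption (the paper does not say which variance it reports), and `random_pair_baseline` is a hypothetical name.

```python
import random
import statistics
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]


def random_pair_baseline(cases: List[Case],
                         similarity: Callable[[Case, Case], float],
                         trials: int = 10_000,
                         seed: int = 0) -> Tuple[float, float]:
    """Draw `trials` random pairs of distinct stored cases and return the
    mean and population variance of their similarity scores."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    scores = []
    for _ in range(trials):
        a, b = rng.sample(cases, 2)  # two different positions in the list
        scores.append(similarity(a, b))
    return statistics.mean(scores), statistics.pvariance(scores)
```

A score obtained for a new case can then be compared against this baseline mean rather than judged by its absolute value.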

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker’s own past cases. That is, the cases used for the similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores obtained from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples from the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

Case name | Retrieved case: DarkSeoul (DS) | Tested case (notifier): Hmei7 | Tested case (notifier): d3b_X | Tested case (notifier): StifLer
Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9
IP address | 203248195178 | 2038623868 | 2031243766 | 77921083
Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr
Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.690 | 0.675 | 0.665
Cluster | - | 0 | 8 | 4

Case name | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case (notifier): Oaddah | Tested case (notifier): MTRiX | Tested case (notifier): EL_MuHaMMeD
Encoding | EUC-KR EUC-CN | GB2312 | GB2312 | GB2312
IP address | 203131222102 | 2031241555 | 20829198 | 2081164534
Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet
Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.615 | 0.615 | 0.600
Cluster | - | 7 | 7 | 7

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the attacks by the similar hackers occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator to understand the flow and core of the crime clearly by laying out multidimensional evidence that is otherwise complicatedly entangled or hidden.
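The per-hacker attack rates used in the follow-up investigation (by year or by gTLD) amount to a grouped percentage. The following sketch shows one way to compute them; the record layout with `hacker`, `year`, and `gtld` keys and the function name `attack_rates` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def attack_rates(cases: List[dict], key: str) -> Dict[str, Dict[Hashable, float]]:
    """Percentage of each hacker's cases per value of `key`
    (for example "year" or "gtld"), rounded to two decimals."""
    per_hacker = defaultdict(Counter)
    for case in cases:
        per_hacker[case["hacker"]][case[key]] += 1
    rates: Dict[str, Dict[Hashable, float]] = {}
    for hacker, counter in per_hacker.items():
        total = sum(counter.values())
        rates[hacker] = {
            value: round(100 * count / total, 2)
            for value, count in counter.items()
        }
    return rates
```

Calling the function once with `key="year"` and once with `key="gtld"` yields the two kinds of rows that a follow-up table of this type would contain.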

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing the case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (Mean = 0.2930, Var = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and distribution of the similarity value between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system. [Histograms: X-axis, similarity score; Y-axis, frequency. Annotations in (b): the highest similarity score is 0.69 for the DarkSeoul case and 0.615 for the Sony Pictures Entertainment case.]

Figure 13: Distribution of the similarity score for randomly selected 100 hackers. [Histogram: X-axis, mean value of the similarity score; Y-axis, frequency.]

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. For example, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as a methodology. The collected raw data of the defaced websites’ resources were sanitized via data parsing and data cleaning processes. Also, based on a large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to the case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker’s attack rate (%).

Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE)
Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98
Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91
Net | 3.40 | 3.20 | - | - | 5.46 | 5.74
Gov | 12.16 | 6.51 | - | - | 1.06 | -

Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE)
2002 | - | - | - | - | 10.74 | -
2003 | - | - | - | - | 89.08 | -
2006 | - | - | - | - | - | -
2007 | 0.09 | - | - | - | 0.18 | -
2008 | - | - | - | - | - | -
2009 | 3.15 | - | - | - | - | 99.57
2010 | 0.09 | - | - | - | - | -
2011 | 0.34 | - | - | - | - | -
2012 | 3.40 | - | - | 100.00 | - | -
2013 | 34.86 | 39.17 | 100.00 | - | - | -
2014 | 58.08 | 59.77 | - | - | - | 0.43
2015 | - | 1.07 | - | - | - | -

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). [Alluvial diagrams with the axes hacker, year, ccTLD, gTLD, and attack. Panel (a): hackers d3b~x, Hmei7, and StifLer; years 2009, 2012, 2013, and 2014. Panel (b): hackers EL_MuHaMMeD, MTRiX, and oaddah; years 2002, 2003, 2009, and 2012.]

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be considered. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and sources may also help to reason about the composition requirements of a crime and narrow the scope of the investigation. Furthermore, the proposed methodology can be expanded into incident information for compatibility and information exchangeability with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITPnet ldquoSyria Egypt crises spur escalation of me cyber at-tacksrdquo 2013 httpwwwitpnet594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack

[47] A McEnery and R Xiao ldquoCharacter encoding in corpusconstructionrdquo in Developing Linguistic Corpora A Guide toGood Practice Oxbow Books Ltd Oxford UK 2005

[48] B Bos T Ccedilelik I Hickson et al ldquoCascading style sheets level2 revision 1 (CSS 21) specificationrdquo W3C Working Draft2005 httpwwww3orgTRCSS21


[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


Table 2: Value and weight for the similarity score by case vector. All values of the similarity score are normalized to lie between 0 and 1.

| Case vector | Weight | Impact | Similarity measure between a new case and existing cases | Value |
|---|---|---|---|---|
| Encoding | 0.5 | High | - | 0 or 1 |
| IP address | 0.2 | Medium | If the same (e.g. 14324816 and 14324816) | 1 |
| | | | If the 1st, 2nd, and 3rd octets are matched (e.g. 14324816 and 14324818) | 0.75 |
| | | | If the 1st and 2nd octets are matched (e.g. 14324816 and 14324844) | 0.5 |
| | | | Only the 1st octet is matched (e.g. 14324816 and 1431324) | 0.25 |
| | | | No common octet (e.g. 14324816 and 1631325) | 0 |
| Domain | 0.15 | Medium | An identical domain | 1 |
| | | | Service name is matched and one of the gTLD and ccTLD is matched | 0.8 |
| | | | gTLD and ccTLD are matched | 0.3 |
| | | | Service name is matched | 0.1 |
| | | | ccTLD is matched | 0.1 |
| | | | gTLD is matched | 0.1 |
| | | | Nonidentical domain | 0 |
| Date | 0.1 | Low | Period of about 6 months back and forth (1 year) | 1 |
| | | | Period of about 18 months back and forth (3 years) | 0.75 |
| | | | Period of about 30 months back and forth (5 years) | 0.5 |
| | | | Period of about 42 months back and forth (7 years) | 0.25 |
| | | | Over a period of about 42 months (over 7 years) | 0 |
| OS | 0.05 | Low | - | 0 or 1 |
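The scoring rules of Table 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the pre-split domain dictionaries, exact string equality for encoding and OS, and the conversion of days to months (30.44 days per month) are illustrative choices, while the weights and step values are taken from Table 2. The example records are hypothetical and use reserved documentation addresses.

```python
from datetime import date

# Weights copied from Table 2; they sum to 1.0.
WEIGHTS = {"encoding": 0.5, "ip": 0.2, "domain": 0.15, "date": 0.1, "os": 0.05}


def ip_similarity(a: str, b: str) -> float:
    """Score 0.25 for every leading octet shared by two dotted IPv4 strings."""
    matched = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        matched += 1
    return matched * 0.25


def date_similarity(a: date, b: date) -> float:
    """Step function over the gap in months (assumed 30.44 days per month)."""
    months = abs((a - b).days) / 30.44
    for limit, value in ((6, 1.0), (18, 0.75), (30, 0.5), (42, 0.25)):
        if months <= limit:
            return value
    return 0.0


def domain_similarity(a: dict, b: dict) -> float:
    """Domains are pre-split dicts with keys 'name', 'gtld', 'cctld' (assumed)."""
    name = a["name"] == b["name"]
    gtld = a["gtld"] is not None and a["gtld"] == b["gtld"]
    cctld = a["cctld"] is not None and a["cctld"] == b["cctld"]
    if name and a["gtld"] == b["gtld"] and a["cctld"] == b["cctld"]:
        return 1.0  # identical domain
    if name and (gtld or cctld):
        return 0.8
    if gtld and cctld:
        return 0.3
    if name or gtld or cctld:
        return 0.1
    return 0.0


def similarity(rc: dict, tc: dict) -> float:
    """Weighted sum over the five case vectors for a retrieved and a tested case."""
    parts = {
        # Exact equality is a simplification; encodings may be grouped in practice.
        "encoding": 1.0 if rc["encoding"] == tc["encoding"] else 0.0,
        "ip": ip_similarity(rc["ip"], tc["ip"]),
        "domain": domain_similarity(rc["domain"], tc["domain"]),
        "date": date_similarity(rc["date"], tc["date"]),
        "os": 1.0 if rc["os"] == tc["os"] else 0.0,
    }
    return sum(WEIGHTS[key] * value for key, value in parts.items())


# Hypothetical cases for illustration only.
rc = {
    "encoding": "windows-1252",
    "ip": "192.0.2.10",
    "domain": {"name": "example", "gtld": "com", "cctld": None},
    "date": date(2013, 3, 20),
    "os": "Windows",
}
tc = dict(rc, ip="192.0.2.99", date=date(2014, 2, 6))
print(round(similarity(rc, tc), 3))
```

Keeping each case vector in its own function makes it easy to swap in a different rule, for example a grouped encoding comparison, without touching the weighted sum.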

Table 3: Characteristics and metadata of several different clusters derived from the clustering processing.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 0 | 7.84 | The group uses Central European languages. They principally attacked the profit organization and Linux-based OS in Western Europe. | JaMaYcKa, Super2li |
| 1 | 8.16 | The group uses Arabic and Cyrillic. They principally attacked the organization that manages the network and Linux-based and Unix-based OS. Their attack region is distributed throughout Southern Europe, South America, Eastern Europe, and Southeast Asia. | BI0S |
| 2 | 10.36 | The group uses Central European languages. They principally attacked the organization that manages the network and nonprofit organizations in Western Europe. | JaMaYcKa |
| 3 | 9.33 | The group uses Central European languages. They principally attacked the profit organization and Windows-based OS in Western Europe. | 1923Turk |
| 4 | 25.36 | The group uses Arabic and Chinese. They principally attacked the profit organization and Windows-based OS in Western Europe. | EL_MuHaMMeD, federal-atack.org |
| 5 | 1.73 | The group uses Central European languages. They principally attacked the profit organization and Unix-based OS in Southern Europe and Eastern Europe. | d3b~X, SuSKuN |
| 6 | 5.24 | The group uses Central European languages. They principally attacked the profit organization, the educational institution, the government and state agencies, and also Windows-based OS in East Asia. | 1923Turk |


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedures unfolded in the following four steps and are depicted in detail in Figure 7, where “K” represents all hackers within the database.

Table 3: Continued.

| Cluster number | Ratio (%) | Description | Representative hacker (group) |
|---|---|---|---|
| 7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked the profit organization and Linux-based OS in Western Europe. | Rya, iskorpitx |
| 8 | 9.11 | The group uses Central European languages. They principally attacked the profit organization and Windows-based OS in Western Europe. | 1923Turk |
| 9 | 3.63 | The group uses Central European languages. They principally attacked the profit organization and Linux-based OS in South America and Eastern Europe. | Hmei7 |
| 10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack target is mostly the educational institution and the government and state agencies. | BHS, F4keLive |
| 11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked the profit organization and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre |

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with the features encoding, gTLD, ccTLD, and OS and their corresponding share (legend on the right side). Panels: Clustering 00 through Clustering 11. Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Window, Linux, Unix, MacOS).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where “m” denotes the past cases that are within the defined scope concerning a randomly selected hacker “k”.

(i) Step 1: selection. The measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2: case labelling. We retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each of these cases with the hacker name.

(iii) Step 3: case extraction. We selected the most recent case among the cases labelled in Step 2 as an input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4: scoring. The similarity scores were sorted in descending order, based on the value and the weight for the similarity score by the case vector (see Table 2). Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$

First, we randomly picked 100 hackers from the collected dataset (i.e., cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker’s name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.
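The evaluation in Steps 3 and 4 can be sketched as follows. This is one possible reading of equations (2) and (3), not the authors' code: the function names are hypothetical, the top N scope is assumed to be N percent of all compared cases, and a hacker is counted as identified when the share of his or her past cases inside that scope reaches a chosen threshold for the ratio R.

```python
import math


def ratio_in_scope(scored, hacker, n_percent):
    """Percentage of a hacker's cases that fall inside the top N percent.

    scored: list of (hacker_name, similarity_score), one entry per tested case.
    """
    # Step 4: cases with a zero score are not listed; the rest are sorted descending.
    ranked = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
    # Assumption: the scope size is N percent of all compared cases.
    cutoff = math.ceil(len(scored) * n_percent / 100)
    in_scope = sum(1 for name, _ in ranked[:cutoff] if name == hacker)
    total = sum(1 for name, _ in scored if name == hacker)
    return 100.0 * in_scope / total if total else 0.0


def identified(scored, hacker, n_percent, r_threshold):
    """True if at least r_threshold percent of the hacker's cases are in scope."""
    return ratio_in_scope(scored, hacker, n_percent) >= r_threshold


# Hypothetical scoring list for three hackers A, B, and C.
scored = [
    ("A", 0.9), ("A", 0.8), ("B", 0.7), ("C", 0.6), ("A", 0.5),
    ("B", 0.4), ("C", 0.3), ("A", 0.2), ("B", 0.1), ("C", 0.0),
]
print(ratio_in_scope(scored, "A", 30), identified(scored, "A", 30, 50))
```

Sweeping `n_percent` over the eight top N criteria and `r_threshold` over the six ratio criteria, and counting the identified hackers for each combination, gives the kind of grid summarized in Figure 10.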

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker’s past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were expressed as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., “top N percent”, was based on the result derived from the similarity measurement. Attack cases were sorted in order of high similarity score, and the highest-scoring cases therefore fell within the range of the top N scope (see Figure 9). Also, in the case of the hacking case ratio of a randomly

Figure 7: The developed testing procedures from Step 1 to Step 4. The diagram shows Step 1 (selection of 100 random hackers from the database), Step 2 (case labelling per hacker), Step 3 (case extraction of the most recent case, compared against the cases-centric DB with the fields hacker name, date, encoding, IP address, domain, OS, and score), and Step 4 (scoring with $\sum_{cv} [\mathrm{Distance}(RC_{cv}, TC_{cv}) \times \mathrm{Weight}_{cv}]$).


selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, even if hacking cases were caused by the same hacker, it is impossible to obtain a high similarity score for all cases of that hacker, because hackers or hacking groups that attacked only the same or similar objects were rare. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hackers or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the top three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the

Figure 8: The number of past website defacement attack cases for each hacker (X-axis: hacker, 0 to 100; Y-axis: number of cases, 0 to 5000).

Figure 9: Scoring step on the top N scope (1~30%) and the ratio R (50~100%).


Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in LG U+’s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers’ background is European. “HASTATI”, the word written on the KBS homepage, means the forefront line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull image, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases serving as input values, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in an analysis report of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the top three most similar cases according to the similarity score (see the right side of Table 4). The notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. The notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (X-axis: criterion of the top N (%), with top 1, 3, 5, 10, 15, 20, 25, and 30; Y-axis: number of identified hackers, 0 to 100; series: ratio of the attack cases (%), with 50, 60, 70, 80, 90, and 100).


into the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs drawn from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We took the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
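The random-pair baseline described above can be sketched as follows. The sketch assumes a generic `similarity` callable and uses a toy stand-in (`field_overlap`) rather than the weighted measure of equation (1); the function names, the seed, and the toy cases are all hypothetical.

```python
import random
import statistics


def random_pair_scores(cases, similarity, n_pairs=10_000, seed=0):
    """Score n_pairs randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)  # fixed seed so the baseline is repeatable
    scores = []
    for _ in range(n_pairs):
        a, b = rng.sample(cases, 2)  # two different cases, drawn without replacement
        scores.append(similarity(a, b))
    return scores


def field_overlap(a, b):
    """Toy similarity: share of fields with equal values (stand-in for equation (1))."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical cases described only by (encoding, OS).
cases = [
    (encoding, os_name)
    for encoding in ("utf-8", "gb2312", "windows-1252")
    for os_name in ("Windows", "Linux")
]
scores = random_pair_scores(cases, field_overlap)
print(round(statistics.mean(scores), 4), round(statistics.pvariance(scores), 4))
```

Comparing the top score of a real case against the mean and variance of such a baseline indicates whether that score stands out from what two unrelated cases would typically produce.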

Figure 13 shows the distribution of the similarity score for the randomly selected 100 hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker’s own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by the hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in DS and SPE.

Figure 11: A snippet of website defacement cases comparing examples from the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| Case vector | Retrieved case: DarkSeoul (DS) | Tested case: Hmei7 | Tested case: d3b_X | Tested case: StifLer |
|---|---|---|---|---|
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083 |
| Domain | gyunggionnet21com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| Case vector | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case: Oaddah | Tested case: MTRiX | Tested case: EL_MuHaMMeD |
|---|---|---|---|---|
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34 |
| Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalromcom | digitalairstrip.net |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type). The case names of the tested cases are the notifier names.


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related context. Hence, we conducted follow-up investigations of the top three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers’ attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator to understand the flow and core of the crime clearly by laying out multidimensional evidence that is entangled or hidden and therefore hard to present.
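The attack rates used in the follow-up investigation (the share of a hacker’s cases per year or per top-level domain) can be computed with a short aggregation. The following is a minimal sketch with hypothetical field names and invented example records; it is not taken from the authors’ tooling.

```python
from collections import Counter


def attack_rate(cases, hacker, field):
    """Percentage of one hacker's cases for each value of a case-vector field."""
    values = [case[field] for case in cases if case["hacker"] == hacker]
    counts = Counter(values)
    total = len(values)
    # An unknown hacker yields an empty dict, so no division by zero occurs.
    return {value: round(100.0 * n / total, 2) for value, n in counts.items()}


# Hypothetical labelled cases for illustration only.
cases = [
    {"hacker": "h1", "year": 2013, "gtld": "com"},
    {"hacker": "h1", "year": 2014, "gtld": "com"},
    {"hacker": "h1", "year": 2014, "gtld": "gov"},
    {"hacker": "h1", "year": 2014, "gtld": "com"},
    {"hacker": "h2", "year": 2009, "gtld": "net"},
]
print(attack_rate(cases, "h1", "year"))
print(attack_rate(cases, "h1", "gtld"))
```

The same per-field counts, grouped by hacker, year, ccTLD, and gTLD, are the kind of input an alluvial or Sankey plotting tool would need to draw the flows.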

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may be degraded if the properties describing the case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, var = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. In all panels, the X-axis is the similarity score and the Y-axis is the frequency.

Figure 13: Distribution of the similarity score for randomly selected 100 hackers (X-axis: mean value of the similarity score; Y-axis: frequency).


improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. Therefore, we focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. For example, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using the CBR as a methodology. The collected raw data of the defaced websites’ resources were sanitized via data parsing and data cleaning processes. Also, based on a large real dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker’s attack rate.

| Domain | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Each alluvial diagram links the columns hacker, year, ccTLD, gTLD, and attack.


and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed is an important factor in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with various other data sources should be reviewed. In other words, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the composition requirements of a crime and narrow the scope of the investigation. Furthermore, the proposed methodology can be extended to express incident information in formats compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research, we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on api call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on cbr and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (cpted): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., "Profiling cyber attackers using case-based reasoning," in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., "Mining criminal networks from unstructured text documents," Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, "Cyber profiling using log analysis and k-means clustering," International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, "Crime pattern detection using data mining," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, "Syria, Egypt crises spur escalation of me cyber attacks," 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, "Character encoding in corpus construction," in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., "Cascading style sheets level 2 revision 1 (CSS 2.1) specification," W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.


[49] W. Stuckey, "Massive sony breach sheds light on murky hacker universe," 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, "Sony pictures malware tied to Seoul, "Shamoon" cyber-attacks," 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, "Sony hack: signs point to North Korea," 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, "Case-based reasoning: an introduction," Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, "Mapping change in large networks," PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, "STIX/TAXII standards," 2017-2018, https://oasis-open.github.io/cti-documentation.


study, rather than evaluating the accuracy of the similarity mechanism, we tested the overall performance of the proposed methodology with the ratio of correctly identified hackers. The developed testing procedures unfolded in the following four steps and are depicted in detail in Figure 7, where "K" represents all hackers within the database.

Table 3: Continued.

Cluster number | Ratio (%) | Description | Representative hacker (group)
7 | 15.93 | The group uses Arabic, Chinese, and Turkish. They principally attacked profit organizations and Linux-based OS in Western Europe. | Rya, iskorpitx
8 | 9.11 | The group uses Central European languages. They principally attacked profit organizations and Windows-based OS in Western Europe. | 1923Turk
9 | 3.63 | The group uses Central European languages. They principally attacked profit organizations and Linux-based OS in South America and Eastern Europe. | Hmei7
10 | 1.39 | The group uses Central European languages. They principally attacked Windows-based OS in South America and Southeast Asia. Their attack targets are mostly educational institutions and government and state agencies. | BHS, F4keLive
11 | 1.92 | The group uses Arabic and Central European languages. They principally attacked profit organizations and Windows-based OS in Southern Europe. | EL_MuHaMMeD, linuXploit_cre

Figure 6: Visualization of the 12 different clusters (00 through 11) in our data, annotated with various features (encoding, gTLD, ccTLD, and OS) and their corresponding share (legend on the right side). Legend: encoding (West Europe, Turkish, Central Europe, Arabic, Cyrillic, Chinese); gTLD (com, net, org, gov, edu, mil); ccTLD (Western Europe, East Asia, Southern Europe, South America, Eastern Europe, Southeast Asia); OS (Windows, Linux, Unix, MacOS).


$$R_k = \frac{\mathrm{Count}(\mathrm{Cases}_{m,k})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},k})}, \tag{3}$$

where "m" denotes the past cases that are within the defined scope for a randomly selected hacker "k".

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each previous attack case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the cases were sorted in descending order of the similarity score, which depends on the value and the weight of each case vector (see Table 2). Whenever the similarity value was 0, the case was not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many past cases of a hacker fell within the N scope of the scoring list of Step 4; that is, regarding the ratio of the attack cases by each hacker, we checked whether the cases were included in the top N scope (N scope from the top 1 percent to the top 30 percent).

$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}(\mathrm{Cases}_{\mathrm{Scope},K})}{\mathrm{Count}(\mathrm{Cases}_{\mathrm{all},K})} \times 100. \tag{2}$$
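One possible reading of equations (2) and (3) can be sketched in a few lines of Python. The sketch assumes that the top N scope is simply the first N percent of the ranked scoring list and that the ratio R is the fraction of one hacker's scored cases that land inside that scope; the function name `share_in_top_n` and the toy labels are illustrative and do not come from the paper.

```python
import math


def share_in_top_n(ranked_names, hacker, n_percent):
    """Fraction of `hacker`'s scored cases that rank inside the top n_percent.

    ranked_names: hacker label of every scored case, sorted by descending
    similarity score, with zero-score cases already dropped (Step 4).
    """
    total = ranked_names.count(hacker)
    if total == 0:
        return 0.0
    # Assumed top N scope: the first n_percent of the ranked list (at least one case).
    cutoff = max(1, math.ceil(len(ranked_names) * n_percent / 100))
    in_scope = ranked_names[:cutoff].count(hacker)
    return in_scope / total


# Toy ranking of ten scored cases (labels are placeholders, not real results).
ranking = ["A", "A", "B", "A", "C", "B", "A", "C", "C", "B"]
print(share_in_top_n(ranking, "A", 30))  # 2 of A's 4 cases are in the top 3 -> 0.5
```

Under this reading, a hacker counts as identified for a given criterion when the returned fraction reaches the chosen ratio R (for example, 0.5 for the 50 percent criterion).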

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result (i.e., the sorted hacker's past cases with a high similarity score) stemming from the similarity measurement was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all the past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were categorized as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and therefore the cases were within the range of the top N scope (see Figure 9). Also, in the case of the hacking case ratio of a randomly selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).

Figure 7: The developed testing procedures from Step 1 to Step 4. Step 1 (selection) draws 100 random hackers from the database; Step 2 (case labelling) labels each hacker's past cases; Step 3 (case extraction) takes the most recent case as the retrieved case and compares it with the cases-centric DB; Step 4 (scoring) lists the cases by hacker name, date, encoding, IP address, domain, OS, and score, where the score sums Distance(RC_cv, TC_cv) × Weight_cv over the case vector cv.

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, including the eight criterion factors (%), and of the ratio R, including the six criterion factors (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope among the randomly selected 100 hackers. On the other hand, the lower the ratio R and the wider the N scope, the higher the number of identified hackers in the top N scope. Consequently, even when hacking cases are caused by the same hacker, hackers or hacking groups that attack only the same or similar targets are rare, so it is impossible to obtain a high similarity score for all cases of a hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.
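The counting behind a chart like Figure 10 can be sketched as a small grid evaluation. The eight top N criteria and six ratio R criteria below are the ones named in the text; the input layout, the function name `count_identified`, and the two toy hackers are assumptions made only for illustration.

```python
N_SCOPES = [1, 3, 5, 10, 15, 20, 25, 30]   # top N percent criteria
R_LEVELS = [50, 60, 70, 80, 90, 100]       # required share of past cases (%)


def count_identified(in_scope_share, n_scopes=N_SCOPES, r_levels=R_LEVELS):
    """Count hackers whose in-scope share reaches each (N, R) criterion.

    in_scope_share: {hacker: {n_percent: percent of that hacker's past cases
    ranked inside the top n_percent}} (hypothetical input layout).
    """
    return {
        (n, r): sum(1 for shares in in_scope_share.values() if shares[n] >= r)
        for n in n_scopes
        for r in r_levels
    }


# Two made-up hackers; the percentages are placeholders, not measured values.
toy = {
    "hacker_a": {1: 10, 3: 20, 5: 40, 10: 55, 15: 70, 20: 85, 25: 95, 30: 100},
    "hacker_b": {1: 0, 3: 5, 5: 10, 10: 20, 15: 30, 20: 50, 25: 60, 30: 75},
}
counts = count_identified(toy)
print(counts[(30, 50)], counts[(10, 50)], counts[(1, 100)])  # 2 1 0
```

The toy output follows the trend described above: widening the N scope or lowering the ratio R can only keep or increase the count for a given hacker set.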

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. With such a reduction in the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the top three hackers or hacking groups most similar to them and identified their characteristics through the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the 3rd largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used on LG U+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI" was the word written on the KBS homepage, meaning the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

Figure 8: The number of website defacement attack cases in the past of each hacker (X-axis: hacker, 0 to 100; Y-axis: number of cases, 0 to 5000).

Figure 9: Scoring step on the top N scope (1~30%) and the ratio R (50~100%).

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull images, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). Two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases, i.e., as input values, we considered a direct comparison between the DS and SPE cases for the similarity score not appropriate [52].
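As a rough illustration of how a retrieved case (RC) might be scored against stored cases (TCs), the following Python sketch sums a per-feature match term multiplied by a per-feature weight and drops zero scores, as in Step 4. The weights, the exact-match term, and the field names are assumptions made here for illustration; they are not the values of Table 2 or the exact form of equation (1).

```python
# Hypothetical weights per case-vector feature; Table 2 values are not reproduced.
WEIGHTS = {"encoding": 0.4, "gtld": 0.2, "os": 0.2, "year": 0.2}


def feature_match(a, b):
    """Assumed per-feature term: 1.0 on exact match, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def similarity(rc, tc, weights=WEIGHTS):
    """Weighted sum over the case vector of match term times weight."""
    return sum(feature_match(rc[f], tc[f]) * w for f, w in weights.items())


def rank_cases(rc, case_base, weights=WEIGHTS):
    """Score every stored case, drop zero scores, sort by descending score."""
    scored = [(similarity(rc, tc, weights), tc) for tc in case_base]
    scored = [(s, tc) for s, tc in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy records with placeholder notifiers (not taken from the dataset).
rc = {"encoding": "Windows-1252", "gtld": "com", "os": "Windows", "year": 2013}
case_base = [
    {"notifier": "n2", "encoding": "ISO-8859-9", "gtld": "com", "os": "Windows", "year": 2013},
    {"notifier": "n3", "encoding": "GB2312", "gtld": "net", "os": "Linux", "year": 2009},
    {"notifier": "n1", "encoding": "Windows-1252", "gtld": "com", "os": "Windows", "year": 2014},
]
ranked = rank_cases(rc, case_base)
print([tc["notifier"] for _, tc in ranked])  # ['n1', 'n2']
```

A real system would likely use finer per-feature measures (for example, partial matches on IP prefixes or dates) in place of the exact-match term.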

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the top three most similar cases according to the similarity score (see the right side of Table 4). The cases of the notifiers Hmei7 and d3b_X belonged to Clusters 0 and 8, which were two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. The notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers (X-axis: criterion of the top N (%), with top 1, 3, 5, 10, 15, 20, 25, and 30; Y-axis: number of identified hackers, 0 to 100; series: ratio of the attack cases (%), with 50, 60, 70, 80, 90, and 100).

Next, to ensure the objectivity of the similarity score in the DS and SPE case study, we computed the similarity score of randomly selected pairs drawn from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score by relying on the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. To build the distribution, the similarity score of two randomly selected website defacement cases was calculated repeatedly, 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and all measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and all measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
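The random-pair baseline can be sketched as a simple sampling loop: draw two distinct cases at random, score them, and repeat 10,000 times. In the Python sketch below, the scoring function `toy_score` and the synthetic case tuples are placeholders standing in for the paper's similarity measure and case base, so the printed numbers carry no empirical meaning.

```python
import random
import statistics


def random_pair_baseline(cases, score, trials=10_000, seed=0):
    """Mean and variance of `score` over randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)
    scores = [score(*rng.sample(cases, 2)) for _ in range(trials)]
    return statistics.mean(scores), statistics.pvariance(scores)


def toy_score(a, b):
    """Placeholder similarity: share of fields on which two cases agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Synthetic (encoding, gTLD, OS) tuples standing in for the case base.
gen = random.Random(1)
toy_cases = [
    (gen.choice(["Windows-1252", "GB2312", "ISO-8859-9"]),
     gen.choice(["com", "net", "org"]),
     gen.choice(["Windows", "Linux"]))
    for _ in range(500)
]
mean, var = random_pair_baseline(toy_cases, toy_score)
print(f"baseline mean={mean:.3f}, variance={var:.3f}")
```

A score for a new case can then be judged against this baseline: a top score well above the random-pair mean suggests the match is unlikely to be a chance resemblance.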

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score among the hacker's own past cases. That is, the cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples from the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case, and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

Field | Retrieved case: DarkSeoul (DS) | Tested case (notifier): Hmei7 | Tested case (notifier): d3b_X | Tested case (notifier): StifLer
Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9
IP address | 203.248.195.178 | 203.86.238.68 | 203.124.37.66 | 77921083
Domain | gyunggionnet21.com | http://www.garycheng.com | health.ajk.gov.pk | yapikimyasallari.com.tr
Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.690 | 0.675 | 0.665
Cluster | - | 0 | 8 | 4

Field | Retrieved case: Sony Pictures Entertainment (SPE) | Tested case (notifier): Oaddah | Tested case (notifier): MTRiX | Tested case (notifier): EL_MuHaMMeD
Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312
IP address | 203.131.222.102 | 2031241555 | 20829198 | 208.116.45.34
Domain | http://www.sonypicturesstockfootage.com | http://www.hzkcgg.com | daxdigitalrom.com | digitalairstrip.net
Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 June 2009
OS | Windows | Windows | Windows | Windows
Similarity | - | 0.615 | 0.615 | 0.600
Cluster | - | 7 | 7 | 7

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the top three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure represents the degree of attack. This network visualization method could help an investigator understand the flow and core of the crime clearly by laying out multidimensional evidence that is intricately entangled or hidden and therefore difficult to present.
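The data preparation behind an alluvial diagram amounts to counting how many cases follow each path through the chosen axes; the count becomes the thickness of the flow. The Python sketch below assumes a simple record layout with `hacker`, `year`, `cctld`, and `gtld` fields and uses made-up records, so both the field names and the values are illustrative only.

```python
from collections import Counter


def alluvial_flows(cases, axes=("hacker", "year", "cctld", "gtld")):
    """Count cases per path through the axes; the count is the band thickness."""
    return Counter(tuple(case[a] for a in axes) for case in cases)


def links_between(flows, left, right):
    """Collapse full paths into links between two axes, given by index."""
    links = Counter()
    for path, n in flows.items():
        links[(path[left], path[right])] += n
    return links


# Made-up records; they do not reproduce the investigated hackers' cases.
toy_cases = [
    {"hacker": "h1", "year": 2013, "cctld": "kr", "gtld": "com"},
    {"hacker": "h1", "year": 2013, "cctld": "kr", "gtld": "com"},
    {"hacker": "h1", "year": 2014, "cctld": "de", "gtld": "com"},
    {"hacker": "h2", "year": 2014, "cctld": "de", "gtld": "org"},
]
flows = alluvial_flows(toy_cases)
year_links = links_between(flows, 0, 1)   # hacker -> year
print(year_links[("h1", 2013)], year_links[("h1", 2014)], year_links[("h2", 2014)])  # 2 1 1
```

The resulting link counts can be passed to any Sankey or alluvial plotting tool; keeping the aggregation separate from the plotting makes it easy to swap the axes (for example, year to gTLD) without touching the case data.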

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may be degraded if the properties describing the case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and distribution of the similarity value between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. Axes: similarity score (X) and frequency (Y).

Figure 13: Distribution of the similarity score for randomly selected 100 hackers (X-axis: mean value of the similarity score; Y-axis: frequency).

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept, and we focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents are included in our study. If a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and clustering, using CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized through data parsing and data cleaning. Based on this large real-world dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Reducing the amount of data and the time to be analysed is also an important factor in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

            DS case                          SPE case
Domain      Hmei7     d3b_X     StifLer      Oaddah    MTRiX     EL_MuHaMMeD
Com         78.32     85.81     100.00       100.00    86.27     82.98
Edu         1.62      0.96      -            -         1.76      1.91
Net         3.40      3.20      -            -         5.46      5.74
Gov         12.16     6.51      -            -         1.06      -
Year        Hmei7     d3b_X     StifLer      Oaddah    MTRiX     EL_MuHaMMeD
2002        -         -         -            -         10.74     -
2003        -         -         -            -         89.08     -
2006        -         -         -            -         -         -
2007        0.09      -         -            -         0.18      -
2008        -         -         -            -         -         -
2009        3.15      -         -            -         -         99.57
2010        0.09      -         -            -         -         -
2011        0.34      -         -            -         -         -
2012        3.40      -         -            100.00    -         -
2013        34.86     39.17     100.00       -         -         -
2014        58.08     59.77     -            -         -         0.43
2015        -         1.07      -            -         -         -

[Figure 14 diagrams: alluvial diagrams with the axes Hacker, Year, ccTLD, gTLD, and Attack (No/Yes). (a) Hackers: d3b~x, Hmei7, StifLer; years: 2009, 2012, 2013, 2014; ccTLDs: Australia, Brazil, France, Germany, Indonesia, Italy, Korea, Netherlands, New Zealand, Poland, Russia, Thailand, Turkey, Unknown; gTLDs: com, gov, net, org, Unknown. (b) Hackers: EL_MuHaMMeD, MTRiX, oaddah; years: 2002, 2003, 2009, 2012; ccTLDs: Brazil, Canada, Denmark, France, Greece, Hong Kong, Italy, Unknown; gTLDs: com, net, org, Unknown.]

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b).

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5 (Limitations and Discussion), a cross-data analysis with various other data sources should be considered. In other words, additional online or offline information acquired through human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the constituent elements of a crime and narrow the scope of the investigation. Furthermore, the proposed methodology can be extended to express incident information in a form that is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of information-sharing systems [54].

The web resources gathered from the zone-h.org site contain features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) and image and mp3 files. Although these features are present for only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systematize the hackers' intents more precisely using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on API call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on CBR and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (CPTED): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., "Profiling cyber attackers using case-based reasoning," in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., "Mining criminal networks from unstructured text documents," Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, "Cyber profiling using log analysis and k-means clustering," International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, "Crime pattern detection using data mining," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, "Syria, Egypt crises spur escalation of ME cyber attacks," 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, "Character encoding in corpus construction," in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., "Cascading style sheets level 2 revision 1 (CSS 2.1) specification," W3C Working Draft, 2005, http://www.w3.org/TR/CSS21/.

[49] W. Stuckey, "Massive sony breach sheds light on murky hacker universe," 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, "Sony pictures malware tied to Seoul, 'Shamoon' cyber-attacks," 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks/.

[51] J. Pagliery, "Sony hack: signs point to North Korea," 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, "Case-based reasoning: an introduction," Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, "Mapping change in large networks," PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, "STIX/TAXII standards," 2017-2018, https://oasis-open.github.io/cti-documentation/.


$$R_{k} = \frac{\mathrm{Count}\left(\mathrm{Cases}_{m}^{k}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all}}^{k}\right)}, \qquad (3)$$

where "m" denotes the past cases that fall within the defined scope for a randomly selected hacker "k".

(i) Step 1, selection: the measurement objects, i.e., 100 hackers, were randomly selected from the database.

(ii) Step 2, case labelling: we retrieved all previous attack cases conducted by the 100 hackers randomly selected in Step 1 and then labelled each case with the hacker's name.

(iii) Step 3, case extraction: we selected the most recent case among the cases labelled in Step 2 as the input value. The similarity score was then estimated by comparing the most recent case (i.e., RC, one of the retrieved cases) with all other cases in the database (i.e., TCs, all cases in the cases-centric DB).

(iv) Step 4, scoring: the similarity scores, computed from the values and weights of the case vector (see Table 2), were sorted in descending order. Cases with a similarity value of 0 were not displayed on the scoring list of Step 4. The feasibility of the proposed methodology was evaluated based on how many of a hacker's past cases fell within the N scope of the scoring list; that is, for each hacker, we checked what ratio of the attack cases was included in the top N scope (from the top 1 percent to the top 30 percent).

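$$N_{\mathrm{Scope}} = \frac{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{Scope}}^{K}\right)}{\mathrm{Count}\left(\mathrm{Cases}_{\mathrm{all}}^{K}\right)} \times 100. \qquad (2)$$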
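The Step 4 check can be sketched as follows: given a scoring list sorted by similarity, compute what share of one hacker's past cases lands inside the top N percent. This is a minimal sketch under stated assumptions, not the authors' code: the function name `ratio_in_top_n`, the `(hacker, score)` tuple layout, the rounding of the cutoff with `math.ceil`, and the sample list are all hypothetical choices made for illustration.

```python
import math


def ratio_in_top_n(ranked, hacker, n_percent, total_cases):
    """Percent of `hacker`'s past cases that fall in the top N percent.

    ranked      -- list of (hacker_name, score), sorted by score descending,
                   with zero-score cases already dropped (as in Step 4)
    total_cases -- number of all past cases of `hacker` (the denominator)
    """
    # Assumption: the top-N cutoff is rounded up to a whole number of cases.
    cutoff = math.ceil(len(ranked) * n_percent / 100)
    in_scope = sum(1 for name, _score in ranked[:cutoff] if name == hacker)
    return 100.0 * in_scope / total_cases if total_cases else 0.0


# Hypothetical scoring list for illustration only.
ranked = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("C", 0.6), ("B", 0.5),
          ("A", 0.4), ("C", 0.3), ("B", 0.2), ("C", 0.15), ("A", 0.1)]

share = ratio_in_top_n(ranked, "A", n_percent=30, total_cases=4)
print(share)
```

With ten ranked cases and N = 30, the cutoff keeps the first three entries; two of hacker A's four cases are among them.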

First, we randomly picked 100 hackers from the collected dataset (i.e., the cases-centric DB); thereafter, we retrieved and extracted all past attack cases for each hacker. The extracted past cases were labelled with the hacker's name. Figure 8 depicts the number of past website defacement attack cases for each hacker. In Steps 3 and 4, the similarity between a retrieved case (i.e., the most recent case) and all other stored website defacement cases was measured.

Specifically, we checked whether the result of the similarity measurement (i.e., the hacker's past cases sorted by high similarity score) was included in the top N scope. This process was meant to check, based on the similarity score, how many past attack cases of the 100 randomly picked hackers were included in the defined top N scope. To this end, we divided the top N scope into eight criterion factors, from the top 1 percent to the top 30 percent, and the ratio R of all past attack cases for each hacker into six criterion factors, from 50 percent to 100 percent (i.e., at 10 percent intervals). As illustrated in equations (2) and (3), the N scope and the ratio R were expressed as ratios according to the defined measure rule. More specifically, the criterion of the top N scope, i.e., "top N percent," was based on the result derived from the similarity measurement. Attack cases were sorted in descending order of similarity score, and the cases therefore fell within the range of the top N scope (see Figure 9). Likewise, for the hacking case ratio of a randomly selected hacker, some portion of the hacker's past attack cases (i.e., the ratio R) fell within the defined N scope (see Figure 9).

[Figure 7 diagram: Step 1, selection (100 hackers randomly selected from the database, e.g., TheBuGz, Hmei7, 3xp1r3, ..., S3cure, drm1st3r, Lulz53c); Step 2, case labelling; Step 3, case extraction (a retrieved case, i.e., the most recent case, compared against the cases-centric DB); Step 4, scoring. Case columns: Hacker name, Date, Encoding, IP address, Domain, OS, Score. The score is the sum over the case vector (cv) of Distance(RC_cv, TC_cv) × Weight_cv.]

Figure 7: The developed testing procedures from Step 1 to Step 4.

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, comprising the eight criterion factors (%), and of the ratio R, comprising the six criterion factors (%). The Y-axis presents the number of hackers identified in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of hackers identified in the top N scope among the 100 randomly selected hackers. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, even when hacking cases were caused by the same hacker, hackers or hacking groups that attacked only the same or similar targets were rare, so it is impossible to obtain a high similarity score for all cases of a hacker. Nevertheless, the results demonstrate that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.
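The grid behind a plot such as Figure 10 can be sketched by counting, for every (top N, ratio R) pair, how many hackers reach at least R percent of their past cases inside the top N percent. The eight N criteria and six R criteria below are the ones named in the text; everything else (the names `count_identified` and `identification_grid`, the `coverage` layout, and the three toy hackers) is a hypothetical illustration rather than the authors' implementation.

```python
TOP_N = [1, 3, 5, 10, 15, 20, 25, 30]   # eight top-N criteria (percent)
RATIO_R = [50, 60, 70, 80, 90, 100]     # six ratio-R criteria (percent)


def count_identified(coverage, n, r):
    """Number of hackers whose share of cases inside top-n percent is >= r.

    coverage -- {hacker: {n_percent: share_percent}}
    """
    return sum(1 for shares in coverage.values() if shares[n] >= r)


def identification_grid(coverage):
    """Counts for every (top N, ratio R) combination."""
    return {(n, r): count_identified(coverage, n, r)
            for n in TOP_N for r in RATIO_R}


# Hypothetical per-hacker shares; shares grow as the N scope widens.
coverage = {
    "h1": dict(zip(TOP_N, [10, 20, 40, 60, 80, 90, 100, 100])),
    "h2": dict(zip(TOP_N, [0, 0, 10, 30, 50, 55, 60, 70])),
    "h3": dict(zip(TOP_N, [100] * 8)),
}

grid = identification_grid(coverage)
print(grid[(1, 100)], grid[(10, 60)], grid[(30, 50)])
```

The toy numbers reproduce the qualitative trend described above: the strictest setting (top 1 percent, R = 100) identifies the fewest hackers, and the loosest (top 30 percent, R = 50) the most.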

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. By reducing the number of candidate-related cases, the outcomes of our similarity mechanism are highly valuable in terms of reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the top three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and clustering.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third-largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-Calaveras image (i.e., a skull image) used on LG U+'s defaced website had appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI," the word written on the KBS homepage, refers to the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Even though we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows screenshots of the defaced websites at that time.

[Figure 9 diagram: Step 4, scoring, for hacker 1 (TheBuGz): cases Case_1 ... Case_m with the columns Hacker name, Date, Encoding, IP address, Domain, OS, Score; top N scope (1~30%) and ratio R (50~100%).]

Figure 9: Scoring step on the top N scope and the ratio R.

[Figure 8 plot: number of cases (0 to 5000) versus hacker (0 to 100).]

Figure 8: The number of past website defacement attack cases of each hacker.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull image, and misspellings, the image used in the SPE case (Figure 11(c)) shared characteristics with the image used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

To evaluate the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, DS and SPE, were applied as RCs, and the similarity score for each of them was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both serve as target cases, i.e., as input values, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].
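The similarity score between a retrieved case (RC) and a tested case (TC) is described as a weighted sum over the elements of the case vector. The sketch below illustrates that shape only. The weights, the per-feature scoring rule in `feature_match` (exact match, plus shared leading octets for IP addresses), the field names, and the example values are all assumptions made here for illustration; the paper's actual weights and distance definitions (its Table 2 and equation (1)) are not reproduced.

```python
# Hypothetical weights that sum to 1; not the values used in the paper.
WEIGHTS = {"encoding": 0.3, "ip": 0.2, "domain": 0.2, "os": 0.1, "date": 0.2}


def feature_match(name, rc_value, tc_value):
    """Assumed per-feature score in [0, 1]; 1.0 means identical values."""
    if name == "ip":
        # Fraction of leading octets shared by two dotted IPv4 strings.
        shared = 0
        for a, b in zip(rc_value.split("."), tc_value.split(".")):
            if a != b:
                break
            shared += 1
        return shared / 4
    return 1.0 if rc_value == tc_value else 0.0


def similarity(rc, tc, weights=WEIGHTS):
    """Weighted sum of per-feature scores over the case vector."""
    return sum(feature_match(k, rc[k], tc[k]) * w for k, w in weights.items())


# Hypothetical cases (documentation-range IP addresses, invented values).
rc = {"encoding": "windows-1252", "ip": "203.0.113.10",
      "domain": "com", "os": "Windows", "date": 2013}
tc = {"encoding": "windows-1252", "ip": "203.0.113.99",
      "domain": "com", "os": "Windows", "date": 2014}

score = similarity(rc, tc)
print(round(score, 3))
```

Sorting all TCs by this score in descending order would give a scoring list of the kind used in Step 4.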

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We summarize further characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the top three most similar cases from the similarity scores (see the right side of Table 4). The notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. The notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), whose hackers used encoding systems pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

[Figure 10 plot: number of identified hackers (0 to 100) versus criterion of the top N (%): Top 1, Top 3, Top 5, Top 10, Top 15, Top 20, Top 25, Top 30; one series per ratio of the attack cases (%): 50, 60, 70, 80, 90, 100.]

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers.

Next, to ensure the objectivity of the similarity scores obtained in the DS and SPE case study, we computed the similarity score of randomly selected pairs drawn from all cases. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. To build the distribution, the similarity score of two randomly selected website defacement cases was calculated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases were concentrated at similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases were concentrated at similarity scores (X-axis) of 0.0 to 0.2.
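The random-pair baseline can be sketched as a small sampling loop: draw two distinct cases at random, score them, repeat 10,000 times, and summarize the scores. This is an illustrative sketch only. The stand-in similarity `field_overlap`, the function name `random_pair_baseline`, the fixed seed, and the toy case list are assumptions; the paper's own similarity measure and dataset are not reproduced here.

```python
import random
import statistics


def field_overlap(a, b):
    """Stand-in similarity: fraction of fields with equal values."""
    return sum(a[k] == b[k] for k in a) / len(a)


def random_pair_baseline(cases, similarity, trials=10_000, seed=0):
    """Mean and variance of similarity over randomly drawn case pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    scores = []
    for _ in range(trials):
        first, second = rng.sample(cases, 2)  # two distinct cases
        scores.append(similarity(first, second))
    return statistics.mean(scores), statistics.pvariance(scores)


# Hypothetical toy cases built from a few invented attribute values.
cases = [
    {"encoding": enc, "os": os_name, "gtld": gtld}
    for enc in ("windows-1252", "gb2312", "iso-8859-9")
    for os_name in ("Windows", "Linux")
    for gtld in ("com", "net", "org")
]

mean_score, variance = random_pair_baseline(cases, field_overlap)
print(mean_score, variance)
```

A score for a specific pair, such as a new incident against its closest stored case, can then be read against this baseline mean rather than against the 0 to 1 scale alone.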

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. That is, the cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores across the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the similarity scores for each hacker adequately underpin the similarity scores from the TCs in the DS and SPE cases.

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LG U+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

                      Retrieved case         Tested cases
Case name/notifier    DarkSeoul (DS)         Hmei7                       d3b_X                StifLer
Encoding              Windows-1252           Windows-1252                Windows-1252         ISO-8859-9
IP address            203.248.195.178        203.86.238.68               203.124.37.66        77921083
Domain                gyunggi.onnet21.com    http://www.garycheng.com    health.ajk.gov.pk    yapikimyasallari.com.tr
Date                  20 Mar 2013            6 Feb 2014                  4 Feb 2014           8 Jun 2013
OS                    Windows                Windows                     Windows              Windows
Similarity            -                      0.690                       0.675                0.665
Cluster               -                      0                           8                    4

                      Retrieved case                             Tested cases
Case name/notifier    Sony Pictures Entertainment (SPE)          Oaddah                   MTRiX                EL_MuHaMMeD
Encoding              EUC-KR, EUC-CN                             GB2312                   GB2312               GB2312
IP address            203.131.222.102                            2031241555               20829198             208.116.45.34
Domain                http://www.sonypicturesstockfootage.com    http://www.hzkcgg.com    daxdigitalrom.com    digitalairstrip.net
Date                  24 Nov 2014                                14 Jun 2012              16 Dec 2002          18 June 2009
OS                    Windows                                    Windows                  Windows              Windows
Similarity            -                                          0.615                    0.615                0.600
Cluster               -                                          7                        7                    7

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers’ attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies.

To describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, a type of Sankey diagram developed to represent changes in a network structure over time [53], as shown in Figure 14. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and the SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator understand the flow and core of the crime clearly by laying out multidimensional evidence that is intricately entangled or hidden and therefore hard to present.
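The attack rates summarized in Table 5 are shares of a hacker’s cases per year or per domain type. A minimal sketch of that tabulation is given below; the record layout (`year`, `gtld`) and the sample cases are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def attack_rates(cases, key):
    """Return the share (in percent) of a hacker's cases per value of `key`."""
    counts = Counter(case[key] for case in cases)
    total = sum(counts.values())
    return {value: round(100 * n / total, 2) for value, n in counts.items()}


# Made-up cases for one hypothetical hacker.
cases = [
    {"year": 2013, "gtld": "com"},
    {"year": 2013, "gtld": "gov"},
    {"year": 2014, "gtld": "com"},
    {"year": 2014, "gtld": "com"},
]
by_year = attack_rates(cases, "year")
by_gtld = attack_rates(cases, "gtld")
```

The same per-hacker shares, computed over year, ccTLD, and gTLD, are what the flow widths of an alluvial diagram would encode.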

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, var = 0.0866). (b) Distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. Axes: similarity score (X) and frequency (Y).

Figure 13: Distribution of the similarity score for randomly selected 100 hackers. Axes: mean value of the similarity score (X) and frequency (Y).

For that reason, we sought to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents can be included in our study. For example, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites’ resources were sanitized through data parsing and data cleaning. Based on this large real-world dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed and the significant features were chosen for case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Reducing the amount of data and the time to be analysed is also an important factor in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value is the hacker’s attack rate (%).

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Each alluvial diagram has the columns Hacker, Year, ccTLD, gTLD, and Attack.

Although the obtained results appear sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when information is lacking. The proposed methodology also facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Section 5, a cross-data analysis with various other data sources should be considered. In other words, additional online or offline information acquired through human intelligence (HUMINT), different types of signals intelligence (SIGINT), and other sources may also help in reasoning about the constituent elements of a crime and in narrowing the scope of an investigation. Furthermore, the proposed methodology can be extended to express incident information in a form that is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and the Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of information-sharing systems [54].
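As a rough illustration of what such an exchange format could look like, the sketch below serializes one profiled notifier as a STIX 2.1-style threat-actor object inside a bundle, using only the Python standard library. It is an assumption about a possible mapping, not part of the proposed system, and the output has not been validated against the STIX specification; the helper names are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_timestamp():
    """UTC timestamp with millisecond precision and a trailing 'Z'."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"


def threat_actor(name, description):
    """Build a dictionary shaped like a STIX 2.1 threat-actor object."""
    now = stix_timestamp()
    return {
        "type": "threat-actor",
        "spec_version": "2.1",
        "id": f"threat-actor--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "name": name,
        "description": description,
    }


def bundle(objects):
    """Wrap objects in a bundle so they can be exchanged as one document."""
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


actor = threat_actor(
    "Hmei7",
    "Notifier of a defacement case with similarity 0.690 to the DS case (Table 4).",
)
serialized = json.dumps(bundle([actor]), indent=2)
```

A production mapping would more likely rely on a maintained STIX library and carry the case-vector fields in dedicated properties rather than in a free-text description.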

The web resources gathered from the zone-h.org site contain features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) and image or mp3 files. Although these features are available for only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.


[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.


selected hacker, some parts of the past attack cases (i.e., ratio R) concerning a hacker were within the defined N scope (see Figure 9).

Figure 10 shows the number of identified hackers for a retrieved case (i.e., the most recent case) among all hacking cases of each hacker. The X-axis in Figure 10 shows the criterion of the top N scope, with eight criterion levels (%), and the ratio R, with six criterion levels (%). The Y-axis presents the number of identified hackers in the top N scope among the 100 hackers randomly selected in Step 1. As can be seen in Figure 10, the higher the ratio R and the narrower the N scope, the lower the number of identified hackers in the top N scope among the 100 randomly selected hackers. Conversely, the lower the ratio R and the wider the N scope, the higher the number of identified hackers. Consequently, because hackers or hacking groups that attacked only the same or similar targets were rare, it is impossible to obtain a high similarity score for all cases of a hacker, even when the cases were caused by the same hacker. Nevertheless, the results demonstrated that the proposed CBR-based decision support methodology can successfully reduce the number of hackers and their cases and suggest potential top N percent candidates among hundreds of thousands of cases.
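The interplay between the top N scope and the ratio R can be sketched as below. This reflects one reading of the criterion, namely that a hacker counts as identified when at least R percent of that hacker’s past cases fall within the top N percent of the similarity ranking for the retrieved case; the function name and the sample ranking are hypothetical.

```python
import math


def is_identified(ranked_hackers, hacker, top_n_percent, ratio_r_percent):
    """Check the assumed top-N / ratio-R criterion for one hacker.

    `ranked_hackers` lists the hacker of every stored case, sorted by
    descending similarity to the retrieved case.
    """
    own_total = ranked_hackers.count(hacker)
    if own_total == 0:
        return False
    # Number of cases inside the top N percent of the ranking (at least one).
    cutoff = max(1, math.ceil(len(ranked_hackers) * top_n_percent / 100))
    own_in_scope = ranked_hackers[:cutoff].count(hacker)
    return 100 * own_in_scope / own_total >= ratio_r_percent


# Made-up ranking of ten stored cases, most similar first.
ranking = ["A", "A", "B", "A", "C", "B", "A", "C", "B", "C"]

wide_scope_low_ratio = is_identified(ranking, "A", top_n_percent=30, ratio_r_percent=50)
wide_scope_high_ratio = is_identified(ranking, "A", top_n_percent=30, ratio_r_percent=60)
narrow_scope = is_identified(ranking, "A", top_n_percent=10, ratio_r_percent=50)
```

In this toy ranking the hacker is identified only with the wide scope and the low ratio, which mirrors the trend described for Figure 10.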

Therefore, an investigator should consider the availability and flexibility of data with respect to the data selection criteria for the similarity measurement. As mentioned above, when a new attack occurs, investigators can limit the search range of the data and determine the direction of the criminal investigation. With such a reduction in the number of candidate related cases, the outcomes of our similarity mechanism are highly valuable for reducing the investigation time needed to determine the potential suspect of a given hacking incident.

4.2. Case Study. As mentioned above, the accuracy of the CBR depends on the quality of the collected data, and the overall accuracy is difficult to evaluate. Nevertheless, although the data are insufficient to evaluate the proposed methodology fully, the DS and SPE cases include ground-truth data with specific information related to the hacker or hacking groups. Based on the public ground-truth data of the DS and SPE cases, we found the three hackers or hacking groups most similar to them and identified their characteristics using the proposed similarity measure and the clustering processing.

The hackers of the DS cyberattack defaced the groupware homepage of LG U+, the third-largest telecommunication company in South Korea, and the English version of the Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three-Calaveras image (i.e., skull image) used on LG U+’s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers’ background is European. “HASTATI,” the word written on the KBS homepage, refers to the front line of the Roman troops, hinting that the DS cyberattack could be the starting point of a persistent attack rather than a transient one. Although we excluded other images and messages, as well as other features, from the similarity processes owing to the unanticipated loss or absence of data, one could still establish the similarity and intent of the attackers with reasonable confidence. Moreover, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

Figure 8: The number of past website defacement attack cases of each hacker. Axes: hacker (X) and number of cases (Y).

Figure 9: Scoring step (Step 4) on the top N scope (1~30) and the ratio R (50~100). For each case of a hacker (Case 1 to Case m), the diagram lists the hacker name, date, encoding, IP address, domain, OS, and score.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. In terms of colour, skull image, and misspellings, the image used in the SPE case (Figure 11(c)) had characteristics similar to those of the image used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as “this is the beginning” and “your data” were commonly found in the messages. However, given hackers’ tendency to intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the existing cases collected in the database. This approach is consistent with the CBR process used in cybercrime investigation (see Figure 2). The two new website defacement cases, the DS and the SPE, were applied as the RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases, that is, as input values, we considered a direct comparison between the DS and SPE cases for the similarity score inappropriate [52].

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We summarize further characteristics and metadata associated with them in Table 4. The similarity score was derived by comparing the presented metadata of the DS and SPE cases with all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X belong to Clusters 0 and 8, respectively, two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers Oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7); the hackers of Cluster 7 used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers. Axes: criterion of the top N (%), from Top 1 to Top 30 (X), and number of identified hackers (Y), with one series per ratio of the attack cases (%): 50, 60, 70, 80, 90, and 100.

Next, to ensure the objectivity of the similarity scores in the DS and SPE case study, we computed the similarity score of randomly selected pairs from the whole case base. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples drawn from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high.
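The random-pair baseline of Figure 12(a) can be approximated along the following lines. This is a sketch under assumptions: the similarity function is a stand-in (share of matching fields) rather than equation (1), and the field names and the synthetic case base are hypothetical, so the resulting numbers are not comparable with the reported mean of about 0.3.

```python
import random
from statistics import mean, pvariance

# Hypothetical case-vector fields used only for this illustration.
FIELDS = ("encoding", "tld", "os")


def similarity(a, b):
    """Stand-in measure: share of fields on which two cases agree."""
    return sum(a[f] == b[f] for f in FIELDS) / len(FIELDS)


def random_pair_baseline(cases, n_pairs=10_000, seed=0):
    """Mean and variance of the similarity of randomly drawn case pairs."""
    rng = random.Random(seed)
    scores = [similarity(*rng.sample(cases, 2)) for _ in range(n_pairs)]
    return mean(scores), pvariance(scores)


# Synthetic case base standing in for the cases-centric DB.
gen = random.Random(1)
cases = [
    {
        "encoding": gen.choice(["windows-1252", "gb2312", "utf-8"]),
        "tld": gen.choice(["com", "net", "org", "gov"]),
        "os": gen.choice(["Windows", "Linux"]),
    }
    for _ in range(200)
]
baseline_mean, baseline_var = random_pair_baseline(cases)
```

A score such as 0.69 or 0.615 can then be judged against `baseline_mean` and the spread of the baseline rather than against the nominal maximum of 1.0.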

(a) (b) (c)

Figure 11 A snippet of website defacement cases by a comparison of examples of the DS and SPE the defaced LGU+ groupware homepage(a) and KBS homepage (b) in the DS case and the defaced website in SPE case (c)

Table 4 Further characteristics and metadata associated with the DS and SPE cases

Retrieved case Tested cases

Case name NotifierDarkSeoul (DS) Hmei7 d3b_X StifLer

Encoding Windows-1252 Windows-1252 Windows-1252 ISO-8859-9IP address 203248195178 2038623868 2031243766 77921083Domain gyunggionnet21com httpwwwgarychengcom healthajkgovpk yapikimyasallaricomtrDate 20 Mar 2013 6 Feb 2014 4 Feb 2014 8 Jun 2013OS Windows Windows Windows WindowsSimilarity mdash 0690 0675 0665Cluster mdash 0 8 4

Retrieved case Tested casesCase name Notifier

Sony pictures Entertainment (SPE) Oaddah MTRiX EL_MuHaMMeDEncoding EUC-KR EUC-CN GB2312 GB2312 GB2312IP address 203131222102 2031241555 20829198 2081164534Domain httpwwwsonypicturesstockfootagecom httpwwwhzkcggcom daxdigitalromcom digitalairstripnetDate 24 Nov 2014 14 Jun 2012 16 Dec 2002 18 June 2009OS Windows Windows Windows WindowsSimilarity mdash 0615 0615 0600Cluster mdash 7 7 7+e metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type)

16 Security and Communication Networks

4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers for each case, as listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they targeted primarily Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers' attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they targeted primarily Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator to understand the flow and core of the crime clearly by laying out multidimensional evidence that would otherwise remain entangled or hidden.
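The per-hacker attack rates summarized in Table 5 amount to a frequency count over each hacker's cases. The following is a minimal sketch of that computation, not the authors' implementation: the field names (`notifier`, `tld`, `year`) and the sample records are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def attack_rates(cases: Iterable[dict], feature: str, key: str = "notifier") -> Dict[str, Dict[str, float]]:
    """Return, per hacker, the percentage of cases falling on each value of `feature`."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[key]][case[feature]] += 1

    rates = {}
    for hacker, counter in counts.items():
        total = sum(counter.values())
        # Express each count as a percentage of that hacker's own cases.
        rates[hacker] = {value: round(100.0 * n / total, 2) for value, n in counter.items()}
    return rates


# Hypothetical records; real input would come from the cases-centric DB.
sample_cases = [
    {"notifier": "hacker_a", "tld": "com", "year": "2013"},
    {"notifier": "hacker_a", "tld": "com", "year": "2014"},
    {"notifier": "hacker_a", "tld": "com", "year": "2014"},
    {"notifier": "hacker_a", "tld": "gov", "year": "2014"},
    {"notifier": "hacker_b", "tld": "net", "year": "2009"},
]

tld_rates = attack_rates(sample_cases, "tld")
year_rates = attack_rates(sample_cases, "year")
```

Calling the same function once per feature (TLD, year, country) yields the columns that an alluvial diagram such as Figure 14 would connect.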

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, in order to obtain more accurate results, cross-data analysis with other various data sources should be considered. For example, cybercrime statistics data from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A) and the distribution of the similarity value between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system. Axes: similarity score (x) versus frequency (y). Annotation in panel (a): Mean = 0.2930, Var = 0.0866. Annotations in panel (b): Mean = 0.114, Var = 0.1500; Mean = 0.063, Var = 0.0370; the highest similarity score is 0.69 on the DarkSeoul case and 0.615 on the Sony Pictures Entertainment case.

Figure 13: Distribution of the similarity score for 100 randomly selected hackers. Axes: mean value of the similarity score (x) versus frequency (y).

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. Therefore, we focused on the dataset of zone-h.org, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. Therefore, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the dataset of zone-h.org, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning processes. Also, based on the large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for application to case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and significant data on some prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (in percent); "-" marks no recorded attacks.

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Alluvial diagram with the axes Hacker, Year, ccTLD, gTLD, and Attack.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with other various data sources should be reviewed. Said differently, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the composition requirements of a crime and reduce the scope of investigation. Furthermore, the proposed methodology can be extended so that incident information is compatible and exchangeable with other cyberthreat intelligence systems through standards such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].
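To make the STIX direction concrete, the following minimal sketch wraps one defacement case as a STIX 2.1-style Indicator using only the Python standard library. It is an illustration under stated assumptions, not part of the proposed system: the function name, the domain, and the notifier handle are hypothetical, only a small subset of Indicator properties is populated, and the output is not validated against the specification.

```python
import json
import uuid
from datetime import datetime, timezone


def case_to_stix_indicator(domain: str, notifier: str, observed: datetime) -> dict:
    """Wrap one defacement case as a STIX 2.1-style Indicator dictionary."""
    # STIX timestamps are expressed in UTC with a trailing "Z".
    stamp = observed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": stamp,
        "modified": stamp,
        "name": f"Defaced domain reported by notifier {notifier}",
        "pattern": f"[domain-name:value = '{domain}']",
        "pattern_type": "stix",
        "valid_from": stamp,
    }


indicator = case_to_stix_indicator(
    "defaced.example.com",  # hypothetical domain, not taken from the dataset
    "example_notifier",     # hypothetical notifier handle
    datetime(2014, 2, 6, tzinfo=timezone.utc),
)
print(json.dumps(indicator, indent=2))
```

A serialized object of this kind could then be placed in a bundle and shared over a TAXII channel; that transport step is outside the scope of this sketch.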

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are available for only a small number of hackers, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and for estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. Lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on api call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on cbr and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, "Criminal personality profiling: an outcome and process study," Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, "Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection," in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, "Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases," Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, "Examining hacker participation length in cybercriminal internet-relay-chat communities," Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, "Time-to-event modeling for predicting hacker IRC community participant trajectory," in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, "Identification of cyber criminal by analysing the users profile," International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, "Wordnet-based criminal networks mining for cybercrime investigation," IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, "An interactive human centered data science approach towards crime pattern analysis," Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., "Computerized forensic approach using data mining techniques," in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, "Crime prevention through environmental design (cpted): a review and modern bibliography," Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, "A review of data mining applications in crime," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, "An intelligent analysis of web crime data using data mining," International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, "An intelligent decision-support model using FSOM and rule extraction for crime prevention," Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, "Mining term networks from text collections for crime investigation," Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, "An enhanced algorithm to predict a future crime using data mining," International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., "Profiling cyber attackers using case-based reasoning," in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., "Mining criminal networks from unstructured text documents," Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, "Cyber profiling using log analysis and k-means clustering," International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, "Crime pattern detection using data mining," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, "Syria, Egypt crises spur escalation of me cyber attacks," 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, "Character encoding in corpus construction," in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., "Cascading style sheets level 2 revision 1 (CSS 2.1) specification," W3C Working Draft, 2005, http://www.w3.org/TR/CSS21/.

[49] W. Stuckey, "Massive sony breach sheds light on murky hacker universe," 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, "Sony pictures malware tied to Seoul, 'Shamoon' cyber-attacks," 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks/.

[51] J. Pagliery, "Sony hack: signs point to North Korea," 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, "Case-based reasoning: an introduction," Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, "Mapping change in large networks," PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, "STIX/TAXII standards," 2017-2018, https://oasis-open.github.io/cti-documentation/.

Korean Broadcasting System (KBS) homepage. They left unique images and many messages on the defaced websites. The three Calaveras image (i.e., skull image) used in the LGU+'s defaced website appeared on many European websites. The character encoding set of the message was the Western European language system. Based on these insights, we could infer that the hackers' background is European. "HASTATI" was the word written on the KBS homepage, meaning the forefront line of the Roman troops, hinting that the DS cyberattack could be a starting point: rather than a transient attack, it was a persistent one. Even if we excluded other images and messages, as well as other features, from the similarity processes due to the unanticipated loss or absence of data, one could establish the similarity and intent of the attackers with reasonable confidence. However, given a sufficiently large hacker profiling source, such abundant data could support and enhance the accuracy of inference. Figure 11 shows the screenshots of the defaced websites at that time.

In the SPE case, similarly to the DS case, some images and messages were left on the computers of SPE. Regarding colour, skull images, and misspellings, the images used in the SPE case (Figure 11(c)) had characteristics similar to those of the images used in the DS case (Figure 11(b)). As shown in Figure 11, the green and red colour schemes and the visual similarities seen in the skull images are other crucial elements for crime tracing. In both the DS and SPE cases, phrases such as "this is the beginning" and "your data" were commonly found in the messages. However, given that hackers intentionally forge or hide their identity, motivation, and location, some experts say that these characteristics are not conclusive proof that Sony was attacked by the same hacker [49–51].

For the evaluation of the results of the case study, we first measured the similarity between the new website defacement cases (i.e., the DS and SPE cases) and the collected existing cases in the database. This approach coheres with the CBR process used in cybercrime investigation (see Figure 2). Two new website defacement cases, the DS and the SPE, were applied as RC, and the similarity score for each of these two cases was computed using the similarity measure (see equation (1)) proposed in Section 3.3.1. Because the DS and SPE cases both function as target cases, that is, as input values, we considered that a direct comparison between the DS and SPE cases for the similarity score was not appropriate [52].
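The retrieval step described above can be sketched as ranking every stored case against the input case and keeping the best matches. Equation (1) is not reproduced in this part of the paper, so the weighted exact-match similarity below, along with the feature names and weights, is an assumption chosen for illustration and should not be read as the measure used in the study.

```python
from typing import Dict, List, Tuple

Case = Dict[str, str]

# Hypothetical feature weights; the study's own weighting is defined by equation (1).
WEIGHTS: Dict[str, float] = {"encoding": 0.4, "os": 0.2, "tld": 0.2, "year": 0.2}


def similarity(a: Case, b: Case, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted share of features on which two cases agree, in the range 0 to 1."""
    total = sum(weights.values())
    matched = sum(
        w for feature, w in weights.items()
        if a.get(feature) is not None and a.get(feature) == b.get(feature)
    )
    return matched / total


def top_n(query: Case, case_base: List[Case], n: int = 3) -> List[Tuple[float, Case]]:
    """Return the n stored cases most similar to the query, best first."""
    scored = [(similarity(query, case), case) for case in case_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort on score only
    return scored[:n]


# Hypothetical query and case base, standing in for an input case and the case DB.
query_case = {"encoding": "Windows-1252", "os": "Windows", "tld": "com", "year": "2013"}
stored_cases = [
    {"notifier": "a", "encoding": "Windows-1252", "os": "Windows", "tld": "com", "year": "2014"},
    {"notifier": "b", "encoding": "GB2312", "os": "Linux", "tld": "net", "year": "2009"},
    {"notifier": "c", "encoding": "Windows-1252", "os": "Windows", "tld": "com", "year": "2013"},
]
ranked = top_n(query_case, stored_cases, n=2)
```

Features absent from the weight table (here, the notifier handle) do not influence the score, which mirrors the idea of comparing cases only on the defined case vector.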

The similarity measure mentioned in the previous paragraph is based on the metadata released in analysis reports of the real DS and SPE cases. We further summarized the characteristics and metadata associated with them in Table 4. The similarity score was derived through comparison between the presented metadata of the DS and SPE cases and all cases in the cases-centric DB. We list the three most similar cases according to the similarity score (see the right side of Table 4). Notifiers Hmei7 and d3b_X are among the cases that belonged to Clusters 0 and 8, which were the two clusters that exhibited identical characteristics. It can thus be understood that they used the encoding system pertinent to Central European languages based on the Latin language system and typically launched attacks against profit organizations located in Western Europe. Notifiers oaddah, MTRiX, and EL_MuHaMMeD were all classified into the same cluster (Cluster 7), where the hackers used the encoding system pertinent to Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Figure 10: The number of identified hackers in the top N scope among the randomly selected 100 hackers. Axes: criterion of the top N, from Top 1 to Top 30 (x), versus number of identified hackers (y); series: ratio of the attack cases, from 50 to 100.

Next, to ensure the objectivity of the similarity score based on the case study of the DS and SPE cases, we computed the similarity score of randomly selected pairs from the whole case base. Figure 12(a) shows the distribution of the similarity score of the randomly selected cases. We obtained the distribution of the similarity score using the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the top similarity scores of the DS and SPE cases (Figure 12(b)) are not low relative to this baseline, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and all measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and all measured cases were concentrated around similarity scores (X-axis) of 0.0 to 0.2.
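The random-pair baseline above can be sketched as a simple resampling loop: draw two distinct cases, score them, and repeat 10,000 times before summarizing. The feature-overlap similarity and the three toy cases are assumptions standing in for equation (1) and the real case base.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def feature_overlap(a: dict, b: dict) -> float:
    """Stand-in similarity: share of features on which two cases agree."""
    keys = set(a) | set(b)
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


def random_pair_baseline(
    cases: Sequence[dict],
    similarity: Callable[[dict, dict], float],
    trials: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float, List[float]]:
    """Score `trials` random pairs; return (mean, population variance, all scores)."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    scores = []
    for _ in range(trials):
        a, b = rng.sample(cases, 2)  # two distinct cases, drawn without replacement
        scores.append(similarity(a, b))
    return statistics.fmean(scores), statistics.pvariance(scores), scores


# Hypothetical toy case base; the study samples from the full cases-centric DB.
toy_cases = [
    {"encoding": "Windows-1252", "os": "Windows"},
    {"encoding": "Windows-1252", "os": "Linux"},
    {"encoding": "GB2312", "os": "Linux"},
]
baseline_mean, baseline_var, baseline_scores = random_pair_baseline(
    toy_cases, feature_overlap, trials=200, seed=1
)
```

A top-match score can then be read against `baseline_mean` rather than against the nominal maximum of 1.0, which is the comparison the paragraph above makes with Figure 12(a).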

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To obtain the mean value of the similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. The cases used for this similarity score are not all cases in the cases-centric DB but only the past cases conducted by that hacker in the cases-centric DB. The mean value of the similarity scores over the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the per-hacker similarity scores adequately underpin the similarity scores from the TCs in the DS and SPE cases.
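The per-hacker statistic can be sketched as follows. The text states only that each hacker's score is computed from that hacker's own past cases, so averaging over all pairs of a hacker's cases is an assumption here, as are the `notifier` field name and the encoding-only similarity.

```python
from collections import defaultdict
from itertools import combinations
from statistics import fmean
from typing import Callable, Dict, Iterable


def encoding_match(a: dict, b: dict) -> float:
    """Stand-in similarity: 1.0 when two cases share an encoding, else 0.0."""
    return 1.0 if a["encoding"] == b["encoding"] else 0.0


def per_hacker_mean_similarity(
    cases: Iterable[dict],
    similarity: Callable[[dict, dict], float],
    key: str = "notifier",
) -> Dict[str, float]:
    """Mean pairwise similarity among each hacker's own cases."""
    by_hacker = defaultdict(list)
    for case in cases:
        by_hacker[case[key]].append(case)

    means = {}
    for hacker, own_cases in by_hacker.items():
        pairs = list(combinations(own_cases, 2))
        if pairs:  # a hacker with a single case has no pair to compare
            means[hacker] = fmean(similarity(a, b) for a, b in pairs)
    return means


# Hypothetical records for three notifiers.
history = [
    {"notifier": "h1", "encoding": "Windows-1252"},
    {"notifier": "h1", "encoding": "Windows-1252"},
    {"notifier": "h1", "encoding": "ISO-8859-9"},
    {"notifier": "h2", "encoding": "GB2312"},
    {"notifier": "h3", "encoding": "GB2312"},
    {"notifier": "h3", "encoding": "GB2312"},
]
hacker_means = per_hacker_mean_similarity(history, encoding_match)
overall_mean = fmean(hacker_means.values())
```

The value of `overall_mean` plays the role of the 0.5233 figure reported above: a reference level against which the top-match scores in Table 4 are compared.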

Figure 11: A snippet of website defacement cases comparing examples of the DS and SPE cases: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
|---|---|---|---|---|
| Case name / Notifier | DarkSeoul (DS) | Hmei7 | d3b_X | StifLer |
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203248195178 | 2038623868 | 2031243766 | 77921083 |
| Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| | Retrieved case | Tested case 1 | Tested case 2 | Tested case 3 |
|---|---|---|---|---|
| Case name / Notifier | Sony Pictures Entertainment (SPE) | Oaddah | MTRiX | EL_MuHaMMeD |
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203131222102 | 2031241555 | 20829198 | 2081164534 |
| Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 Jun 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


As mentioned in Discussion and Limitations a cross-data analysis with other various data sources should bereviewed Said differently the use of additional online oroffline information acquired by human intelligence(HUMINT) or different types of signal intelligence(SIGINT) and sources may also help to reason compo-sition requirements of crime and reduce the category ofinvestigation Furthermore the proposed methodologycan be expanded into incident information for compat-ibility and information exchangeability with othercyberthreat intelligence system as the Structured +reatInformation eXpression (STIX) and Trusted AutomatedeXchange of Indicator Information (TAXII) which arekey strategic elements of the information-sharingsystem [54]

+ere are features such as the particular messages (iethanks-to notifier nationality religion and anniversary)or image and mp3 file in the web resources which aregathered from the zone-horg site Although these featuresare limited to only a small number of hackers of the webresources in future research we will try to study a close-knit network among them such as the hub hacking groupkey player and followers Furthermore we also plan tomore definitely classify and systemize the hackersrsquo intentsusing text mining and mood detection techniques +efindings of this prospective study will contribute mean-ingful insights to trace hackersrsquo behavioural patterns and toestimate their primary purpose and intent

Data Availability

+e web-hacking dataset applied to our paper can bedownloaded from the linked site below httpocslabhksecuritynetDatasetsweb-hacking-profiling

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

+is work was supported under the framework of internationalcooperation program managed by the National ResearchFoundation of Korea (No 2017K1A3A1A17092614)

References

[1] S S Response ldquoSwift attackersrsquo malware linked to more fi-nancial attacksrdquo 2016 httpswwwsymanteccomconnectblogsswift-attackers-malware-linked-more-financial-attacks

[2] S S Response ldquoWannacry ransomware attacks show strong linksto lazarus grouprdquo 2017 httpswwwsymanteccomconnectblogswannacry-ransomware-attacks-show-strong-links-lazarus-group

[3] K lab ldquoLazarus under the hoodrdquo 2018 httpsmediakasperskycontenthubcomwp-contentuploadssites4320180307180244Lazarus_Under_+e_Hood_PDF_finalpdf

[4] Operation Blockbuster ldquoDestructive malware reportrdquo 2016httpswwwoperationblockbustercomwp-contentuploads201602Operation-Blockbuster-Destructive-Malware-Reportpdf

[5] D Martin and SANS Institute InfoSec Reading Room ldquoTracingthe lineage of DarkSeoulrdquo 2016 httpswwwsansorgreading-roomwhitepaperscriticaltracing-lineage-darkseoul-36787

[6] D S C T U T Intelligence ldquoWiper malware threatanalysisrdquo 2013 httpswwwsecureworkscomresearchwiper-malware-analysis-attacking-korean-financial-sector

[7] R Sherstobitoff M L Itai Liba and O O T C James WalterldquoDissecting operation troy cyberespionage in South Koreardquo2013 httpswwwmcafeecomenterpriseen-usassetswhite-paperswp-dissecting-operation-troypdf

[8] N Horton andA DeSimone ldquoSonyrsquos nightmare before christmasthe 2014 North Korean cyber attack on Sony and lessons for USgovernment actions in cyberspacerdquo 2018 httpswwwjhuapleduContentdocumentsSonyNightmareBeforeChristmaspdf

[9] I K Lee and S R Ramsey 9e Korean Language StateUniversity of New York Albany NY USA 2000

[10] V Benjamin and H Chen ldquoSecuring cyberspace identifyingkey actors in hacker communitiesrdquo in Proceedings of the 2012IEEE International Conference on Intelligence and SecurityInformatics pp 24ndash29 Arlington VA USA June 2012

[11] Y Lu X Luo M Polgar et al ldquoSocial network analysis of acriminal hacker communityrdquo Journal of Computer In-formation Systems vol 51 no 2 pp 31ndash41 2010

[12] J-W Jang H Kang J Woo A Mohaisen and H K KimldquoAndro-autopsy anti-malware system based on similaritymatching of malware and malware creator-centric in-formationrdquo Digital Investigation vol 14 pp 17ndash35 2015

[13] J W Jang and H K Kim ldquoFunction-orientedmobile malwareanalysis as first aidrdquo Mobile Information Systems vol 2016Article ID 6707524 11 pages 2016

[14] Y Ki E Kim and H K Kim ldquoA novel approach to detectmalware based on api call sequence analysisrdquo InternationalJournal of Distributed Sensor Networks vol 11 no 6 ArticleID 659101 2015

[15] M L Han H C Han A R Kang et al ldquoWeb-hacking datasetfor the cyber criminal profilingrdquo 2016 httpocslabhksecuritynetDatasetsweb-hacking-profiling

[16] M L Han H C Han A R Kang B I Kwak A Mohaisenand H K Kim ldquoWAHP web-hacking profiling using case-based reasoningrdquo in Proceedings of the 2016 IEEE Conference

Security and Communication Networks 19

on Communications and Network Security (CNS) pp 344-345Philadelphia PA USA October 2016

[17] A Aamodt and E Plaza ldquoCase-based reasoning foundationalissues methodological variations and system approachesrdquo AICommunications vol 7 no 1 pp 39ndash59 1994

[18] D M L Martins and F B D Lima Neto ldquoHybrid intelligentdecision support using a semiotic case-based reasoning andself-organizing mapsrdquo IEEE Transactions on Systems Manand Cybernetics Systems no 99 pp 1ndash8 2017

[19] H K Kim K H Im and S C Park ldquoDSS for computersecurity incident response applying CBR and collaborativeresponserdquo Expert Systems with Applications vol 37 no 1pp 852ndash870 2010

[20] J-B Lamy B Sekar G Guezennec J Bouaud andB Seroussi ldquoExplainable artificial intelligence for breastcancer a visual case-based reasoning approachrdquo ArtificialIntelligence in Medicine vol 94 pp 42ndash53 2019

[21] M Relich and P Pawlewski ldquoA case-based reasoning ap-proach to cost estimation of new product developmentrdquoNeurocomputing vol 272 pp 40ndash45 2018

[22] E R Reyes S Negny G C Robles et al ldquoImprovement ofonline adaptation knowledge acquisition and reuse in case-based reasoning application to process engineering designrdquoEngineering Applications of Artificial Intelligence vol 41pp 1ndash16 2015

[23] H K Kim S-K Kim and S-H Kim ldquoDecision supportsystem for zero-day attack responserdquo Applied Mathematicsand Information Sciences vol 6 no 1 pp 221Sndash241S 2012

[24] G Horsman C Laing and P Vickers ldquoA case-based rea-soning method for locating evidence during digital forensicdevice triagerdquo Decision Support Systems vol 61 pp 69ndash782014

[25] G Horsman C Laing and P Vickers ldquoA case based reasoningsystem for automated forensic examinationsrdquo in Proceedings ofthe PGNET 2011 the 12th Annual Postgraduate Symposium onthe Convergence of Telecommunications Networking andBroadcasting pp 26ndash31 Liverpool UK June 2011

[26] Z Yin Y Gao and B Chen ldquoOn development of supple-mentary criminal analysis system based on cbr and ontologyrdquoin Proceedings of the 2010 International Conference onComputer Application and System Modeling (ICCASM 2010)vol 14 Taiyuan China October 2010

[27] A J Pinizzotto and N J Finkel ldquoCriminal personality pro-filing an outcome and process studyrdquo Law and HumanBehavior vol 14 no 3 pp 215ndash233 1990

[28] P Chen and J Kurland ldquoTime place and modus operandi asimple apriori algorithm experiment for crime pattern de-tectionrdquo in Proceedings of the 2018 9th International Con-ference on Information Intelligence Systems and Applications(IISA) pp 1ndash3 Zakynthos Greece July 2018

[29] C J R Collie and K Shalev Greene ldquoExamining modusoperandi in stranger child abduction a comparison ofattempted and completed casesrdquo Journal of InvestigativePsychology and Offender Profiling vol 16 no 2 pp 91ndash1092019

[30] V Benjamin B Zhang J F Nunamaker Jr and H ChenldquoExamining hacker participation length in cybercriminalinternet-relay-chat communitiesrdquo Journal of ManagementInformation Systems vol 33 no 2 pp 482ndash510 2016

[31] V Benjamin and H Chen ldquoTime-to-event modeling forpredicting hacker IRC community participant trajectoryrdquo inProceedings of the 2014 IEEE Joint Intelligence and SecurityInformatics Conference pp 25ndash32 +e Hague +e Nether-lands September 2014

[32] K Veena and K Meena ldquoIdentification of cyber criminal byanalysing the users profilerdquo International Journal of NetworkSecurity vol 20 no 4 pp 738ndash745 2018

[33] F Iqbal B C M Fung M Debbabi R Batool andA Marrington ldquoWordnet-based criminal networks miningfor cybercrime investigationrdquo IEEE Access vol 7pp 22740ndash22755 2019

[34] N Qazi and B L W Wong ldquoAn interactive human centereddata science approach towards crime pattern analysisrdquo In-formation Processing ampManagement vol 56 no 6 p 1020662019

[35] N Jain P Sharma R Anchan et al ldquoComputerized forensicapproach using data mining techniquesrdquo in Proceedings of theACM Symposium on Women in Research 2016 pp 55ndash60ACM New York NY USA 2016

[36] P M Cozens G Saville and D Hillier ldquoCrime preventionthrough environmental design (cpted) a review and modernbibliographyrdquo Property Management vol 23 no 5pp 328ndash356 2005

[37] H Hassani X Huang E S Silva andM Ghodsi ldquoA review ofdata mining applications in crimerdquo Statistical Analysis andData Mining 9e ASA Data Science Journal vol 9 no 3pp 139ndash154 2016

[38] A Sharma and S Sharma ldquoAn intelligent analysis of webcrime data using data miningrdquo International Journal of En-gineering and Innovative Technology (IJEIT) vol 2 no 32012

[39] S-T Li S-C Kuo and F-C Tsai ldquoAn intelligent decision-support model using FSOM and rule extraction for crimepreventionrdquo Expert Systems with Applications vol 37 no 10pp 7108ndash7119 2010

[40] Y-H Tseng Z-P Ho K-S Yang and C-C Chen ldquoMiningterm networks from text collections for crime investigationrdquoExpert Systems with Applications vol 39 no 11 pp 10082ndash10090 2012

[41] A Malathi and S S Baboo ldquoAn enhanced algorithm topredict a future crime using data miningrdquo InternationalJournal of Computer Applications vol 21 no 1 2011

[42] S Kapetanakis A Filippoupolitis G Loukas et al ldquoProfilingcyber attackers using case-based reasoningrdquo in Proceedings ofthe 19th UK Workshop on Case-Based Reasoning (UKCBR2014) Cambridge UK December 2014

[43] R Al-Zaidy B C Fung A M Youssef et al ldquoMining criminalnetworks from unstructured text documentsrdquo Digital In-vestigation vol 8 no 3-4 pp 147ndash160 2012

[44] M Zulfadhilah Y Prayudi and I Riadi ldquoCyber profilingusing log analysis and k-means clusteringrdquo InternationalJournal of Advanced Computer Science and Applicationsvol 7 no 7 pp 430ndash435 2016

[45] S V Nath ldquoCrime pattern detection using data miningrdquo inProceedings of the 2006 IEEEWICACM International Con-ference on Web Intelligence and Intelligent Agent TechnologyWorkshops pp 41ndash44 Hong Kong China December 2006

[46] ITPnet ldquoSyria Egypt crises spur escalation of me cyber at-tacksrdquo 2013 httpwwwitpnet594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack

[47] A McEnery and R Xiao ldquoCharacter encoding in corpusconstructionrdquo in Developing Linguistic Corpora A Guide toGood Practice Oxbow Books Ltd Oxford UK 2005

[48] B Bos T Ccedilelik I Hickson et al ldquoCascading style sheets level2 revision 1 (CSS 21) specificationrdquo W3C Working Draft2005 httpwwww3orgTRCSS21

20 Security and Communication Networks

[49] W Stuckey ldquoMassive sony breach sheds light on murkyhacker universerdquo 2018 httpamericaaljazeeracomarticles20141224sony-hacker-universehtml

[50] S Gallagher ldquoSony pictures malware tied to SeoulldquoShamoonrdquo cyber-attacksrdquo 2018 httpsarstechnicacominformation-technology201412sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks

[51] J Pagliery ldquoSony hack signs point to North Koreardquo 2018httpsmoneycnncom20141205technologysecuritysony-hack-north-korea-employeeindexhtml

[52] K Ketler ldquoCase-based reasoning an introductionrdquo ExpertSystems with Applications vol 6 no 1 pp 3ndash8 1993

[53] M Rosvall and C T Bergstrom ldquoMapping change in largenetworksrdquo PLoS One vol 5 no 1 Article ID e8694 2010

[54] OASIS ldquoSTIXTAXII standardsrdquo 2017-2018 httpsoasis-opengithubiocti-documentation

Security and Communication Networks 21


Page 16: CBR-Based Decision Support Methodology for Cybercrime

as the same cluster (Cluster 7), where the hackers of Cluster 7 used encoding systems pertinent to the Arabic and Chinese languages and typically attacked profit organizations located in Western Europe.

Next, to ensure the objectivity of the similarity scores obtained in the DS and SPE case studies, we computed the similarity score of randomly selected pairs from the whole set of cases. Figure 12(a) shows the distribution of the similarity scores of the randomly selected cases. We obtained this distribution by relying on the central limit theorem, which describes the distribution of the averages of random samples extracted from a finite population. The calculation of the similarity score of two randomly selected website defacement cases was repeated 10,000 times. The similarity scores of randomly selected pairs of cases were typically distributed around 0.3. This result (Figure 12(a)) substantiates that the similarity scores of the DS and SPE cases (Figure 12(b)) are not low, even if they do not appear numerically high. Figure 12(b) shows the similarity scores of the DS and SPE cases. The top similarity score was 0.69 in the DS case, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.15 and of 0.5 to 0.6. In the SPE case, the top similarity score was 0.615, and the measured cases concentrated around similarity scores (X-axis) of 0.0 to 0.2.
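The random-pair baseline described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the study: the field names, the toy case base, and the `similarity` function (a plain fraction of matching attributes) are hypothetical stand-ins for the case vector and similarity measure defined earlier in the paper.

```python
import random
from statistics import mean, pvariance

# Hypothetical case-vector fields (assumption; the real vector is defined in the paper).
FIELDS = ("encoding", "os", "tld", "year")


def similarity(case_a, case_b):
    """Fraction of fields on which two cases agree (assumed stand-in measure)."""
    matches = sum(1 for field in FIELDS if case_a[field] == case_b[field])
    return matches / len(FIELDS)


def random_pair_baseline(cases, n_trials=10_000, seed=0):
    """Score n_trials randomly drawn pairs of distinct cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        case_a, case_b = rng.sample(cases, 2)  # two distinct cases per trial
        scores.append(similarity(case_a, case_b))
    return scores


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy case base with made-up values, for illustration only.
    toy_cases = [
        {
            "encoding": rng.choice(["windows-1252", "gb2312", "iso-8859-9"]),
            "os": rng.choice(["Windows", "Linux"]),
            "tld": rng.choice(["com", "net", "org", "gov"]),
            "year": rng.choice([2012, 2013, 2014]),
        }
        for _ in range(500)
    ]
    scores = random_pair_baseline(toy_cases)
    print(f"mean={mean(scores):.4f} var={pvariance(scores):.4f}")
```

A score obtained for a retrieved case can then be read against the mean of this baseline rather than against the nominal maximum of 1.0.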

Figure 13 shows the distribution of the similarity score for the 100 randomly selected hackers mentioned in Section 4.1. To determine the mean similarity score for each hacker, we calculated the similarity score from the hacker's own past cases. That is, the cases used for this similarity score are not all the cases in the cases-centric DB but only the past cases conducted by that hacker. The mean value of the similarity scores over the hackers is 0.5233. The similarity scores of the tested cases in Table 4 are above this mean value. Thus, the per-hacker similarity scores adequately underpin the similarity scores obtained from the TCs in the DS and SPE cases.
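The comparison above can be sketched in a few lines. The pairwise averaging over a hacker's own cases, the field names, and the `similarity` function are assumptions made for illustration; only the mean of 0.5233 and the Table 4 scores are taken from the text.

```python
from itertools import combinations
from statistics import mean

# Hypothetical subset of the case vector (assumption).
FIELDS = ("encoding", "os", "tld")


def similarity(case_a, case_b):
    """Fraction of matching fields (assumed stand-in for the paper's measure)."""
    return sum(case_a[f] == case_b[f] for f in FIELDS) / len(FIELDS)


def mean_self_similarity(cases_by_hacker):
    """Map each hacker to the mean pairwise similarity among that hacker's own cases."""
    result = {}
    for hacker, cases in cases_by_hacker.items():
        pairs = list(combinations(cases, 2))
        if pairs:  # a hacker with a single case has no pair to score
            result[hacker] = mean(similarity(a, b) for a, b in pairs)
    return result


def above_baseline(scores, baseline):
    """Flag which scores exceed the baseline value."""
    return {name: score > baseline for name, score in scores.items()}


REPORTED_MEAN = 0.5233  # mean per-hacker similarity reported in the text
TABLE4_SCORES = {       # similarity scores of the tested cases in Table 4
    "Hmei7": 0.690, "d3b_X": 0.675, "StifLer": 0.665,
    "Oaddah": 0.615, "MTRiX": 0.615, "EL_MuHaMMeD": 0.600,
}

if __name__ == "__main__":
    print(above_baseline(TABLE4_SCORES, REPORTED_MEAN))
```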

Figure 11: A snippet of website defacement cases, comparing examples of the DS and SPE cases: the defaced LGU+ groupware homepage (a) and KBS homepage (b) in the DS case, and the defaced website in the SPE case (c).

Table 4: Further characteristics and metadata associated with the DS and SPE cases.

| | Retrieved case | Tested case | Tested case | Tested case |
|---|---|---|---|---|
| Case name / Notifier | DarkSeoul (DS) | Hmei7 | d3b_X | StifLer |
| Encoding | Windows-1252 | Windows-1252 | Windows-1252 | ISO-8859-9 |
| IP address | 203248195178 | 2038623868 | 2031243766 | 77921083 |
| Domain | gyunggionnet21com | httpwwwgarychengcom | healthajkgovpk | yapikimyasallaricomtr |
| Date | 20 Mar 2013 | 6 Feb 2014 | 4 Feb 2014 | 8 Jun 2013 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.690 | 0.675 | 0.665 |
| Cluster | - | 0 | 8 | 4 |

| | Retrieved case | Tested case | Tested case | Tested case |
|---|---|---|---|---|
| Case name / Notifier | Sony Pictures Entertainment (SPE) | Oaddah | MTRiX | EL_MuHaMMeD |
| Encoding | EUC-KR, EUC-CN | GB2312 | GB2312 | GB2312 |
| IP address | 203131222102 | 2031241555 | 20829198 | 2081164534 |
| Domain | httpwwwsonypicturesstockfootagecom | httpwwwhzkcggcom | daxdigitalromcom | digitalairstripnet |
| Date | 24 Nov 2014 | 14 Jun 2012 | 16 Dec 2002 | 18 Jun 2009 |
| OS | Windows | Windows | Windows | Windows |
| Similarity | - | 0.615 | 0.615 | 0.600 |
| Cluster | - | 7 | 7 | 7 |

The metadata are arranged according to the defined case vector corresponding with the DS and SPE cases on the left side (shown in part in boldface type).


4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study, as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the attacks of the similar hackers occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD executed such attacks intensively in 2003 and 2009. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of an attack flow in this figure indicates the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is otherwise entangled or hidden and therefore hard to present.
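The attack rates of Table 5 and the flow thicknesses of an alluvial diagram both reduce to counting cases per hacker along the case-vector axes. The sketch below illustrates that counting step only; the record layout and the key names (`hacker`, `year`, `gtld`) are hypothetical, and drawing the diagram itself is left to a plotting tool.

```python
from collections import Counter, defaultdict


def attack_rates(records, key):
    """Percentage of each hacker's attacks per value of `key` (e.g. 'gtld' or 'year')."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["hacker"]][rec[key]] += 1
    rates = {}
    for hacker, counter in counts.items():
        total = sum(counter.values())
        rates[hacker] = {
            value: round(100.0 * n / total, 2) for value, n in counter.items()
        }
    return rates


def alluvial_flows(records, axes=("hacker", "year", "gtld")):
    """Count records per path through the axes; the count gives the flow thickness."""
    return Counter(tuple(rec[axis] for axis in axes) for rec in records)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"hacker": "A", "year": 2013, "gtld": "com"},
        {"hacker": "A", "year": 2014, "gtld": "com"},
        {"hacker": "A", "year": 2014, "gtld": "gov"},
        {"hacker": "B", "year": 2012, "gtld": "com"},
    ]
    print(attack_rates(records, "gtld"))
    print(alluvial_flows(records))
```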

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases (mean = 0.2930, variance = 0.0866); (b) distribution of the similarity value between the collected website defacement cases and the DS case (A; highest similarity score 0.69) and between the collected website defacement cases and the SPE case (B; highest similarity score 0.615). The similarity was calculated between each studied case and all other cases in our system. Axes: similarity score (x) versus frequency (y).

Figure 13: Distribution of the similarity score for 100 randomly selected hackers. Axes: mean value of the similarity score (x) versus frequency (y).

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources to improve the accuracy and usability of our proposed methodology. However, at the time of writing the present paper, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset on past incidents, not all incidents can be included in our study. Consequently, if a hacker penetrated some target organizations by APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker's attack rate (%).

| Domain | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS) | d3b_X (DS) | StifLer (DS) | Oaddah (SPE) | MTRiX (SPE) | EL_MuHaMMeD (SPE) |
|---|---|---|---|---|---|---|
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). Each panel is an alluvial diagram whose columns are Hacker, Year, ccTLD, gTLD, and Attack.

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering process, using CBR as a methodology. The collected raw data of the defaced websites' resources were sanitized via data parsing and data cleaning. Also, based on a large real dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for use in case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and the significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed is an important factor in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rare for verification (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology helps to facilitate the analysis and to reduce the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, a cross-data analysis with various other data sources should be reviewed. Said differently, the use of additional online or offline information acquired by human intelligence (HUMINT) or different types of signal intelligence (SIGINT) and other sources may also help to reason about the composition requirements of a crime and to narrow the scope of the investigation. Furthermore, the proposed methodology can be expanded to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].
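As a rough illustration of the kind of export envisioned here, the sketch below wraps a case vector in a JSON object that loosely follows the STIX 2.1 common properties (`type`, `spec_version`, `id`, `created`, `modified`). The function name, the input field names, and the `x_case_vector` custom property are hypothetical, and the output has not been validated against the STIX specification.

```python
import json
import uuid
from datetime import datetime, timezone


def case_to_stix_like(case, now=None):
    """Wrap a case vector in a STIX-2.1-style dictionary (illustrative only)."""
    now = now or datetime.now(timezone.utc)
    # Millisecond-precision UTC timestamp, e.g. 2019-12-20T00:00:00.000Z
    stamp = now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return {
        "type": "incident",
        "spec_version": "2.1",
        "id": f"incident--{uuid.uuid4()}",
        "created": stamp,
        "modified": stamp,
        "name": f"Website defacement reported by {case['notifier']}",
        # Hypothetical custom property carrying the remaining case-vector fields.
        "x_case_vector": {k: v for k, v in case.items() if k != "notifier"},
    }


if __name__ == "__main__":
    example = {
        "notifier": "Hmei7",
        "encoding": "Windows-1252",
        "os": "Windows",
        "date": "2014-02-06",
    }
    print(json.dumps(case_to_stix_like(example), indent=2))
```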

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers' intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers' behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, "Swift attackers' malware linked to more financial attacks," 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, "Wannacry: ransomware attacks show strong links to lazarus group," 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, "Lazarus under the hood," 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, "Destructive malware report," 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, "Tracing the lineage of DarkSeoul," 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, "Wiper malware threat analysis," 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, "Dissecting operation troy: cyberespionage in South Korea," 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, "Sony's nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace," 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, "Securing cyberspace: identifying key actors in hacker communities," in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., "Social network analysis of a criminal hacker community," Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, "Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information," Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, "Function-oriented mobile malware analysis as first aid," Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, "A novel approach to detect malware based on API call sequence analysis," International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., "Web-hacking dataset for the cyber criminal profiling," 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, "WAHP: web-hacking profiling using case-based reasoning," in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344–345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, "Case-based reasoning: foundational issues, methodological variations, and system approaches," AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, "Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps," IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, "DSS for computer security incident response applying CBR and collaborative response," Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, "Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach," Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, "A case-based reasoning approach to cost estimation of new product development," Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., "Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design," Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, "Decision support system for zero-day attack response," Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, "A case-based reasoning method for locating evidence during digital forensic device triage," Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, "A case based reasoning system for automated forensic examinations," in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, "On development of supplementary criminal analysis system based on CBR and ontology," in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A J Pinizzotto and N J Finkel ldquoCriminal personality pro-filing an outcome and process studyrdquo Law and HumanBehavior vol 14 no 3 pp 215ndash233 1990

[28] P Chen and J Kurland ldquoTime place and modus operandi asimple apriori algorithm experiment for crime pattern de-tectionrdquo in Proceedings of the 2018 9th International Con-ference on Information Intelligence Systems and Applications(IISA) pp 1ndash3 Zakynthos Greece July 2018

[29] C J R Collie and K Shalev Greene ldquoExamining modusoperandi in stranger child abduction a comparison ofattempted and completed casesrdquo Journal of InvestigativePsychology and Offender Profiling vol 16 no 2 pp 91ndash1092019

[30] V Benjamin B Zhang J F Nunamaker Jr and H ChenldquoExamining hacker participation length in cybercriminalinternet-relay-chat communitiesrdquo Journal of ManagementInformation Systems vol 33 no 2 pp 482ndash510 2016

[31] V Benjamin and H Chen ldquoTime-to-event modeling forpredicting hacker IRC community participant trajectoryrdquo inProceedings of the 2014 IEEE Joint Intelligence and SecurityInformatics Conference pp 25ndash32 +e Hague +e Nether-lands September 2014

[32] K Veena and K Meena ldquoIdentification of cyber criminal byanalysing the users profilerdquo International Journal of NetworkSecurity vol 20 no 4 pp 738ndash745 2018

[33] F Iqbal B C M Fung M Debbabi R Batool andA Marrington ldquoWordnet-based criminal networks miningfor cybercrime investigationrdquo IEEE Access vol 7pp 22740ndash22755 2019

[34] N Qazi and B L W Wong ldquoAn interactive human centereddata science approach towards crime pattern analysisrdquo In-formation Processing ampManagement vol 56 no 6 p 1020662019

[35] N Jain P Sharma R Anchan et al ldquoComputerized forensicapproach using data mining techniquesrdquo in Proceedings of theACM Symposium on Women in Research 2016 pp 55ndash60ACM New York NY USA 2016

[36] P M Cozens G Saville and D Hillier ldquoCrime preventionthrough environmental design (cpted) a review and modernbibliographyrdquo Property Management vol 23 no 5pp 328ndash356 2005

[37] H Hassani X Huang E S Silva andM Ghodsi ldquoA review ofdata mining applications in crimerdquo Statistical Analysis andData Mining 9e ASA Data Science Journal vol 9 no 3pp 139ndash154 2016

[38] A Sharma and S Sharma ldquoAn intelligent analysis of webcrime data using data miningrdquo International Journal of En-gineering and Innovative Technology (IJEIT) vol 2 no 32012

[39] S-T Li S-C Kuo and F-C Tsai ldquoAn intelligent decision-support model using FSOM and rule extraction for crimepreventionrdquo Expert Systems with Applications vol 37 no 10pp 7108ndash7119 2010

[40] Y-H Tseng Z-P Ho K-S Yang and C-C Chen ldquoMiningterm networks from text collections for crime investigationrdquoExpert Systems with Applications vol 39 no 11 pp 10082ndash10090 2012

[41] A Malathi and S S Baboo ldquoAn enhanced algorithm topredict a future crime using data miningrdquo InternationalJournal of Computer Applications vol 21 no 1 2011

[42] S Kapetanakis A Filippoupolitis G Loukas et al ldquoProfilingcyber attackers using case-based reasoningrdquo in Proceedings ofthe 19th UK Workshop on Case-Based Reasoning (UKCBR2014) Cambridge UK December 2014

[43] R Al-Zaidy B C Fung A M Youssef et al ldquoMining criminalnetworks from unstructured text documentsrdquo Digital In-vestigation vol 8 no 3-4 pp 147ndash160 2012

[44] M Zulfadhilah Y Prayudi and I Riadi ldquoCyber profilingusing log analysis and k-means clusteringrdquo InternationalJournal of Advanced Computer Science and Applicationsvol 7 no 7 pp 430ndash435 2016

[45] S V Nath ldquoCrime pattern detection using data miningrdquo inProceedings of the 2006 IEEEWICACM International Con-ference on Web Intelligence and Intelligent Agent TechnologyWorkshops pp 41ndash44 Hong Kong China December 2006

[46] ITPnet ldquoSyria Egypt crises spur escalation of me cyber at-tacksrdquo 2013 httpwwwitpnet594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack

[47] A McEnery and R Xiao ldquoCharacter encoding in corpusconstructionrdquo in Developing Linguistic Corpora A Guide toGood Practice Oxbow Books Ltd Oxford UK 2005

[48] B Bos T Ccedilelik I Hickson et al ldquoCascading style sheets level2 revision 1 (CSS 21) specificationrdquo W3C Working Draft2005 httpwwww3orgTRCSS21

20 Security and Communication Networks

[49] W Stuckey ldquoMassive sony breach sheds light on murkyhacker universerdquo 2018 httpamericaaljazeeracomarticles20141224sony-hacker-universehtml

[50] S Gallagher ldquoSony pictures malware tied to SeoulldquoShamoonrdquo cyber-attacksrdquo 2018 httpsarstechnicacominformation-technology201412sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks

[51] J Pagliery ldquoSony hack signs point to North Koreardquo 2018httpsmoneycnncom20141205technologysecuritysony-hack-north-korea-employeeindexhtml

[52] K Ketler ldquoCase-based reasoning an introductionrdquo ExpertSystems with Applications vol 6 no 1 pp 3ndash8 1993

[53] M Rosvall and C T Bergstrom ldquoMapping change in largenetworksrdquo PLoS One vol 5 no 1 Article ID e8694 2010

[54] OASIS ldquoSTIXTAXII standardsrdquo 2017-2018 httpsoasis-opengithubiocti-documentation

4.3. Follow-Up Investigation. A case study is a research method involving an in-depth and detailed investigation of a subject of study as well as its related contextual methodology. Hence, we conducted follow-up investigations of the three most similar hackers listed in Table 4. According to the results, over 93 percent of the attacks by the hackers similar to the DS case occurred in 2013 and 2014. Their major targets were .com domain sites, and they primarily targeted Germany, Italy, New Zealand, Russia, Turkey, Taiwan, and South Korea (see Table 5). Two hackers (i.e., Hmei7 and d3b_X) primarily attacked government agencies. Interestingly, 20 percent of the attacks by the hacker named d3b_X targeted South Korea. In the SPE incident, the similar hackers’ attacks occurred throughout the period from 2002 to 2014. The hackers named MTRiX and EL_MuHaMMeD intensively executed such attacks in 2003 and 2009, respectively. Their major targets were .com (or .co) and .org domain sites, and they primarily targeted Brazil, Canada, Denmark, France, Greece, Hong Kong, and Italy (see Table 5). Two hackers (i.e., MTRiX and EL_MuHaMMeD) primarily attacked commercial agencies and additionally attacked public and network agencies. As shown in Figure 14, to describe the follow-up investigation more discernibly and to focus on the attack flow, we used an alluvial diagram, which is a type of Sankey diagram developed to represent changes in a network structure over time [53]. It shows the investigation of the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vectors were based on the attack year, ccTLD, and gTLD. The thickness of the attack flow in this figure represents the degree of attack. This network visualization method could help an investigator clearly understand the flow and core of the crime by laying out multidimensional evidence that is otherwise entangled or hidden.
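The attack rates reported in Table 5 and the flow thickness in Figure 14 can both be read as simple aggregations over per-hacker defacement records. The following is a minimal sketch of such an aggregation, not the implementation used in this study. The record layout, the hacker names, and the helper functions `attack_rates` and `flow_weights` are assumptions introduced only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical defacement records: (hacker, year, ccTLD country, gTLD).
# These rows are invented for illustration and are not taken from the dataset.
records = [
    ("hacker_a", 2013, "Germany", "com"),
    ("hacker_a", 2014, "Turkey", "com"),
    ("hacker_a", 2014, "Korea", "gov"),
    ("hacker_a", 2014, "Italy", "com"),
    ("hacker_b", 2003, "Brazil", "com"),
    ("hacker_b", 2003, "Canada", "org"),
]


def attack_rates(rows, field_index):
    """Return {hacker: {value: percentage of that hacker's attacks}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[0]][row[field_index]] += 1
    rates = {}
    for hacker, counter in counts.items():
        total = sum(counter.values())
        rates[hacker] = {
            value: round(100.0 * n / total, 2) for value, n in counter.items()
        }
    return rates


def flow_weights(rows):
    """Count identical (hacker, year, ccTLD, gTLD) paths.

    In an alluvial diagram, such a count would determine the band thickness.
    """
    return Counter(rows)


year_rates = attack_rates(records, 1)  # attack rate per year
gtld_rates = attack_rates(records, 3)  # attack rate per gTLD
flows = flow_weights(records)

print(year_rates["hacker_a"])
print(gtld_rates["hacker_a"])
```

Each inner dictionary sums to roughly 100 percent per hacker, which matches how the year columns of Table 5 behave; the flow counts would then be passed to whatever plotting tool draws the alluvial diagram.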

[Figure 12 plots omitted. Axes: similarity score from 0.0 to 1.0 (x) and frequency (y). Panel annotations as printed: (a) Mean = 0.2930, Var = 0.0866; (b) the highest similarity score is 0.69 on the DarkSeoul case and 0.615 on the Sony Pictures Entertainment case; the two subpanels also report Mean = 0.114, Var = 0.1500 and Mean = 0.063, Var = 0.0370.]

Figure 12: (a) Probability distribution of the similarity score for any pair of randomly selected cases; (b) distribution of the similarity value between the collected website defacement cases and the DS case (A), and between the collected website defacement cases and the SPE case (B). The similarity was calculated between each studied case and all other cases in our system.

[Figure 13 plot omitted. Axes: mean value of the similarity score from 0.00 to 0.75 (x) and frequency (y).]

Figure 13: Distribution of the similarity score for 100 randomly selected hackers.

5. Limitations and Discussion

The CBR algorithm has the disadvantage that its performance may degrade if the properties describing a case are inappropriate. Therefore, to obtain more accurate results, cross-data analysis with various other data sources should be considered. For example, cybercrime statistics from law enforcement agencies, threat intelligence data from malware analysis groups, and vulnerability databases could be useful resources for improving the accuracy and usability of our proposed methodology. However, at the time of writing, we did not have access to open and public data concerning cybercrime.

For that reason, we tried to demonstrate the practicability of the proposed methodology as a proof of concept. We therefore focused on the zone-h.org dataset, which includes a large number of website defacement cases. Although zone-h.org provides an extensive dataset of past incidents, not all incidents can be included in our study. For instance, if a hacker penetrated target organizations through APT attacks and performed stealthy activities, such hacking activities would not be reported in the zone-h.org dataset, and the proposed methodology would not be able to detect similar cases with reasonable confidence.
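The sensitivity of CBR to the properties that describe a case can be illustrated with a small experiment. The sketch below scores one query case against all stored cases and summarizes the resulting similarity distribution by its mean, variance, and maximum, once with all features and once with one feature removed. The weighted feature-matching measure, the feature names, and the case values are assumptions made for this example; they are not the similarity measure or case vector defined in this paper.

```python
from statistics import mean, pvariance

# Hypothetical case vectors; feature names and values are illustrative only.
cases = [
    {"encoding": "utf-8", "gtld": "com", "year": 2013, "font": "arial"},
    {"encoding": "utf-8", "gtld": "gov", "year": 2014, "font": "arial"},
    {"encoding": "euc-kr", "gtld": "com", "year": 2009, "font": "tahoma"},
    {"encoding": "utf-8", "gtld": "com", "year": 2013, "font": "tahoma"},
]
query = {"encoding": "utf-8", "gtld": "com", "year": 2013, "font": "arial"}


def similarity(a, b, weights):
    """Weighted share of features on which two cases agree (assumed measure)."""
    total = sum(weights.values())
    matched = sum(w for f, w in weights.items() if a.get(f) == b.get(f))
    return matched / total


def score_distribution(target, stored, weights):
    """Summarize the similarity of one target case against all stored cases."""
    scores = [similarity(target, case, weights) for case in stored]
    return {"mean": mean(scores), "var": pvariance(scores), "max": max(scores)}


all_features = {"encoding": 1.0, "gtld": 1.0, "year": 1.0, "font": 1.0}
without_font = {"encoding": 1.0, "gtld": 1.0, "year": 1.0}

full = score_distribution(query, cases, all_features)
reduced = score_distribution(query, cases, without_font)

print(full)
print(reduced)
```

In this toy data, dropping the `font` feature makes the fourth case indistinguishable from an exact match, which is the kind of degradation an inappropriate case description can cause.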

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value denotes the hacker’s attack rate (%). A hyphen marks an empty cell.

          DS case                       SPE case
Domain    Hmei7    d3b_X    StifLer    Oaddah    MTRiX    EL_MuHaMMeD
Com       78.32    85.81    100.00     100.00    86.27    82.98
Edu        1.62     0.96    -          -          1.76     1.91
Net        3.40     3.20    -          -          5.46     5.74
Gov       12.16     6.51    -          -          1.06    -

Year      Hmei7    d3b_X    StifLer    Oaddah    MTRiX    EL_MuHaMMeD
2002      -        -        -          -         10.74    -
2003      -        -        -          -         89.08    -
2006      -        -        -          -         -        -
2007       0.09    -        -          -          0.18    -
2008      -        -        -          -         -        -
2009       3.15    -        -          -         -        99.57
2010       0.09    -        -          -         -        -
2011       0.34    -        -          -         -        -
2012       3.40    -        -          100.00    -        -
2013      34.86    39.17    100.00     -         -        -
2014      58.08    59.77    -          -         -         0.43
2015      -         1.07    -          -         -        -

[Figure 14 alluvial diagrams omitted. Columns in both panels: Hacker, Year, ccTLD, gTLD, Attack (No/Yes). Panel (a): hackers d3b_X, Hmei7, and StifLer; years 2009, 2012, 2013, and 2014; ccTLDs Australia, Brazil, France, Germany, Indonesia, Italy, Korea, Netherlands, New Zealand, Poland, Russia, Thailand, Turkey, and Unknown; gTLDs com, gov, net, org, and Unknown. Panel (b): hackers EL_MuHaMMeD, MTRiX, and Oaddah; years 2002, 2003, 2009, and 2012; ccTLDs Brazil, Canada, Denmark, France, Greece, Hong Kong, Italy, and Unknown; gTLDs com, net, org, and Unknown.]

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b).

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and clustering, using CBR as the methodology. The raw data collected from the defaced websites’ resources were sanitized through data parsing and data cleaning. Based on this large real-world dataset, a data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and significant features were chosen for case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and the significant data on prime incidents. Data-driven and evidence-driven decision making should be the critical process. In addition, reducing the amount of data and the time needed for analysis is an important factor in delivering high-value intelligence data.

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate their accuracy unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups are rarely available for verification (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. The proposed methodology also facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Limitations and Discussion, cross-data analysis with various other data sources should be reviewed. In other words, additional online or offline information acquired through human intelligence (HUMINT), different types of signals intelligence (SIGINT), and other sources may also help to reason about the elements that constitute a crime and to narrow the scope of the investigation. Furthermore, the proposed methodology can be extended to incident information for compatibility and information exchange with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].
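As a rough illustration of what such an extension might look like, the sketch below wraps a hacker profile in a JSON object that loosely follows the common properties of a STIX 2.1 `threat-actor` object. It is only a sketch under stated assumptions: the function `to_stix_like`, the custom property `x_case_vector`, and the example values are hypothetical, the object has not been validated against the STIX specification, and no TAXII transport is shown.

```python
import json
import uuid
from datetime import datetime, timezone


def to_stix_like(hacker_name, case_vector):
    """Build a dictionary shaped like a STIX 2.1 threat-actor object.

    The custom property 'x_case_vector' is an assumption of this sketch.
    """
    # Timestamp with millisecond precision and a trailing 'Z' (UTC).
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return {
        "type": "threat-actor",
        "spec_version": "2.1",
        "id": f"threat-actor--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "name": hacker_name,
        "x_case_vector": case_vector,
    }


# Hypothetical profile; the values are not taken from the dataset.
actor = to_stix_like(
    "hacker_a",
    {"gtld": {"com": 75.0, "gov": 25.0}, "year": {"2013": 25.0, "2014": 75.0}},
)
encoded = json.dumps(actor, indent=2)
print(encoded)
```

A plain dictionary is used here instead of a dedicated STIX library so that the example stays self-contained; a production system would more likely rely on a maintained implementation and schema validation.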

The web resources gathered from the zone-h.org site contain features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) and image and mp3 files. Although these features are present for only a small number of hackers, in future research we will study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we plan to classify and systematize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.
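One simple starting point for the network study outlined above would be to treat each thanks-to message as a directed edge and rank hackers by how often they are thanked. The sketch below assumes that such (notifier, thanked hacker) pairs have already been parsed from the web resources; the pairs, the names, and the helper functions are hypothetical, and in-degree is used here only as one possible indicator of a hub.

```python
from collections import Counter

# Hypothetical (notifier, thanked_hacker) pairs parsed from thanks-to messages.
thanks = [
    ("a", "hub"),
    ("b", "hub"),
    ("c", "hub"),
    ("a", "b"),
    ("hub", "b"),
]


def in_degree(edges):
    """Count how many times each hacker is thanked by others."""
    return Counter(target for _, target in edges)


def followers_of(edges, hub):
    """Return the sorted set of hackers who thanked the given hub."""
    return sorted({source for source, target in edges if target == hub})


ranking = in_degree(thanks).most_common()
print(ranking)
print(followers_of(thanks, "hub"))
```

Hackers at the top of the ranking are candidates for hub or key-player roles, and the hackers returned by `followers_of` are candidate followers; richer centrality measures could replace the in-degree once real data are available.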

Data Availability

The web-hacking dataset used in our paper can be downloaded from the following site: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.

Security and Communication Networks 21

Page 18: CBR-Based Decision Support Methodology for Cybercrime

improve the accuracy and usability of our proposedmethodology However at the time of writing the presentpaper we did not have access to open and public dataconcerning cybercrime

For that reason we tried to demonstrate the practica-bility of the proposed methodology as a proof of concept+erefore we focused on the dataset of the zone-horg thatincludes a large number of website defacement cases Al-though the zone-horg provides an extensive dataset on thepast incident events not all incidents can be included in ourstudy +erefore if a hacker penetrated some target orga-nizations by APT attacks and performed stealthy activitiessuch hacking activities would not be reported in the datasetof the zone-horg and the proposed methodology would notbe able to detect similar cases with reasonable confidence

Table 5: Follow-up investigation on the top three hackers with website defacement cases most similar to the DS case and SPE case. The case vector value means the hacker’s attack rate (%).

| Domain | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
| Com | 78.32 | 85.81 | 100.00 | 100.00 | 86.27 | 82.98 |
| Edu | 1.62 | 0.96 | - | - | 1.76 | 1.91 |
| Net | 3.40 | 3.20 | - | - | 5.46 | 5.74 |
| Gov | 12.16 | 6.51 | - | - | 1.06 | - |

| Year | Hmei7 (DS case) | d3b_X (DS case) | StifLer (DS case) | Oaddah (SPE case) | MTRiX (SPE case) | EL_MuHaMMeD (SPE case) |
| 2002 | - | - | - | - | 10.74 | - |
| 2003 | - | - | - | - | 89.08 | - |
| 2006 | - | - | - | - | - | - |
| 2007 | 0.09 | - | - | - | 0.18 | - |
| 2008 | - | - | - | - | - | - |
| 2009 | 3.15 | - | - | - | - | 99.57 |
| 2010 | 0.09 | - | - | - | - | - |
| 2011 | 0.34 | - | - | - | - | - |
| 2012 | 3.40 | - | - | 100.00 | - | - |
| 2013 | 34.86 | 39.17 | 100.00 | - | - | - |
| 2014 | 58.08 | 59.77 | - | - | - | 0.43 |
| 2015 | - | 1.07 | - | - | - | - |

Figure 14: Follow-up investigation on the top three hackers with website defacement cases that are most similar to the DS case (a) and SPE case (b). [Figure not reproduced: each panel links the columns Hacker, Year, ccTLD, gTLD, and Attack; panel (a) shows d3b~x, Hmei7, and StifLer, and panel (b) shows EL_MuHaMMeD, MTRiX, and oaddah.]

6. Conclusion and Future Work

In this study, the similarity of website defacement cases was assessed through the similarity measure and the clustering processing, using CBR as the methodology. The collected raw data of the defaced websites’ resources were sanitized via data parsing and data cleaning. Also, based on a large real dataset, data-driven analysis for hacker profiling was achieved. To this end, the case vector was designed, and the significant features were chosen for the case-based reasoning. For a successful cybercrime investigation, hacker profiling via clustering analysis is the most basic and important process for finding the relevant incident cases and the significant data on prime incidents; data-driven and evidence-driven decision making should be the critical process. Also, reducing the amount of data and the time to be analysed are important factors in delivering high-value intelligence data.
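The similarity assessment summarised above can be illustrated with a minimal sketch. It compares hackers by the year-wise attack rates reported in Table 5. Two points are assumptions made only for this illustration: cosine similarity is used as a stand-in for the similarity measure of the reasoning engine, and a "-" entry in the table is treated as a rate of 0. The names `CASE_VECTORS`, `cosine_similarity`, and `rank_similar` are hypothetical.

```python
import math

# Year-wise attack rates (%) from Table 5, in the order
# 2002, 2003, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015.
# Assumption: a "-" entry in the table is treated as 0.
CASE_VECTORS = {
    "Hmei7": [0, 0, 0, 0.09, 0, 3.15, 0.09, 0.34, 3.40, 34.86, 58.08, 0],
    "d3b_X": [0, 0, 0, 0, 0, 0, 0, 0, 0, 39.17, 59.77, 1.07],
    "MTRiX": [10.74, 89.08, 0, 0.18, 0, 0, 0, 0, 0, 0, 0, 0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(name, vectors):
    """Return (other_name, similarity) pairs, most similar to `name` first."""
    query = vectors[name]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in rank_similar("Hmei7", CASE_VECTORS):
        print(f"Hmei7 vs {other}: {score:.4f}")
```

Under these assumptions, Hmei7 and d3b_X, whose activity is concentrated in 2013 and 2014, score far closer to each other than either does to MTRiX, whose activity falls in 2002 and 2003. A full case vector would also include the domain features of Table 5.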

Although the obtained results appear to be sound and meaningful, it is difficult to evaluate the accuracy of the results unless the attacker is captured. Naturally, ground-truth data with specific information about the involved hacking groups for verification are rare (i.e., no adversary claimed that the two attacks were the result of their actions). However, it is noteworthy that our methodology provides meaningful insight into the confidential and undercover network of cybercrime, especially when there is a lack of information. Also, the proposed methodology facilitates the analysis and reduces the time required to search for possible suspects of cybercrime. We believe that the proposed system is meaningful for further exploration and correlation of various website defacement cases.

As mentioned in Discussion and Limitations, a cross-data analysis with various other data sources should be reviewed. Said differently, the use of additional online or offline information acquired by human intelligence (HUMINT), different types of signals intelligence (SIGINT), and other sources may also help to reason about the composition requirements of a crime and to narrow the category of investigation. Furthermore, the proposed methodology can be expanded so that incident information is compatible and exchangeable with other cyberthreat intelligence systems, such as the Structured Threat Information eXpression (STIX) and Trusted Automated eXchange of Indicator Information (TAXII), which are key strategic elements of the information-sharing system [54].
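As a rough illustration of the kind of exchange format mentioned above, the following sketch wraps one profiled hacker in a STIX 2.1-style `threat-actor` object inside a bundle, using only the Python standard library. It is not part of the proposed methodology and has not been validated against the STIX specification. The helper names (`stix_timestamp`, `hacker_to_threat_actor`, `to_bundle`) are hypothetical, and placing the attack rates in the free-text `description` field is an assumption made for simplicity.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_timestamp():
    """Current UTC time with millisecond precision, formatted like 2019-12-20T00:00:00.000Z."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def hacker_to_threat_actor(alias, gtld_rates):
    """Describe one profiled hacker as a STIX-style threat-actor dictionary.

    Assumption: the gTLD attack rates are summarised in the description text
    rather than in dedicated properties.
    """
    ts = stix_timestamp()
    summary = ", ".join(f"{tld}: {rate:.2f}%" for tld, rate in gtld_rates.items())
    return {
        "type": "threat-actor",
        "spec_version": "2.1",
        "id": f"threat-actor--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
        "name": alias,
        "description": f"Website defacement attack rate by gTLD ({summary})",
    }


def to_bundle(objects):
    """Collect STIX-style objects into a bundle for exchange."""
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": list(objects)}


if __name__ == "__main__":
    # Rates taken from the Hmei7 column of Table 5.
    actor = hacker_to_threat_actor("Hmei7", {"com": 78.32, "gov": 12.16})
    print(json.dumps(to_bundle([actor]), indent=2))
```

A production integration would more likely rely on a maintained STIX library and a TAXII client than on hand-built dictionaries; the sketch only shows how little structure is needed to make a case record machine-exchangeable.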

There are features such as particular messages (i.e., thanks-to, notifier, nationality, religion, and anniversary) or image and mp3 files in the web resources gathered from the zone-h.org site. Although these features are limited to only a small number of hackers in the web resources, in future research we will try to study the close-knit network among them, such as the hub hacking group, key players, and followers. Furthermore, we also plan to classify and systemize the hackers’ intents more definitively using text mining and mood detection techniques. The findings of this prospective study will contribute meaningful insights for tracing hackers’ behavioural patterns and estimating their primary purpose and intent.

Data Availability

The web-hacking dataset applied to our paper can be downloaded from the linked site below: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported under the framework of the international cooperation program managed by the National Research Foundation of Korea (No. 2017K1A3A1A17092614).

References

[1] S. S. Response, “Swift attackers’ malware linked to more financial attacks,” 2016, https://www.symantec.com/connect/blogs/swift-attackers-malware-linked-more-financial-attacks.

[2] S. S. Response, “Wannacry ransomware attacks show strong links to lazarus group,” 2017, https://www.symantec.com/connect/blogs/wannacry-ransomware-attacks-show-strong-links-lazarus-group.

[3] K. lab, “Lazarus under the hood,” 2018, https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2018/03/07180244/Lazarus_Under_The_Hood_PDF_final.pdf.

[4] Operation Blockbuster, “Destructive malware report,” 2016, https://www.operationblockbuster.com/wp-content/uploads/2016/02/Operation-Blockbuster-Destructive-Malware-Report.pdf.

[5] D. Martin and SANS Institute InfoSec Reading Room, “Tracing the lineage of DarkSeoul,” 2016, https://www.sans.org/reading-room/whitepapers/critical/tracing-lineage-darkseoul-36787.

[6] D. S. C. T. U. T. Intelligence, “Wiper malware threat analysis,” 2013, https://www.secureworks.com/research/wiper-malware-analysis-attacking-korean-financial-sector.

[7] R. Sherstobitoff, M. L. Itai Liba, and O. O. T. C. James Walter, “Dissecting operation troy: cyberespionage in South Korea,” 2013, https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-dissecting-operation-troy.pdf.

[8] N. Horton and A. DeSimone, “Sony’s nightmare before christmas: the 2014 North Korean cyber attack on Sony and lessons for US government actions in cyberspace,” 2018, https://www.jhuapl.edu/Content/documents/SonyNightmareBeforeChristmas.pdf.

[9] I. K. Lee and S. R. Ramsey, The Korean Language, State University of New York, Albany, NY, USA, 2000.

[10] V. Benjamin and H. Chen, “Securing cyberspace: identifying key actors in hacker communities,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics, pp. 24–29, Arlington, VA, USA, June 2012.

[11] Y. Lu, X. Luo, M. Polgar et al., “Social network analysis of a criminal hacker community,” Journal of Computer Information Systems, vol. 51, no. 2, pp. 31–41, 2010.

[12] J.-W. Jang, H. Kang, J. Woo, A. Mohaisen, and H. K. Kim, “Andro-autopsy: anti-malware system based on similarity matching of malware and malware creator-centric information,” Digital Investigation, vol. 14, pp. 17–35, 2015.

[13] J. W. Jang and H. K. Kim, “Function-oriented mobile malware analysis as first aid,” Mobile Information Systems, vol. 2016, Article ID 6707524, 11 pages, 2016.

[14] Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on api call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, Article ID 659101, 2015.

[15] M. L. Han, H. C. Han, A. R. Kang et al., “Web-hacking dataset for the cyber criminal profiling,” 2016, http://ocslab.hksecurity.net/Datasets/web-hacking-profiling.

[16] M. L. Han, H. C. Han, A. R. Kang, B. I. Kwak, A. Mohaisen, and H. K. Kim, “WAHP: web-hacking profiling using case-based reasoning,” in Proceedings of the 2016 IEEE Conference on Communications and Network Security (CNS), pp. 344-345, Philadelphia, PA, USA, October 2016.

[17] A. Aamodt and E. Plaza, “Case-based reasoning: foundational issues, methodological variations, and system approaches,” AI Communications, vol. 7, no. 1, pp. 39–59, 1994.

[18] D. M. L. Martins and F. B. D. Lima Neto, “Hybrid intelligent decision support using a semiotic case-based reasoning and self-organizing maps,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, no. 99, pp. 1–8, 2017.

[19] H. K. Kim, K. H. Im, and S. C. Park, “DSS for computer security incident response applying CBR and collaborative response,” Expert Systems with Applications, vol. 37, no. 1, pp. 852–870, 2010.

[20] J.-B. Lamy, B. Sekar, G. Guezennec, J. Bouaud, and B. Seroussi, “Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach,” Artificial Intelligence in Medicine, vol. 94, pp. 42–53, 2019.

[21] M. Relich and P. Pawlewski, “A case-based reasoning approach to cost estimation of new product development,” Neurocomputing, vol. 272, pp. 40–45, 2018.

[22] E. R. Reyes, S. Negny, G. C. Robles et al., “Improvement of online adaptation knowledge acquisition and reuse in case-based reasoning: application to process engineering design,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 1–16, 2015.

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011, the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on cbr and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (cpted): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of me cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson et al., “Cascading style sheets, level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack: signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.




Page 20: CBR-Based Decision Support Methodology for Cybercrime

on Communications and Network Security (CNS) pp 344-345Philadelphia PA USA October 2016

[17] A Aamodt and E Plaza ldquoCase-based reasoning foundationalissues methodological variations and system approachesrdquo AICommunications vol 7 no 1 pp 39ndash59 1994

[18] D M L Martins and F B D Lima Neto ldquoHybrid intelligentdecision support using a semiotic case-based reasoning andself-organizing mapsrdquo IEEE Transactions on Systems Manand Cybernetics Systems no 99 pp 1ndash8 2017

[19] H K Kim K H Im and S C Park ldquoDSS for computersecurity incident response applying CBR and collaborativeresponserdquo Expert Systems with Applications vol 37 no 1pp 852ndash870 2010

[20] J-B Lamy B Sekar G Guezennec J Bouaud andB Seroussi ldquoExplainable artificial intelligence for breastcancer a visual case-based reasoning approachrdquo ArtificialIntelligence in Medicine vol 94 pp 42ndash53 2019

[21] M Relich and P Pawlewski ldquoA case-based reasoning ap-proach to cost estimation of new product developmentrdquoNeurocomputing vol 272 pp 40ndash45 2018

[22] E R Reyes S Negny G C Robles et al ldquoImprovement ofonline adaptation knowledge acquisition and reuse in case-based reasoning application to process engineering designrdquoEngineering Applications of Artificial Intelligence vol 41pp 1ndash16 2015

[23] H. K. Kim, S.-K. Kim, and S.-H. Kim, “Decision support system for zero-day attack response,” Applied Mathematics and Information Sciences, vol. 6, no. 1, pp. 221S–241S, 2012.

[24] G. Horsman, C. Laing, and P. Vickers, “A case-based reasoning method for locating evidence during digital forensic device triage,” Decision Support Systems, vol. 61, pp. 69–78, 2014.

[25] G. Horsman, C. Laing, and P. Vickers, “A case based reasoning system for automated forensic examinations,” in Proceedings of the PGNET 2011: the 12th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting, pp. 26–31, Liverpool, UK, June 2011.

[26] Z. Yin, Y. Gao, and B. Chen, “On development of supplementary criminal analysis system based on CBR and ontology,” in Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 14, Taiyuan, China, October 2010.

[27] A. J. Pinizzotto and N. J. Finkel, “Criminal personality profiling: an outcome and process study,” Law and Human Behavior, vol. 14, no. 3, pp. 215–233, 1990.

[28] P. Chen and J. Kurland, “Time, place, and modus operandi: a simple apriori algorithm experiment for crime pattern detection,” in Proceedings of the 2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–3, Zakynthos, Greece, July 2018.

[29] C. J. R. Collie and K. Shalev Greene, “Examining modus operandi in stranger child abduction: a comparison of attempted and completed cases,” Journal of Investigative Psychology and Offender Profiling, vol. 16, no. 2, pp. 91–109, 2019.

[30] V. Benjamin, B. Zhang, J. F. Nunamaker Jr., and H. Chen, “Examining hacker participation length in cybercriminal internet-relay-chat communities,” Journal of Management Information Systems, vol. 33, no. 2, pp. 482–510, 2016.

[31] V. Benjamin and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” in Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, pp. 25–32, The Hague, The Netherlands, September 2014.

[32] K. Veena and K. Meena, “Identification of cyber criminal by analysing the users profile,” International Journal of Network Security, vol. 20, no. 4, pp. 738–745, 2018.

[33] F. Iqbal, B. C. M. Fung, M. Debbabi, R. Batool, and A. Marrington, “Wordnet-based criminal networks mining for cybercrime investigation,” IEEE Access, vol. 7, pp. 22740–22755, 2019.

[34] N. Qazi and B. L. W. Wong, “An interactive human centered data science approach towards crime pattern analysis,” Information Processing & Management, vol. 56, no. 6, p. 102066, 2019.

[35] N. Jain, P. Sharma, R. Anchan, et al., “Computerized forensic approach using data mining techniques,” in Proceedings of the ACM Symposium on Women in Research 2016, pp. 55–60, ACM, New York, NY, USA, 2016.

[36] P. M. Cozens, G. Saville, and D. Hillier, “Crime prevention through environmental design (CPTED): a review and modern bibliography,” Property Management, vol. 23, no. 5, pp. 328–356, 2005.

[37] H. Hassani, X. Huang, E. S. Silva, and M. Ghodsi, “A review of data mining applications in crime,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 9, no. 3, pp. 139–154, 2016.

[38] A. Sharma and S. Sharma, “An intelligent analysis of web crime data using data mining,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 3, 2012.

[39] S.-T. Li, S.-C. Kuo, and F.-C. Tsai, “An intelligent decision-support model using FSOM and rule extraction for crime prevention,” Expert Systems with Applications, vol. 37, no. 10, pp. 7108–7119, 2010.

[40] Y.-H. Tseng, Z.-P. Ho, K.-S. Yang, and C.-C. Chen, “Mining term networks from text collections for crime investigation,” Expert Systems with Applications, vol. 39, no. 11, pp. 10082–10090, 2012.

[41] A. Malathi and S. S. Baboo, “An enhanced algorithm to predict a future crime using data mining,” International Journal of Computer Applications, vol. 21, no. 1, 2011.

[42] S. Kapetanakis, A. Filippoupolitis, G. Loukas, et al., “Profiling cyber attackers using case-based reasoning,” in Proceedings of the 19th UK Workshop on Case-Based Reasoning (UKCBR 2014), Cambridge, UK, December 2014.

[43] R. Al-Zaidy, B. C. Fung, A. M. Youssef, et al., “Mining criminal networks from unstructured text documents,” Digital Investigation, vol. 8, no. 3-4, pp. 147–160, 2012.

[44] M. Zulfadhilah, Y. Prayudi, and I. Riadi, “Cyber profiling using log analysis and k-means clustering,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 7, pp. 430–435, 2016.

[45] S. V. Nath, “Crime pattern detection using data mining,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pp. 41–44, Hong Kong, China, December 2006.

[46] ITP.net, “Syria, Egypt crises spur escalation of ME cyber attacks,” 2013, http://www.itp.net/594742-syria-egypt-crises-spur-escalation-of-me-cyber-attack.

[47] A. McEnery and R. Xiao, “Character encoding in corpus construction,” in Developing Linguistic Corpora: A Guide to Good Practice, Oxbow Books Ltd., Oxford, UK, 2005.

[48] B. Bos, T. Çelik, I. Hickson, et al., “Cascading style sheets level 2 revision 1 (CSS 2.1) specification,” W3C Working Draft, 2005, http://www.w3.org/TR/CSS21.

[49] W. Stuckey, “Massive Sony breach sheds light on murky hacker universe,” 2018, http://america.aljazeera.com/articles/2014/12/24/sony-hacker-universe.html.

[50] S. Gallagher, “Sony Pictures malware tied to Seoul, “Shamoon” cyber-attacks,” 2018, https://arstechnica.com/information-technology/2014/12/sony-pictures-malware-tied-to-seoul-shamoon-cyber-attacks.

[51] J. Pagliery, “Sony hack signs point to North Korea,” 2018, https://money.cnn.com/2014/12/05/technology/security/sony-hack-north-korea-employee/index.html.

[52] K. Ketler, “Case-based reasoning: an introduction,” Expert Systems with Applications, vol. 6, no. 1, pp. 3–8, 1993.

[53] M. Rosvall and C. T. Bergstrom, “Mapping change in large networks,” PLoS One, vol. 5, no. 1, Article ID e8694, 2010.

[54] OASIS, “STIX/TAXII standards,” 2017-2018, https://oasis-open.github.io/cti-documentation.
