Familial Clustering For Weakly-labeled Android Malware ... · IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY CLASS FILES, VOL. XXX, NO. XXX, XXX 2019 1 Familial Clustering

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY CLASS FILES, VOL. XXX, NO. XXX, XXX 2019 1

Familial Clustering For Weakly-labeled AndroidMalware Using Hybrid Representation Learning

Yanxin Zhang, Yulei Sui*, Shirui Pan, Zheng Zheng, Baodi Ning, Ivor Tsang and Wanlei Zhou

Abstract—Labeling malware or malware clustering is im-portant for identifying new security threats, triaging and build-ing reference datasets. The state-of-the-art Android malwareclustering approaches rely heavily on the raw labels fromcommercial AntiVirus (AV) vendors, which causes misclusteringfor a substantial number of weakly-labeled malware due to theinconsistent, incomplete and overly generic labels reported bythese closed-source AV engines, whose capabilities vary greatlyand whose internal mechanisms are opaque (i.e., intermediatedetection results are unavailable for clustering). The raw labelsare thus often used as the only important source of informationfor clustering.

To address the limitations of the existing approaches, thispaper presents ANDRE, a new ANDroid Hybrid REpresentationLearning approach to clustering weakly-labeled Android mal-ware by preserving heterogeneous information from multiplesources (including the results of static code analysis, the meta-information of an app, and the raw-labels of the AV vendors)to jointly learn a hybrid representation for accurate cluster-ing. The learned representation is then fed into our outlier-aware clustering to partition the weakly-labeled malware intoknown and unknown families. The malware whose maliciousbehaviours are close to those of the existing families on thenetwork, are further classified using a three-layer Deep NeuralNetwork (DNN). The unknown malware are clustered using astandard density-based clustering algorithm. We have evaluatedour approach using 5,416 ground-truth malware from Drebinand 9,000 malware from VIRUSSHARE (uploaded between Mar.2017 and Feb. 2018), consisting of 3324 weakly-labeled malware.The evaluation shows that ANDRE effectively clusters weakly-labeled malware which cannot be clustered by the state-of-the-art approaches, while achieving comparable accuracy with thoseapproaches for clustering ground-truth samples.

I. INTRODUCTION

ANDROID devices have represented around 80% of allsmartphones since 2017 and are forecast to maintain

their leadership with over 85% market share by 2020 [1].The increasing popularity of Android devices has witnessed anunprecedented growth in emerging Android apps, which hasalso become the prime targets of attackers. It has been reportedthat the total number of Android malware had reached 856.52million by the end of 2018. There were more than 137.5million newly discovered malicious apps in 2018 (i.e., 350,000new malware per day) [2]. Android malware is a major sourceof cyberattacks and is a serious threat to smartphone users.

Labeling a malicious app as an unknown or a variant ofan existing family is important for identifying new threats,determining the severity of the threat, creating signatures formalware detection, malware triaging, and building referencedatasets. Labeling malware is a non-trivial task. The raw

*Corresponding author.

TABLE IMALWARE WITH EMPTY LABELS. THE TABLE GIVES THE NUMBER OF

MALICIOUS APPS WHICH DO NOT HAVE A FAMILY NAME AFTERCLUSTERING BY EUPHONY AND AVCLASS DUE TO OVERLY GENERIC RAWLABELS BEING REPORTED BY AV VENDORS. THE 9000 MALICIOUS APPSARE FROM VIRUSSHARE UPLOADED BETWEEN 03/2017 AND 02/2018.Clusters AVCLASS EUPHONY

#App # Empty-labeled # percentage #Empty-labeled # percentage

9000 3066 34.06% 1534 17.04%

TABLE IIMALWARE WITH CONTROVERSIAL LABELS. THE TABLE GIVES THE

NUMBER OF MALICIOUS APPS WHOSE TOP TWO FREQUENT FAMILIESREPORTED BY A CLOSE NUMBER OF VENDORS BASED ON PLURALITYVOTING USING EUPHONY. OF THE 7466 APPS WHICH HAVE FAMILYNAMES (EXCLUDING THE 1534 EMPTY LABELED APPS IN TABLE I),

PLURALITY VOTING IS NOT CONFIDENT IN 1790 APPS SINCE THE TWOMOST FREQUENT NAMES ARE REPORTED BY AN EQUAL NUMBER OF

VENDORS FOR EACH APP.#Gap = # vendor reporting the most frequent name -# vendor reporting the second most frequent name.

#Gap 0 1 2 3 4 ≥ 5 # Apps # Vendor

# App No. 1790 1389 825 594 438 2430 7466 67

labels reported by AntiVirus (AV) vendors are well-knownto be inconsistent (e.g., two vendors may report two aliasedfamily names for the same type of malware. solimba andfirseria are aliases) without a standard naming convention(e.g., the conventions CARO [3] and CME [4] are not oftenused by AV vendors). Although manual inspection for malwarelabeling by an expert can provide an accurate solution, it isextremely costly in practice due to the huge number of appsbeing released every day.

Existing efforts. To address these issues, the recentlyproposed clustering approaches, such as AVCLASS [5] andEUPHONY [6], perform an automatic malware labeling basedon the outputs from a collection of AntiVirus vendors inVIRUSTOTAL [7]. Due to the inconsistency among the rawlabels reported by a wide variety of AV vendors, the existingapproaches normally perform a pre-processing to extract thefamily names from the raw labels based on vendor-specificor self-defined heuristic rules [5, 6], including generic tokenremoval and alias token reduction. Plurality voting [8] is thenapplied to disambiguate the inconsistent family names.

Limitations and our observations. These raw-label-based approaches rely heavily on the original labels from theAV vendors. A substantial number of weakly-labeled malwareare unable to be clustered because their raw labels are oftenincomplete, inconsistent and overly generic, as reported byincompatible commercial vendors with different capabilities


Fig. 1. An example of malware with empty label. Only one AV vendor(Symantec Mobile Insight) reports it as malware. The raw labelAppRisk.Generisk given by this vendor is very generic. According tothe rules of AVCLASS and EUPHONY, the raw label AppRisk.Generiskis parsed as two generic tokens AppRisk and Generisk with no actualfamily information, thus returning an empty label after their generic tokenremoval phases.

in the presence of rapidly evolving malware.We define two types of weakly-labeled malware based

on the results from the state-of-the-art approaches [5, 6]with their statistics given in Tables I and II: (1) Malwarewith empty label. The tokens from the raw labels beingrecognized as generic tokens (e.g., apprisk) or unable tobe parsed by the rules of AVCLASS and EUPHONY arediscarded, resulting in weakly-labeled apps without a familyname. Figure 1 illustrates an concrete example of malwarewith empty label. In addition, for example, as listed in Table I,AVCLASS and EUPHONY are unable to produce a familyname for 34.06% and 17.04% of the total 9000 apps whichwere recently uploaded to VIRUSTOTAL between Mar. 2017and Feb. 2018, resulting in 1534 weakly-labeled malware.(2) Malware with controversial labels. The plurality votingstrategy is also hard to disambiguate inconsistency betweentwo family names reported by a close number of vendorsamong the 7466 apps which have family names extractedfrom the raw labels by EUPHONY. In figure 2, an exampleof malware with controversial labels is given. Table II showsthat the top two most frequent names are reported by anequal number of vendors for 1790 apps (24%), causing theplurality voting to be less confident for these weakly-labeledmalware compared to their strongly-labeled counterpart withan uncontroversial family name reported by the majority (e.g.,80%) of all available vendors in VIRUSTOTAL [7].

Insights and challenges. A strongly-labeled malwarewith a clear and unambiguous family name is easy to beclustered. However, relying on vendors’ raw labels as the onlysource of information for labeling weakly-labeled malware isinherently partial and shallow. Apart from the reports from AVvendors, an Android app itself is a comprehensive packagecontaining source code and meta-info (e.g., configuration andresource files), which are crucial for extracting the behaviorsof an app. This information cannot be easily obtained and isnot often considered by the existing raw-label-based clusteringapproaches. This is because inferring the malware family byanalyzing an app, in fact, is to develop another AV engine,competing with dozens of available AV vendors, which is hard

Fig. 2. An example of malware with controversial labels. There are sevenAV engines report this app as malware. According to EUPHONY, five of themare generic labels (it becomes empty labels after generic token removal byEUPHONY). The remaining two AV engines label this malware as two differentfamilies, i.e., dinehu by NANO-Antivirus and Emagsoftware bySophos AV. Thus, the family name of this app becomes controversial if themalware labeling is purely based on the raw labels produced by AV vendors.

and likely to cause inconsistent results due to lack of ground-truth family names and the unknown mechanisms of thecommercial AV vendors, whose capabilities and mechanismsvary greatly with no intermediate detection results reflectedin their reports. Thus, the raw labels generated from thesevendors are the only easy-to-use and important data sourcefor the existing clustering approaches.

Our solution. Deep representation learning (DRL) [9] isa new and promising branch of machine learning. DRL learnsthe representation of the target data via deep architectures in alayer-wise manner, through which the higher abstraction levelof the features is embedded in a lower unified representation,making it easy for later classification and clustering tasks.Recently, Hybrid Representation Learning approaches [10–13] have significantly enhanced the existing DRL techniquesin terms of both efficiency and accuracy for training andpredicting large-scale networks by extracting heterogeneousinformation in a variety of formats (e.g., texts [14] or graphs[15]) and from multiple data sources.

Inspired by the recent advances in DRL (Deep Rep-resentation Learning), this paper proposes ANDRE, a newhybrid representation learning approach to clustering weakly-labeled malware by utilizing multiple sources of data (in-cluding raw labels from AV vendors, meta-info of apps andresults from source code analysis). ANDRE jointly learns ahybrid representation that allows heterogeneous informationto be integrated into one neural network pipeline that distillsthe discriminative features for accurate clustering. ANDREis based on a new Android malware network, in whichevery node represents an app whose label contains its meta-info and the raw reports from AV vendors, and every edgedenotes the similarity between two apps inferred by ourstatic analysis which exploits a pairwise analysis of codesimilarity. Weakly-labeled malware in the network, togetherwith existing strongly-labeled malware, are fed into our hybridrepresentation learning model to embed all the nodes on thenetwork into a continuous and low-dimensional space thatpreserves comprehensive heterogeneous information.

The learned representation is then used for our outlier-aware clustering to partition the weakly-labeled malware into


Fig. 3. The overview of ANDRE. ni represents a single malicious app, and ci denotes a known family name if ni is a strongly-labeled. ci is empty if ni isweakly-labeled for prediction purposes. wj denotes a word in the meta-info associated with a node.

known and unknown families. The malware whose maliciousbehaviours are close to the existing families on the network,are further classified using a three-layer Deep Neural Network(DNN). The unknown malware are clustered using a standarddensity-based clustering algorithm.

The key contributions of our paper are as follows:• We present a new hybrid representation learning approach

to cluster weakly-labeled Android malware.• We propose a new representation by successfully pre-

serving heterogeneous information, including raw labelsfrom AV vendors, meta-info and results from source codeanalysis. This provides a compact and low-dimensionalrepresentation for effective Android malware clustering.

• We have conducted a comprehensive evaluation using5,416 ground-truth samples from the Drebin dataset and9,000 malware from VIRUSSHARE, uploaded betweenMar. 2017 and Feb. 2018, consisting of 3324 weakly-labeled malware. The results show that our approach hascomparable accuracy to the state-of-the-art approachesin clustering ground-truths, and can effectively clusterweakly-label malware which are unable to be clusteredby AVCLASS and EUPHONY.The rest of the paper is structured as follows. Section II

defines the malware clustering problem and introduces theoverall framework of ANDRE. Section III details our approachincluding feature extraction, network construction, representa-tion learning and outlier-aware clustering.

II. PROBLEM DEFINITION AND FRAMEWORK OVERVIEW

A. Problem Definition

An Android malware network is represented as G =(N,E,W,C), where the node set N = {n1, n2, . . . , n|N |}denotes a set of Android malware (including both strongly-and weakly-labeled apps), and an edge ei,j = (ni , nj) ∈ Ebetween two nodes encodes the code similarity between two

apps. Each node ni is associated with content information diconsisting of a sequence of word tokens wj ∈ di extractedfrom the app’s meta-info and the raw labels produced by AVvendors. We use W ={w1, . . . , w|W |} to denote the words ofall nodes in this network.

Every node ni has a label li ∈ C = L ∪ U , whereU denotes a set of labels with no family name for pre-diction purposes, and L are known family names of thestrongly-labeled malware (either from ground-truths in Drebinor downloaded from VIRUSSHARE with each app’s uniquefamily name reported by over 80% of all available vendors).The numbers of ground-truth and weakly-labeled apps areconfigurable in the network. If the label set L = ∅, i.e., C=U ,the representation learning becomes purely unsupervised, andour proposed solution is still valid but imprecise.

Our hybrid representation learning problem is formulatedas maximizing an objective function J (Equation 1), whichaims to jointly learn a k-dimensional vector Vni ∈ Rk (k isa smaller number) for each node ni in the network, such thatnodes, which are neighbors based on the network topologyor have similar meta-info or family names are close to oneanother in the latent embedding space.

J =α1

|N |∑i=1

|N |∑j=1

AN◦N (Vni ,Vnj )+α2

|N |∑i=1

|W |∑p=1

AN◦W(Vni ,Vwp)

+α3

|L|∑q=1

|W |∑p=1

AL◦W(Vwp,Vcq )

(1)where Vni

, Vwpand Vcq denote the learned representation

vectors of the three entities, i.e., node ni, word wp and familycq in the representation space Rk. AN◦N , AN◦W and AL◦Ware affinity functions to capture the correlation between twoentities in Rk. α1, α2 and α3 are the weights that balancenetwork structure, meta-info, and family information (α1 +α2


+α3 = 1).

B. Framework Overview

Figure 3 gives the overview of our framework, whichconsists of the following three major components.

1. Feature extraction. There are two steps for extractingfeatures. (1) Code similarity analysis. ANDRE implements apairwise comparison scheme to dissect the similarities betweenapps. Two apps are provided as inputs and our methodyields a similarity profile for each pair. The similarity profilesummarizes similarity facts relating to similarity scores at filelevel, which means calculating similarity score for two sourcecode files. The method builds an invertd-index to quicklycalculate the similarity score of a pair of Android sourcecode files. In addition, there is also a filtering heuristics toreduce the size of the index, which reduces the number of pairsneeded to evaluate the similarity scores. A large portion of anAndroid app contains standard Android and safe third partylibraries [16], which can be seen as noise in our representationlearning. Following [16], we maintain a whitelist of commonlibraries to remove those library-related code segments fromthe application code of a malicious app in our code similarityanalysis. Whitelist is a standard way for reducing noise incode similarity analysis. In addition, whitelist also provides aflexible and customized way for adding any new APIs whichare safe.

(2) Analyzing the Android manifest files. We extractmeta-info of each malware app from its manifest files, includ-ing package names, API versions, launcher activities, permis-sions, etc, which are associated with the corresponding nodein the network. In addition, the raw labels from the existingAV vendors are also attached to each node in our network.Finally, all the ground-truths (strongly-labeled) malware arelabeled with their unique family names.

2. Hybrid representation learning. An Android networkG is constructed from the code similarity analysis, whereeach edge encodes the similarity between node ni and nj .The meta-info and raw labels associated with node ni arerepresented by di. Our network G contains both weakly- andstrongly-labeled (as the ground-truths) malware nodes withconfigurable portions of both kinds. The Hybrid (or multi-modal) Representation Learning (HRL) module [11, 17, 18]jointly embeds nodes N , words W from the meta-info, andfamily names L in a low-dimensional space, so that theaffinity of heterogeneous information can be captured. Themalware clustering can be accurately performed based on thecomprehensive representation space via Equation 1, throughwhich the following three relationships are preserved.

(1) Inter-app correlation AN◦N (Vni,Vnj

), which cap-tures the relationship between two apps via a neural networkbased on the randomly generated sequences (random walks)from the network structure [19], where each sequence s = n1→ n2 → ...→ nn can be seen as a phrase in natural languagemodel. Given a node nj in each random walk sequence swithin a sliding window b, the neural network maximizesthe log-likelihood of observing a set of neighboring nodes{nj−b, nj−b+1, · · · , nj+b−1, nj+b}, so that the representation

vectors Vni and Vnj are close to each other if ei,j ∈ E, i.e.,AN◦N (Vni

,Vnj) is maximized.

(2) Node-word correlation, which maximizes the co-occurrence of a node and a word in the meta-info, i.e.,maximizing the affinity value of AN◦W(Vni ,Vwp) in theembedding space, modeling the fact that if a word wp fromthe meta-info appears in node ni, then the two vectors Vni

and Vwpare near in the representation space.

(3) Label-word correlation, which maximizes the co-occurrence of a family name and a word in the meta-info,i.e., maximizing AL◦W(Vwp

,Vcq ), such that the label cq andword wp are forced to be closed to each other in the resultingrepresentation space.

Lastly, the above three relations are jointly learned ina unified mode in an iterative manner. We further employa hierarchical softmax modeling [20] to reduce the timecomplexity of our model, enabling ANDRE to scale to largemalware datasets.

3. Outlier-aware weakly-labeled malware clustering.Given our learned hybrid representation space Rk, this stepperforms prediction of a weakly-labeled app in the networkG. When predicting its family name, this malware may fallinto an existing family in L from the strongly-label malwarein G or it may be an unknown type of malware with differentbehaviors which are different from any known families in Lsince the unknown malware types do exist and the ground-truth labels from strongly-labeled malware in the networkmay be limited. To address this issue, ANDRE performs anoutlier-aware clustering that first employs outlier detection topartition all the weakly-labeled malware apps into (1) inlierapps whose behaviors are close to those of known families, and(2) anomalous apps that unlikely belong to any existing familyin the network. Malware apps in the outlier set are clusteredusing a density-based algorithm, while apps in the inlier set arefed into a designed neural network to train the multi-classifierfor classifying these malware into known families. The neuralnetwork consists of three layers, where the first two 1024-nodelayers comprise a dropout followed by a dense layer with aparametric rectified linear unit (ReLU) activation function inthe first two layers and the sigmoid function in the third layer,which is also used for prediction.

III. OUR APPROACH

This section introduces the detailed approaches of AN-DRE, including Android malware feature extraction, hy-brid representation learning and outlier-aware clustering forweakly-labeled malware.

A. Android Malware Feature Extraction

The aim of our feature extraction is to build the networkvia code similarity analysis between two nodes (apps) andassociate each node with its meta-info extracted from itscorresponding app.

Our source code analysis and meta-data extraction arecomplementary. Performing code similarity analysis on possi-ble malicious code segments from two programs is helpful forgrouping malware apps with similar runtime behaviors. The


meta-info of an Android app, including permissions, Androidservices, activities, providers, receivers, intent filters, securitysettings and referenced libraries are important building blocksfor understanding about the special characteristics of an app.Combining code similarity and meta-info for constructing thenetwork provides more comprehensive understanding of apprelationships.

1) Code Similarity Analysis: Directly applying the ex-isting code clone analysis, such as SOURCERCC [21] andDECKARD [22] on Android apps to obtain their similarityinformation is ineffective, because an Android app commonlycontains a large portion of safe library code, either by invokingAndroid SDK APIs or third party libraries. The maliciouscode segments which reflect the malware type may be hiddenin the application code. The safe (library) code segmentsmay become noisy, resulting in inaccurate identification ofcode similarity between the two malware families. In addition,some malware apps are developed by repackaging an existingbenign app [23]. By injecting two different types of maliciouscode into one benign app, the two repackaged apps can beof different malware families. Therefore, the benign codesegments from Android SDK and third-library are the majornoises, impeding analyzing code similarity to differentiate thetwo malware families.

Our code similarity analysis consists of four steps. First,all the apks are decompiled using dex2jar by convertingtheir original DEX files to Java files for our source codeanalysis. Next, we perform a common library removal toreduce as many noise as possible, and our token-based codesimilarity analysis is performed to group apps based on file-level similarity. We then perform fine-grained code-block-levelsimilarity analysis by considering existing available maliciouspayloads [16, 24] to refine the similarity scores between twoapps. Lastly, an edge in the network is connected between twoapps if their similarity score is above a pre-defined threshold.

Common library removal. Previous study [25] showsthat over 60% of Android application code (in terms of low-level bytecode instructions after compilation) is contributedby library code. To address the noises introduced by a largeportion of library code when conducting code similarity anal-ysis, we follow [16, 26–28] to perform a common libraryremoval by maintaining comprehensive whitelists of commonlibraries [29] to exclude library-related code segments prior toour similarity analysis.

Code similarity. Our code similarity analysis appliesa bag-of-words-based code clone approach [21], which hasbeen shown to be the most efficient strategy and has a goodprecision for scaling to large code bases. Tokenization is firstperformed to remove comments, white space, and terminal. Atoken is extracted from each source file into Java keywords, lit-erals, and identifiers. A string literal is split on whitespace andoperators are not included. Tokens in known malicious payloadcode segments [16, 24] are given higher weights for codesimilarity analysis. The source code of an Android app n isrepresented as a set of code blocks (basic blocks of the control-flow graph of a program) Source(n) = {B1, ..., Bnum} witheach block Bi denoting a bag-of-tokens Bi = {T1..., Tk}.One token may appear multiple times in a block, therefore

each token is qualified with its occurrence frequency insidea block, Tj = (token, frequency) to differentiate betweenthe frequency of words in the bag-of-words model. Formally,given two apk files Ax and Ay , a similarity function f , and athreshold θ, the aim is to find all the code block pairs Ax.Band Ay .B s.t f(Ax.B,Ay.B) ≥ [θ ·max(|Ax.B|, |Ay.B|)].

There are several choices of similarity function for mea-suring the similarity between two code pieces, we use theJaccard index, or Jaccard similarity, which is defined as thesize of intersection divided by the size of the union of twosets. The similarity of two code blocks Bx and By is definedas follows:

J(Bx, By) =|Bx ∩By||Bx ∪By|

=|Bx ∩By|

|Bx|+ |By| − |Bx ∩By|(2)

2) Meta-Info Extraction: Our meta-info is mainly ex-tracted from AndroidManifest.xml, which is the keyfile at the root of an Android project. It provides allthe essential information of an app to the building tools,Android OS, and app stores. Once an app is launched,AndroidManifest.xml is the first file to be consulted bythe Android system, providing the first-hand information tounderstand the characteristics and security settings of the app.

The detailed content of a manifest file is illustratedin Table III, including (1) permissions, which are re-sponsible for protecting the application from accessing anyprotected parts, (2) instrumentation classes, whichprovide profiling and other dynamic monitoring informa-tion, (3) application-level Android APIs that anapp is going to use, (4) four types of Android compo-nents: activity, service, content provider, andbroadcast receiver. The names of an app’s compo-nents may help identify the known malware components.(5) hardware component, which is helpful to identifymalicious behaviors reflected by access requests to spe-cific device components, e.g., touchscreen, camera, or sen-sors. (6) intent and intent filter, which can beused to trigger malicious activities, thus it is also neces-sary to be collected, and (7) package name, version,referenced libraries.

B. Android Network Representation Learning (ANDRE)

This section presents our malware representation learningwhich jointly exploits the network structure, the meta-info,and vendors’ labels to embed heterogeneous information intoa latent feature space for each node in the network.

The process of ANDRE consists of two steps:• Network node (app) sequence generation with random

walks. This step randomly generates a set of walks on theAndroid malware network, where each sequence startswith a node ni and randomly jumps to other nodes ateach iteration. The random walk sequences capture thetopological relationships between nodes.

• Neural network architectures. This step encodes eachnode (app) into a vector representation by preservingmulti-source information from (1) an android sequencewhich captures the inter-app correlation, (2) the meta-info which represents the node-word correlation, and


TABLE IIIMETA-DATA INFORMATION EXTRACTED FROM AN ANDROID APP

Type DescriptionPermission Permissions constitute an important security feature in Android apps. A user has to grant them to install

applications or access particularly sensitive data.Instrumentation Declares the code used to test this package or other package directive components. A manifest can contain

zero or more of this element.Application Contains the root node of the application-level component declaration in the package and contains global

and default properties in the application, such as tags, icons, themes, necessary permissions, and more.Activity Activity is the primary Android component used to interact with users.Service A service is a component that can run any time in the background.Content provider A content provider is a component that is used to manage persistent data and publish it to other

applications.Broadcast receiver The receiver enables the application to obtain data changes or operations that occur even if it is not

running.Intent/intent-filter The IntentFilter is formed by declaring the Intent values supported by the specified set of components.Action The intent action supported by the component.Category Intent Category supported by the component.Type Intent data MIME type supported by the component.Schema Intent data URI scheme supported by the component.Authority Intent data URI authority supported by the component.Path Intent data URI path supported by the component.

Fig. 4. The DEEPWALK (Skip-Gram) method vs our proposed hybrid learningmethod. The DEEPWALK approach learns the network representation basedonly on the network structure. Our hybrid method couples two neural networksto learn the representation from three parties (i.e., node structure, meta-info,and label information from strongly-labeled malware) to capture the inter-node, node-word, and label-word relationships. The input, projection, andoutput indicate the input layer, hidden layer, and output layer of a neuralnetwork model.

(3) the label information which records the label-wordcorrelation. Lastly, a hybrid representation model is builtby jointly learning the above three correlations.

1) Model Architecture: The proposed joint neural net-work model is illustrated on the right of Figure 4. It has keyproperties:

Inter-node correlation. Assuming that neighbor nodesare highly correlated and statistically dependent on each other,the upper layer of ANDRE exploits the network structure froma set of generated random walk sequences.

Node-word correlations. The lower layer of ANDREcaptures the relations between nodes ni ∈ N and the meta-info, which is considered as a document with a set of wordswj ∈W . If a word is used to describe a node, the node andthe word are correlated.

Label-word correlation. We use the labels ci ∈ C fromstrongly-labeled malware (a ground-truth) and its correspond-ing node ni∈N as input and learn the label vector and wordvector simultaneously, so that the label-word correlation can

be well captured to improve the learning model.Joint training model. We integrate these two layers by

the node ni in the model. The three kinds of correlations arejointly learned in a unified way, so that they can benefit eachother and ultimately converge to a steady stage.

2) Model Details: Inter-node correlation. To capture theinter-node correlation, we assume that the nodes with linkages(edges) in a network are statistically dependent on each other.This idea is inspired by DEEPWALK [19], which extends theSKIP-GRAM [30] algorithm.

The SKIP-GRAM model [30] is a language model thatlearns the word embedding by exploiting word orders in asequence and assuming that words close to one another arestatistically more related. Due to its simplicity and efficiency,The SKIP-GRAM model is widely used in many NLP tasks.Given a word wm, Skip-Gram maximizes the log-likelihoodof wm’s surrounding words within a certain window b.

J1 =

T∑m=1

logP(wm−b : wm+b|wm) (3)

where b is the window size and wm−b : wm+b is a wordsequence in a window, which excludes the word wm itself.The probability P(wm−b : wm+b|wm) is defined as:∏

−b≤j≤b,j 6=0

P(wm+j |wm) (4)

Equation (4) makes the assumption that the words ina window wm−b : wm+b are independent of each other.P(wm+j |wm) is defined as:

P(wm+j |wm) =exp(V>wm

V′

wm+j)∑W

w=1 exp(V>wmV′

w)(5)

where Vw and V′

w are the input vector and output vector ofw. Once the training is finished, the input vector Vw is usedas the final representation of word w.


DEEPWALK [19] extends SKIP-GRAM and employs theneural language model to learn the embedding for a network.Specifically, DEEPWALK generates a set of random walk se-quences S based on the network structure. Each random walks = ni−b → ... ni ... → ni+b is regarded as a sentence witha window of size b in the natural language model, and eachnode ni is considered as a word. Then DEEPWALK employsthe SKIP-GRAM algorithm [30] on the node sequences toobtain a latent representation for each node. This is achievedby solving an objective function which maximizes the log-likelihood of the observed nodes on a target node ni for allrandom sequences s ∈ S.

J2 =

|N |∑i=1

∑s∈S

logP(ni−b : ni+b|ni)

=

N∑i=1

∑s∈S

∑−b≤j≤b,j 6=0

logP(ni+j |ni)

(6)

The DEEPWALK neural network is illustrated on the leftof Figure 4. It exploits the network structure for learningrepresentations without considering other information (suchas meta-info and label-info), which is exploited by our model.

Node-word correlation. To enable joint learning togetherwith inter-node correlations, ANDRE captures the node-wordcorrelations as depicted in the right lower panel in Figure 4,which applies a DOC2VEC style model built on top of SKIP-GRAM and CBOW (Continuous Bag-Of-Words) [30], with theaim of learning the distributed representation for a document.In our setting, for a node ni, we learn the input node vectorVni

and output word vector V′

wjbased on meta-info di =

[w0, w1, . . . , w|di|] associated with ni using a sliding windowof size b for repeatedly picking a sequence of words centeringwj within di.

Label-word correlation We enable semi-supervising byleveraging uncontroversial family names from strongly-labeledmalware, benefiting from the knowledge of existing AVvendors. By applying DOC2VEC, we use the ground-truthlabel information together with the meta-info as inputs andsimultaneously learn the input label vector Vci of node niand output word vector V′wj

based on the meta-info associatedwith ni, modeling the correlation between the nodes’ labelsand the nodes’ meta-info.

Since the first correlation is captured via DEEPWALK andthe last two correlations are modeled via DOC2VEC, it canintuitively be seen that our hybrid learning couples these twomodelings using two panels, as illustrated in Figure 4. Thenode ni shared by both panels indicates that ni is influencedby the two models to produce a hybrid representation whichpreserves the heterogeneous information and the relationsbetween the three parties (e.g., node sequences from randomwalks, word sequences from meta-info, and label information).

Joint learning model. Given a network G consisting ofnodes N = {n1, n2, . . . , n|N | }, our hybrid learning model inEquation 7 implements the pre-defined objective function inEquation 1 by considering the three aforementioned correla-tions. The aim is to jointly learn the following three affinityfunctions with random walks S generated for node ni, and a

sliding window for a sequence of nodes or words.

J = (1−α)|N |∑i=1

∑s∈S

∑−b≤j≤b,j 6=0

AN◦N (ni+j , ni)

+ α

|N |∑i=1

∑−b≤j≤b

AN◦W(wj , ni)+α

|L|∑i=1

∑−b≤j≤b

AL◦W(wj , ci)

(7)where α is the weight that balances the network structure,meta-info, and label information. b is the window size of anode or a word sequence, and wj indicates the j-th word in acontextual window. Given Equation 7, the first term computesthe affinity function A(ni+j , ni), and the log-likelihood prob-ability of observing surrounding nodes given node ni, usingthe log-likelihood softmax functions as Equation 8:

AN◦N (ni+j , ni) = logP(ni+j |ni) = logexp(V>ni

V′

ni+j)∑|N |

x=1 exp(V>niV′

nx)

(8)where Vnx

and V′

nxare the input and output vector repre-

sentations of node nx. |N | is the number of the nodes in thenetwork. The probability of observing contextual words wi−b: wi+b given current node ni is:

AN◦W(wj , ni) = logP(wj |ni) = logexp(V>ni

V′

wj)∑|W |

x=1 exp(V>niV′

wx)(9)

where V′

wjis the output representation of word wj , and |W | is

the number of distinct words on the whole network. Similarly,the probability of observing the words given a class label ciis then defined as:

AL◦W(wj , ci) = logP(wj |ci) = logexp(V>ciV

′

wj)∑|W |

x=1 exp(V>ciV′wx

)(10)

Equations 9 and 10 reflect the correlation between nodesand meta-info, and the correlation between meta-info andlabel information, so that ANDRE jointly learns the outputrepresentation vector V′

wjof word wj , which will propagate

back to influence the input representation of ni ∈ N in thenetwork as also illustrated in Figure 4. As a result, the noderepresentation (i.e., the input vectors of nodes) is enhanced byboth the meta-info and the label information.

3) Training Hybrid Learning Model: We train our modelEquation 7 using stochastic gradient ascent, which is a stan-dard solution for training. However, computing the gradientsin Equations 8, 9 and 10 is expensive, as the computationis proportional to the number of nodes and words in thenetwork G. To address this issue, we resort to hierarchicalsoftmax [11, 20], which reduces the time complexity toO(|W |log(W ) +Nlog(N)).

The hierarchical model in our algorithm uses two binarytrees, one with distinct nodes as leaves, and another withdistinct words and labels as leaves. The trees are built usingHUFFMAN algorithm, so that each vertex in a tree has a binary


code, in which more frequent nodes (or words) have shortercodes. There is a unique path from the root to each leaf ina tree. The interval vertices of the trees are represented asreal-valued vectors with the same dimension as the leaves.Instead of enumerating all nodes in Equation 7 in each gradientstep, we only need to evaluate the path from the root to thecorresponding leaf in the Huffman tree. Suppose the path to theleaf node ni is a sequence of vertices (l0, l1, ..., lh) reachingni, where lh is ni and l0 is the root node in the Huffman tree.The probability is computed as follows:

P(ni+j |ni) =h∏

t=1

P(lt|ni) (11)

P(lt|ni) = σ(V>niV

′

lt) (12)

where P(lt|ni) is a binary classifier with σ(.) as the sigmoidfunction, and V′

ltis the representation of node lt, which is the

parent of ni in the Huffman tree. Thus, the time complexityis reduced to O(NlogN). Likewise, we can use hierarchicalsoftmax technique [14] to compute the words and labels inEquations 9 and 10.

C. Outlier-aware Android Malware Clustering

Outlier detection (a.k.a. anomaly detection [31]) detectsrare samples that do not meet the expected patterns or behav-iors of the majority of samples in the dataset. The techniqueis essentially a density-based outlier detection algorithm thatconstructs a graph of the data using nearest neighbors, insteadof calculating local densities. A malware sample whose ma-licious behaviours are closed to those of known families inthe network G will be classified into an existing family in L,otherwise, it will be identified as an unknown type of malware.This is due to the fact that the unknown malware types do existand the number of ground-truth labels from strongly-labeledmalware in the network may always be limited.

Finally, the outlier set of the collected malware apps isclustered to produce unknown family clusters, while the inlierset is fed into the designed Multi-layer Perceptron (MLP)classifier (Supervised Neural Network) to train the multi-classification model, thereby classifying these malware intoknown families.

IV. EVALUATION

The objective of our evaluation is to show that ANDREis effective in clustering weakly-labeled malware that cannotbe clustered by AVCLASS and EUPHONY, while achievingcomparable accuracy with the state-of-the-art tools evaluatedusing the ground-truth samples in the Drebin dataset.

A. Experimental Setup and Implementation

To evaluate the effectiveness of ANDRE, 5,416 ground-truth malware samples from the Drebin dataset [32] and 9,000recent malware from VIRUSSHARE [33] (uploaded betweenMarch 2017 and October 2018) were used. The Androidapks were first decompiled to Java files using dex2jar forour source code analysis. Note that there are 5560 malware

samples in the Drebin dataset, but only 5416 samples wereused since the remaining 144 samples could not be correctlydecompiled by dex2jar. Out of the 9,000 recently collectedmalware from VIRUSSHARE, 4314 are smaller than 5MB,1969 are between 5MB and 10MB, and 2717 are greater than10MB.

To build the edge relations between nodes (apps) of thenetwork, we adopted the bag-of-words model [21] to performcode similarity analysis for all pairs of malware apps afterexcluding their library-related code, following [16] by usinga self-maintained whitelist containing common third-partylibraries [29] and the available malicious payload [16, 24].An edge is connected between two nodes if their similarityscore is above 0.8 (with the maximum score 1 using Jaccardsimilarity in Equation 2 after normalization).

For each app in the network, we use the ANDROID ASSETPACKAGING TOOL [34] to extract meta-info of an app. Theraw labels reported by Anti-Virus vendors were obtained byuploading an app or its hash value to VIRUSTOTAL [7], anonline malware scanning service which integrates a set ofexisting commercial Antivirus vendors, such as Comodo andKaspersky. Given VIRUSTOTAL’s reports, we can identifyweakly-labeled malware, including malware with no label(1534) and malware with controversial family names (1790)by using EUPHONY. The results are shown in Table I andTable II.

We implemented DEEPWALK, DOC2VEC and our hybridlearning models to embed the constructed network usingPython. Our outlier detection approach was implementedbased on TOPOLOGICAL ANOMALY DETECTION (TAD) [35].To classify the inlier known families, we applied and com-pared different classification algorithms, including the neuralnetwork classifier MULTI-LAYER PERCEPTRON (MLP) [36]and traditional classification algorithms SVM [37], KNN [38],DSTREE [39] and RDFOREST [40] whose implementations areavailable from Scikit-learn library [41].

B. Evaluation Methodology

We evaluate the effectiveness of our approach fromthe following four aspects: (1) comparing ANDRE with thestate-of-the-art tools using Drebin’s ground-truth samples (atypical dataset widely used by existing clustering tools) toshow the effectiveness and applicability of our representationlearning approach for clustering all types of malware (notlimited to weakly-labeled malware) with good precision, (2)comparing ANDRE against different baseline settings (i.e.,different representation learning models including DEEPWALKand DOC2VEC) to show whether our hybrid learning modelcan obtain a promising level of accuracy by preserving het-erogeneous information, (3) comparing different classificationmodels given the learned representation, and (4) validating ouroutlier-aware malware clustering results with comprehensivecase studies to show ANDRE’s effectiveness in clusteringweakly-labeled malware.


Fig. 5. Performance comparison with respect to different learning models. • represents the DEEPWALK method, • represents the DOC2VEC method, •represents our ANDRE method. In the three subgraphs, the horizontal axis represents the ratio of the training set, which is 10% to 70%. The vertical axes inthe three subgraphs represent the fractions of accuracy, macro and micro score respectively.

TABLE IVCOMPARISON WITH AVCLASS [5] AND EUPHONY [6]. THE ACCURACY,

F1 AND RECALL DATA ARE DIRECTLY FROM THEIR PAPERS

Method Accuracy F1 Score Recall

AVCLASS 95.2% 93.9% 92.5%EUPHONY 95.0% 95.5% 96.1%

ANDRE 93.1% 93.3% 93.1%

C. Comparing with the state-of-the-art tools and differentbaseline settings

Comparing with non-learning approaches. We compareANDRE with the two state-of-the-art tools AVCLASS andEUPHONY by using ground-truth samples from Drebin to showANDRE’s applicability and effectiveness in clustering mal-ware which are not weakly-labeled. Our hybrid representationlearning approach randomly selects 70% of malware as thetraining set and 30% of samples in the dataset for testing. TableIV shows that ANDRE achieves 93.1% for accuracy, 93.3%for F1 score and 93.1% for recall, which are comparable orslightly less than the results of the two tools (reported in theirpapers [5] and [6]).

Note that we do not claim that our learning-based ap-proach is superior to the non-machine-learning methods inclustering malware apps that are not weakly-labeled. This isbecause plurality voting does work well for strongly-labeledsamples with uncontroversial raw labels from AV vendors afterlabel preprocessing, such as generic token removal and aliasdetection [5]. Rather, our aim is to show that our approachsuccessfully preserves necessary information for effective clus-tering and can achieve comparable accuracy with the existingtools for ground-truth samples. Due to the nature of thelearning-based algorithm, its performance depends on severalfactors, such as ratio selection of the training and testingsets, clustering algorithm, network structure (from the codesimilarity analysis) and feature attentions (importance of dif-ferent meta-information). The following subsections conductcomprehensive evaluations and discussions of these factors.

Comparing with different representation learning mod-els. To demonstrate that our hybrid network representationlearning is able to achieve better performance, we com-pare ANDRE with the mainstream learning methods, i.e.,DOC2VEC [14] (a paragraph vectors algorithm for embeddingtexts and phrases in a distributed vector using neural networks)

TABLE VCOMPARISON WITH DIFFERENT CLASSIFIERS

Classifiers Accuracy F1 Score Speed(Seconds)KNN 0.8677 0.4468 225SVM 0.8295 0.2580 256

Random Forest 0.4911 0.0491 224Decision Tree 0.4031 0.0357 230MLP Classifier 0.9126 0.6079 356

and DEEPWALK [19] (which applies language modeling tocapture the topological relationships between nodes in a socialnetwork).

Figure 5 shows our comparison results of the threeapproaches. By preserving the network structure (node-nodecorrelation) and the co-occurrence relations between appsand their corresponding meta-info (node-word and label-wordcorrelations), our approach performs better than the individuallearning methods, i.e., DEEPWALK and DOC2VEC. In general,the accuracy values and F1 scores of the three approachesincrease when the sizes of the training sets increase. The trendbecomes stable and gradually reaches a convergence when thetraining set occupies around 70% of the total samples. ANDREachieves up to 93% in accuracy, while DEEPWALK andDOC2VEC can only achieve 89.5% and 70.1% respectively.

Comparisons using different multi-class classifiers.Given the learned representation, we compare the classifica-tion performance when applying MULTI-LAYER PERCEPTRON(MLP), a deep neural network classifier and conventionalclassification algorithms, including SVM (separating hyper-plane in the feature space to maximize the interval betweenpositive and negative samples), KNN (distance measurementbetween different eigenvalues), RANDOM FOREST (integratingmultiple trees through Ensemble Learning) and DECISIONTREE (mapping relationships between object attributes andobject values through a tree structure).

To fairly compare the results of different classifiers, wekeep their underlying network embeddings unchanged (i.e.,by associating all meta-info for each node, using the samenetwork structure and same representation space with the di-mension K=400). As shown in Table V, the accuracy of MLPclassifier exceeds that of all other methods, demonstratingthe recent advances in deep learning that enable the precisecapture of input-output data relations correlations, i.e., theinput features are connected to the neurons of the hidden layer,and the neurons of the hidden layer are then linked with the


TABLE VIPERFORMANCE COMPARISON REGARDING DIFFERENT K , I.E., THE

NUMBER OF DIMENSIONS IN THE REPRESENTATION SPACE.

Number of K Accuracy Macro-F1 Score Micro-F1 Score5 0.7230 0.2360 0.723010 0.8505 0.4661 0.850530 0.9022 0.5851 0.902250 0.9083 0.6115 0.9083

100 0.9151 0.6027 0.9151200 0.9138 0.6189 0.9138300 0.9169 0.5927 0.9169400 0.9175 0.6282 0.9157

Fig. 6. Performance comparison (permissions vs all meta-info)

neurons of the input layer. The multi-layer perceptron layeris fully associated (the full connection means that any singleneuron in the upper layer is connected to all neurons in thenext layer).

Comparisons when selecting K-dimensional represen-tation space. Table VI shows the results when the number ofrepresentation dimensions K are varied from 5 to 400 in ourhybrid representation learning. There are noticeable increasesfor accuracy, macro-F1 score and micro-F1 score when Kis increased from 5 to 50. The subsequent differences arenegligible especially when K is greater than 100.

All meta-info vs. permission information. In this exper-iment, we choose the representation dimension K = 400 forour MLP classifier. This experiment aims to show that of allthe meta-info components, permission information makes themajor contribution to the accuracy of our clustering results.All other components also contribute to the improvement inaccuracy. As illustrated in Figure IV-C, the mean accuracyis 90% when only permission strings were used for ournetwork embedding. The accuracy increases to 93.4% whenall elements of meta-info are used. The improvement showsthat the complete meta-info includes not only permissioninformation, but also many other types of useful information,such as activity, intent, service, etc., which represents muchricher text information for capturing the correlations of appswhose behaviors are similar.

D. Result analysis for weakly-labeled malwareWhen clustering the weakly-labeled malware from

VIRUSSHARE, we apply our outlier-aware clustering on topof the hybrid representation learned from the constructedmalware network, which consists of 5676 strongly-labeled and3324 weakly-labeled samples.

Our outlier detection identifies that 3242 malware appsare close to existing malware families, falling into the inlier

TABLE VIIRESULTS OF OUTLIERS AND INLIERS OF WEAKLY-LABELED MALWARE

INCLUDING MALWARE WITH EMPTY LABELS AND CONTROVERSIALLABELS (I.E., THE TOP TWO MOST FREQUENT FAMILY NAMES REPORTED

BY AN EQUAL NUMBER OF VENDORS USING EUPHONY).

#Weakly-labeled #Malware in outlier #Malware in inliermalware

Empty- Controversial- Empty- Controversial-labels labels labels labels

# 3324 35 47 1499 1743

zone, and 82 apps have behaviors that are unlikely to corre-spond to those of the known families, thereby falling into theoutlier zone.

Table VII gives the inliers and outliers of the 3324weakly-labeled malware. The number of outlier samples isrelatively small, comprising only 2.5%. A major portion of theweakly-labeled malware are detected as inliers. For the 1534empty-labeled malware (Table I), 35 and 1449 are outliers andinliers respectively. Of the 1790 malware (Table II) who havecontroversial family names (i.e., the top two most frequentfamily names reported by an equal number of vendors usingEUPHONY), 47 and 1743 are outliers and inliers respectively.

Figure 7 shows the outlier results with 58 outlier (un-known) families clustered by anomaly detection. 47 out of 58outliers contain only 1 malware sample, and the remaining11 include more than 1 candidate. The largest outlier clustercontains 9 samples.

For inliers, there are 60 families out of the 176 knownfamilies. The distribution is uneven. The malware apps inthe top 10 families occupy the majority (88.26%) of all theweakly-labeled malware. Regarding malware in the inlier,we observe that our method is not limited by the numberof malware samples in the training set. For example, theplankton family ranks first in terms of the prediction resultin the Drebin dataset, whereas it only ranks as the third largestfamily. Similarly, Vdloader ranks fifth according to ourexperiment result, while it only ranks 167th in the Drebindataset with only 16 samples inside.

Due to the lack of ground-truths for weakly-labeledmalware, manual checking is the best and only way to validatethe results [24]. Following [16], we have also conductedmanual inspections (costing two people for four weeks) of themalware inside the same cluster to check the similarity of theirmalicious behaviors (by running and looking into the sourcecode and meta-info of the apps) to validate ANDRE’s clusteringresults. All the malware in outliers were checked since allthe outlier clusters have very small sizes. For inlier families,we randomly checked 3 apps for each family if the familycontains more than 3 apps, otherwise, all the apps within thefamily were checked. This process includes running the app(in a virtual Android environment) and checking the meta-info, vendors’ raw labels and malicious code (decompiled bydex2jar) to understand their familial behaviors. We willselect and illustrate in detail the representative apps within 3different clusters in the inlier set in the following subsection.


Fig. 7. Weakly-labeled malware detected as outliers (unknown families in thenetwork). There are 58 outlier clusters; their corresponding malware numberslisted on the right-hand side.

E. Case studies for weakly labeled malware in inlier

This section conducts in-depth case studies to demon-strate representative inlier weakly-labeled malware clusteredby ANDRE. Since every malware has a unique hash value(MD5), we will refer to their hash values in the followingstudies.

Inlier case Study 1 - YzchBgserv family. Malware“327a0387a3114ab4e4d18a81ad9fca8b”, which is a malwarewith an empty label being processed by EUPHONY, is clus-tered into the YzchBgserv family by ANDRE based on ourhybrid representation. The app has similar malicious behaviorsto other apps in YzchBgserv, which is essentially a type ofransomware. The app runs as a background thread which isautomatically triggered when the app starts. It first fetchesa target service provider number (component info) from aremote server and then sends an SMS message (permissioninfo) to the number, incurring a charge on the user’s phonebill. The SMS messages (code info) sent will always start witha string “YZHC”.

Inlier case study 2 – DroidKungFu family. Malware“00307253cd9ad3d1120df85beb473801” which is unable tobe clustered by EUPHONY, is clustered into the DroidKungFufamily by ANDRE. We manually checked the app whereasobserving that such malware infects a self-defined serviceand receiver (code-info), which can be automatically launchedwithout user interaction. Once the service gets started, it willcollect three kinds of user information on the infected device,including the IMEI number, phone model, and the Android OSversion (meta-info). These behaviors are typical symptoms ofbeing infected by the Droidkungfu family.

Inlier case Study 3 – Fakeupdates family. Malware“005054fdf485fdbbada461bd7824cfb2”, whose top two fam-ilies are reported by an equal number of vendors (“fake-updates”:1, “igexin”:1). After manual validation, we estab-lished that the plankton.device.android.servicepackage is its source of malice. By looking at several otherpeers within the same cluster, we find that they all share thispackage (code-info) with an identical code segment to start themalicious service (i.e., AndroidMDKService. initMDK ()).

F. Case study for weakly-labeled malware in outlier

We also performed a manual inspection of outliers whosemalicious behaviours are different from the known familiesin the network. As per our discussions in Section III-C, theresults of our outlier detection often rely on the availabilityof the existing known families in the network, which maynot cover all emerging malware families. The following casestudies report the outlier clusters, inside which an app doesnot conduct behaviors that are similar to those of the existingground-truths (e.g., Drebin).

Outlier-1 case study. This cluster, containing 9 mal-ware, is the largest of all outlier clusters. Our manualinspection reveals that the apps in this cluster share thesame risky permission combinations (9 suspicious permis-sions in total, as shown in Table VIII) to trigger similarmalicious behaviors, which induce users to click an inboundlink address to install a trojan package that modifies theapp’s configuration files. For example, the malware candidate“ee93c3eefb81151eca7346db74d613c” has the top two familynames “autoins” and “letang” both reported by 2 vendors usingEUPHONY. These two family names are new and do not appearin our ground-truths.

Outlier-2 case study. This cluster contains 4 malwaresharing similar malicious behaviors as triada [42], a recentmalware family, which does not belong to any of the existingfamilies in the network. The malware in this cluster first col-lects sensitive information including Android OS versions andthe list of installed applications. Then it sends this informationto the command and control (C&C) server [43].

Outlier-3 case study. This cluster also contains 4 mal-ware in total. The malice of the malware in this clus-ter starts from their in-app advertisements. In particular,when a user clicks a pre-defined advertisement link, themalicious program will display as many ads as possi-ble to the user. These malware also have a risky com-bination of permissions using CHANGE WIFI STATE andCHANGE NETWORK STATE whose intention is to switchthe Wifi and network status respectively after deriving theircurrent status.

V. RELATED WORK

Android malware detection. A number of solutions to Androidmalware detection [26, 32, 44–46] exist that gives a binarydecision to identify whether or not an app is malicious.DREBIN [32] proposes a malware detection approach througha two-class SVM by performing a lightweight static analysis

TABLE VIIISUSPICIOUS PERMISSIONS USED BY APPS IN OUTLIER-1

Permission StringsACCESS COARSE LOCATION

ACCESS FINE LOCATIONACCESS FINE LOCATION

INTERNETREAD PHONE STATE

SYSTEM ALERT WINDOWWRITE EXTERNAL STORAGE

ACCESS DOWNLOAD MANAGERDOWNLOAD WITHOUT NOTIFICATION

WRITE SETTINGS


Fig. 8. Weakly-labeled malware detected as inliers (known families in the network). The horizontal axis in the figure shows the family names of all malwarein the inlier clusters. The vertical axis indicates the log value of the number of apps in the corresponding families.

to extract API calls and manifest files as the input features.MaMaDroid [47] leverages sequences of abstracted methodcalls to create a probabilistic representation of program be-haviors in the form of Markov chains. DroidAPIMiner [44]mines API features for malware detection in Android using alightweight KNN-based binary classification. HINDroid [48]proposes an Android malware detection approach using hetero-geneous information network. Kim et al. [49] present a deep-learning-based approach to malware detection by utilizingthe features extracted from an Android app. Unlike malwaredetection, malware clustering or labeling (such as ANDRE)performs fine-grained familial identification, which in turn canbe used to build references datasets for supervised detectionand classification [5].Android malware classification and clustering. Droid-Miner [50] proposes a two-level behavioral graph modeland extracts sensitive paths to represent malicious behavioralpatterns for malware classification. DroidSIFT [51] classifiesAndroid malware through dependency graphs. The approachexcludes common third-party libraries to improve precisionby using a set of benign subgraphs to remove the commonsubgraphs of the whole dependency graph. DroidCat [52]proposes a dynamic malware detection and classification ap-proach that complements existing static program analyses bysupporting dynamic features such as reflections and callbacks.SMART [53] presents a malware classification approach basedon semantic clones using deterministic symbolic automation.

Malware clustering mainly works in an unsupervised orsemi-supervised scenario to cluster unlabeled samples. Mostexisting works on Android malware clustering rely on theraw labels reported by AntiVirus vendors [32]. However, thelack of common naming standards results in inconsistenciesamong the reports from different vendors [54]. To address thisproblem, AVCLASS [5] processes the raw label from vendorsby performing alias detection and generic token removal basedon vendor-specific rules to produce a single label per sample.EUPHONY further enhances its accuracy by using self-definedextraction rules to distinguish fields of a raw label. Consensusbetween different vendors is then reached through pluralityvoting. Li et al. [16] present a malicious payload miningmethod for clustering malware by considering the source code

information of an app. However, the approach relies on apre-labeling phase similar to AVCLASS and weakly-labeledsamples are not included in their clustering process. This paperaddresses the limitations of previous approaches to clusteringweakly-labeled malware using a new representation learningapproach that preserves code similarity, meta-info and the rawlabels of vendors.Representation Learning. Representation learning takes sam-ples and their corresponding raw features as input and pro-duces a unified and low-dimensional representation, whichis particularly useful for later machine-learning tasks, suchas classification and clustering. Many existing methodsperform representation learning based on a single sourceof information, such as natural languages (word2vec [30]and doc2vec [14]) and graph structures (DeepWalk [19],Node2vec [55], and LINE [56]).

HRL model has recently emerged as a promising branchof representation learning, demonstrating its power in embed-ding heterogeneous information into a unified low dimensionalspace. The resulting representation is particularly useful formany complicated machine-learning tasks. This paper is thefirst work that leverages the recent advances in HRL to per-form comprehensive Android malware clustering by preserv-ing multi-source heterogeneous information by jointly learningthree correlations: node sequences, node content, and nodelabels. This comprehensive representation learning approach isshown as a promising solution for clustering Android malware.

VI. DISCUSSIONS AND LIMITATIONS

First, like all learning-based approach, our approachrequires samples for training our model for precise clusteringof weakly-labeled malware. The more samples we have, themore precise and robust our clustering model will be.

Second, this approach relies on a whitelist [29] to excludecommon Android application third-party libraries, which maypotentially lead to false positives or negatives, if the whitelistis incomplete for emerging APIs. Nevertheless, whitelist is stilla standard way for reducing noise in code similarity analysis.In addition, whitelist also provides a flexible way for addingany new APIs which are safe.


Third, regarding the calculation of the similarity distancebetween two Android applications, our method uses Jaccardfor measuring the similarity distances, which achieves goodresults for code analysis. However, other methods, such asCosine [57], can also be used as an alternative approach tosimilarity analysis.

Fourth, ANDRE aims to develop a representation learn-ing framework for clustering weakly-labeled malware. Whencomparing the similarity between Android apps, our methoddoes not focus on obfuscation or deobfuscation, which maylead to imprecise results if malicious apps are obfuscated.Investigating obfuscated app is an orthogonal but an interestingfuture topic.

Finally, when calculating the similarity between two apps,we made a pairwise comparison for all Android malware.Applying parallel computing or enable similarity comparisonbetween apps using multiple threads can reduce the codeanalysis overhead and save comparison time.

VII. CONCLUSION

This paper proposes ANDRE, a new approach to Androidmalware clustering that utilizes heterogeneous informationincluding code similarity, the raw labels of AV vendors andmeta-data information to jointly learn an effective represen-tation that embeds all malware in the network into a low-dimensional and compact hybrid feature space for effectivelyclustering weakly-labeled malware. The experimental resultsshow that ANDRE achieves comparable accuracy to the state-of-the-art approaches for clustering ground-truth samples andthat ANDRE can effectively cluster weakly-labeled malwarewhich cannot be clustered by those approaches.

VIII. ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for theirhelpful comments. This research is supported by AustralianResearch Grant DE170101081, DP180102828, LP150100671,DP180100106 and the National Natural Science Foundationof China under Grant 61772055.

REFERENCES

[1] T. I. Murphy, “Android-statistics & facts,”https://www.statista.com/topics/876/android, 2018.

[2] “Avtest,” https://www.av-test.org/en/statistics/malware/, 2018.

[3] “Caro naming convention,” http://www.caro.org/articles/naming.html.

[4] J. Kuo and D. Beck, “The common malware enumerationinitiative,” Virus Bulletin, pp. 14–15, 2005.

[5] M. Sebastian, R. Rivera, P. Kotzias, and J. Caballero,“Avclass: A tool for massive malware labeling,” in RAID.Springer, 2016, pp. 230–253.

[6] M. Hurier, G. Suarez-Tangil, S. K. Dash, T. F. Bissyande,Y. L. Traon, J. Klein, and L. Cavallaro, “Euphony:Harmonious unification of cacophonous anti-virus vendorlabels for android malware,” in MSR, 2017, pp. 425–435.

[7] “Virustotal,” https://www.virustotal.com/home/upload,accessed 2018.

[8] “Pluralityvoting,” https://en.wikipedia.org/wiki/Plurality voting.[9] Y. Bengio, A. Courville, and P. Vincent, “Representa-

tion learning: A review and new perspectives,” TPAMI,vol. 35, no. 8, pp. 1798–1828, 2013.

[10] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang,“Network representation learning with rich text informa-tion.” in IJCAI, 2015, pp. 2111–2117.

[11] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” in AAAI, 2016, pp.1895–1901.

[12] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive rep-resentation learning on large graphs,” in NIPS, 2017, pp.1024–1034.

[13] W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song,B. Zong, H. Chen, and W. Wang, “Learning deepnetwork representations with adversarially regularizedautoencoders,” in SIGKDD, 2018, pp. 2663–2671.

[14] Q. Le and T. Mikolov, “Distributed representations ofsentences and documents,” in ICML, 2014, pp. 1188–1196.

[15] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S.Yu, “A comprehensive survey on graph neural networks,”arXiv preprint:1901.00596, 2019.

[16] Y. Li, J. Jang, X. Hu, and X. Ou, “Android malwareclustering through malicious payload mining,” in RAID,2017, pp. 192–214.

[17] C. Shi, B. Hu, W. X. Zhao, and S. Y. Philip, “Heteroge-neous information network embedding for recommenda-tion,” TKDE, vol. 31, no. 2, pp. 357–370, 2019.

[18] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency,and R. Salakhutdinov, “Learning factorized multimodalrepresentations,” in ICLR, 2019.

[19] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk:Online learning of social representations,” in SIGKDD,2014, pp. 701–710.

[20] F. Morin and Y. Bengio, “Hierarchical probabilisticneural network language model.” in AISTATS, vol. 5.Citeseer, 2005, pp. 246–252.

[21] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V.Lopes, “Sourcerercc: scaling code clone detection to big-code,” in ICSE, 2016, pp. 1157–1168.

[22] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “Deckard:Scalable and accurate tree-based detection of codeclones,” in ICSE, 2007, pp. 96–105.

[23] L. Li, T. F. Bissyande, and J. Klein, “Rebooting researchon detecting repackaged android apps: Literature reviewand benchmark,” TSE, 2019.

[24] F. Wei, Y. Li, S. Roy, X. Ou, and W. Zhou, “Deepground truth analysis of current android malware,” inInternational Conference on Detection of Intrusions andMalware, and Vulnerability Assessment. Springer, 2017,pp. 252–276.

[25] H. Wang, Y. Guo, Z. Ma, and X. Chen, “Wukong: ascalable and accurate two-phase approach to android appclone detection,” in ISSTA, 2015, pp. 71–82.

[26] K. Chen, P. Wang, Y. Lee, X. Wang, N. Zhang, H. Huang,W. Zou, and P. Liu, “Finding unknown malice in 10seconds: Mass vetting for new threats at the google-play


scale.” in USENIX, vol. 15, 2015.[27] K. Chen, P. Liu, and Y. Zhang, “Achieving accuracy and

scalability simultaneously in detecting application cloneson android markets,” in ICSE, 2014, pp. 175–186.

[28] L. Li, T. F. Bissyande, J. Klein, and Y. Le Traon, “Aninvestigation into the use of common libraries in androidapps,” in SANER, vol. 1, 2016, pp. 403–414.

[29] ——, “An investigation into the use of common librariesin android apps,” in SANER, vol. 1, 2016, pp. 403–414.

[30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado,and J. Dean, “Distributed representations of words andphrases and their compositionality,” in NIPS, 2013, pp.3111–3119.

[31] V. Chandola, A. Banerjee, and V. Kumar, “Anomalydetection: A survey,” ACM computing surveys (CSUR),vol. 41, no. 3, p. 15, 2009.

[32] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon,K. Rieck, and C. Siemens, “Drebin: Effective and ex-plainable detection of android malware in your pocket.”in Ndss, vol. 14, 2014, pp. 23–26.

[33] “Virusshare,” https://www.virusshare.com.[34] “Android asset packaging tool,”

https://developer.android.com/studio/command-line/aapt2.

[35] “Topological anomaly detection,”https://unsupervisedlearning.wordpress.com/2014/08/04/topological-anomaly-detection/.

[36] “Mlpclassifier,” https://scikit-learn.org/stable/modules/neural networks supervised.html .

[37] “Svm,” https://scikit-learn.org/stable/modules/svm.html.[38] “Knn,” https://scikit-learn.org/stable/modules/generated/

sklearn.neighbors.KNeighborsClassifier.html.[39] “Decisiontree,” https://scikit-learn.org/stable/modules

/tree.html.[40] “Randomforest,” https://scikit-learn.org/stable/modules

/generated/sklearn.ensemble.RandomForestClassifier.html.[41] “Sklearn,” https://scikitlearn.org.[42] “triada,” https://www.kaspersky.com/blog/triada-

trojan/11481/.[43] “Command and control server (c&c),”

https://www.trendmicro.com/vinfo/us/security/definition/commandandcontrol-server.

[44] Y. Aafer, W. Du, and H. Yin, “Droidapiminer: Min-ing api-level features for robust malware detection inandroid,” in International conference on security andprivacy in communication systems. Springer, 2013, pp.86–103.

[45] Y. Feng, S. Anand, I. Dillig, and A. Aiken, “Apposcopy:Semantics-based detection of android malware throughstatic analysis,” in FSE. ACM, 2014, pp. 576–587.

[46] W. Yang, X. Xiao, B. Andow, S. Li, T. Xie, and W. Enck,“Appcontext: Differentiating malicious and benign mo-bile app behaviors using context,” in ICSE, 2015, pp.303–313.

[47] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristo-faro, G. Ross, and G. Stringhini, “Mamadroid: Detectingandroid malware by building markov chains of behavioralmodels,” NDSS, 2017.

[48] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, “Hin-droid: An intelligent android malware detection systembased on structured heterogeneous information network,”in SIGKDD. ACM, 2017, pp. 1507–1515.

[49] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “Amultimodal deep learning method for android malwaredetection using various features,” TIFS, vol. 14, no. 3,pp. 773–788, 2019.

[50] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras,“Droidminer: Automated mining and characterization offine-grained malicious behaviors in android applications,”in European symposium on research in computer secu-rity. Springer, 2014, pp. 163–182.

[51] M. Zhang, Y. Duan, H. Yin, and Z. Zhao, “Semantics-aware android malware classification using weightedcontextual api dependency graphs,” in SIGSAC. ACM,2014, pp. 1105–1116.

[52] H. Cai, N. Meng, B. Ryder, and D. Yao, “Droidcat:Effective android malware detection and categorizationvia app-level profiling,” TIFS, 2018.

[53] G. Meng, Y. Xue, Z. Xu, Y. Liu, J. Zhang, andA. Narayanan, “Semantic modelling of android malwarefor effective malware comprehension, detection, andclassification,” in Proceedings of the 25th InternationalSymposium on Software Testing and Analysis. ACM,2016, pp. 306–317.

[54] M. Hurier, K. Allix, T. F. Bissyande, J. Klein, andY. Le Traon, “On the lack of consensus in anti-virusdecisions: Metrics and insights on building ground truthsof android malware,” in DIMVA, 2016, pp. 142–162.

[55] A. Grover and J. Leskovec, “node2vec: Scalable featurelearning for networks,” in SIGKDD, 2016, pp. 855–864.

[56] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei,“Line: Large-scale information network embedding,” inWWW, 2015, pp. 1067–1077.

[57] “Cosine similarity,” https://en.wikipedia.org/wiki/Cosinesimilarity, accessed 2018.

Documents

Familial Clustering For Weakly-labeled Android Malware ... · IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY CLASS FILES, VOL. XXX, NO. XXX, XXX 2019 1 Familial Clustering