AI Based Antivirus: Detecting Android Malware Variants With a Deep Learning System

Thomas Lei Wang @thomaslwang


Page 2

About me
• My first (boring) job was as a virus analyst, in 2004.
• I had a dream…

Page 3

Virus Analysis VS Image Recognition

An experienced virus analyst is sometimes doing image recognition!

Image provided by the MNIST handwritten digit database

Page 4

Sample increase VS signature efficiency decrease

[Chart: Number of Malicious Android Apps]

Dowgin: A Variant-Rich Android Adware Family

[Chart: New Dowgin Samples VS Average Dowgin Samples Hit Per Signature]

Malicious apps, Dowgin samples and Dowgin signatures are counted from our database.

Page 5

Our evolution
• Signature based rules
• Behavioral based rules
• Opcode based rules
• AI based deep learning system

Page 6

Training

Feature Extraction
• Structural type
• Statistical type
• Empirical type
• Continuous value
• 0-1 value

Feature Normalization
• Standard score normalization
• Cutting technique
• Quantile normalization

Training in Deep Neural Network
• PaddlePaddle platform
• Residual layer
• AutoEncoder
• Configuration tunings

Models
• Malware model
• PUA model

Prediction

Input APK features → Model → Output

Page 7

Feature extraction

APK

Structural features
• Number of uses-permission entries in AndroidManifest
• Number of picture files in /res
• Size of /res
• Number of classes starting with Lcom/
• Number of classes starting with Ljava/
• Number of fields of type boolean
• Number of methods with more than 20 parameters

Empirical features
• Has executable file in /res
• Has apk file in /assets
• Registers DEVICE_ADMIN_ENABLED broadcast and has sendSMSMessage permission

Statistical features
• Count certificate fields in samples to get 100 strings with discriminati[email protected]/benign=52

Numeralization (N=1235)

Continuous value (N=571)

205 3 34.5 143234
285 68 296 7
13850 157 11218 847
1.23e+9 422 1004 177
0 398 13.333 125

0-1 value (N=664)

0 0 0 0
0 0 0 1
0 1 0 0
0 1 1 0
0 1 0 0
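A few of the structural and empirical features listed on this slide can be read straight from the APK's ZIP layout. The following is a minimal sketch under stated assumptions, not the extractor used in the talk: the function name, the feature keys, the picture extension list, and the choice to treat "executable" as "ELF binary" are all hypothetical. Features that need the manifest or the DEX code (uses-permission counts, class and method counts) require dedicated parsers and are not covered here.

```python
import io
import zipfile

# Assumption: these extensions count as "picture files".
PICTURE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp")
# Assumption: "executable" means an ELF binary, detected by its magic bytes.
ELF_MAGIC = b"\x7fELF"


def extract_zip_level_features(apk_bytes):
    """Return a few features that are visible from the APK's ZIP entries alone."""
    features = {
        "num_picture_files_in_res": 0,  # structural, continuous value
        "size_of_res": 0,               # structural, uncompressed bytes
        "has_executable_in_res": 0,     # empirical, 0-1 value
        "has_apk_in_assets": 0,         # empirical, 0-1 value
    }
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for info in apk.infolist():
            if info.is_dir():
                continue
            name = info.filename
            if name.startswith("res/"):
                features["size_of_res"] += info.file_size
                if name.lower().endswith(PICTURE_EXTENSIONS):
                    features["num_picture_files_in_res"] += 1
                # Read only the first four bytes to check for the ELF magic.
                with apk.open(info) as entry:
                    if entry.read(4) == ELF_MAGIC:
                        features["has_executable_in_res"] = 1
            elif name.startswith("assets/") and name.lower().endswith(".apk"):
                features["has_apk_in_assets"] = 1
    return features
```

Each returned value would fill one slot of the 1235-element feature vector: counts and sizes go to the continuous part, flags to the 0-1 part.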


Page 8

Feature normalization

Continuous value → [-1,1]

Distributions
• Gaussian distribution
• Long-tailed distribution
• Multimodal distribution

Techniques
• Standard score normalization
• Quantile normalization
• Cutting technique

Noise problem

To make features more discriminative. Precision increased by 9%.
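The three normalization techniques named on this slide can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: the function names are hypothetical, the "cutting technique" is interpreted here as clipping values to a fixed range, and the quantile normalization shown is a simple rank-based mapping onto [-1, 1]. The slide does not say which technique is paired with which distribution, so the sketch does not assume a pairing.

```python
import bisect
import statistics


def standard_score(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def cut(values, low, high):
    """Clip every value into [low, high] (assumed meaning of "cutting")."""
    return [min(max(v, low), high) for v in values]


def quantile_normalize(values):
    """Map each value to its empirical rank, rescaled linearly to [-1, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [0.0]
    result = []
    for v in values:
        # The mid-rank gives tied values the same, centered position.
        lo = bisect.bisect_left(ordered, v)
        hi = bisect.bisect_right(ordered, v)
        rank = (lo + hi - 1) / 2  # ranges from 0 to n - 1
        result.append(2 * rank / (n - 1) - 1)
    return result
```

The rank-based mapping ignores the magnitude of outliers, so a value such as 1.23e+9 in a long-tailed feature lands at 1 instead of flattening every other sample toward the bottom of the range.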

Page 9

Training in deep neural network

Network Architecture
• Input layer: normalized continuous value (n=571), 0-1 value (n=664)
• Hidden layer 1, hidden layer 2, hidden layer 3 and residual layer (identity): n=256, Tanh and ReLU activations
• Output layer: Softmax, outputs 0 and 1
• AutoEncoder

Configurations:
• Hidden layer activation function: Tanh and ReLU
• Cost function: Multi-class cross entropy
• Learning method: ADADELTA
• Final layer activation function: Softmax
• Passes: 20–30

Trained on PaddlePaddle platform with 15M+ samples
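To make the architecture concrete, here is a minimal forward-pass sketch in plain Python. It is not PaddlePaddle code and not the production network. The layer sizes follow the slide (1235 inputs, 256 hidden units, 2 outputs), but which hidden layer uses Tanh and which uses ReLU, and where the identity shortcut connects, are assumptions made for illustration. All function names are hypothetical, the weights are random, and training with ADADELTA is not shown.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def tanh(vec):
    return [math.tanh(v) for v in vec]


def relu(vec):
    return [max(0.0, v) for v in vec]


def softmax(vec):
    peak = max(vec)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights in a small uniform range, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(rng, n_input=1235, n_hidden=256, n_classes=2):
    return {
        "h1": init_layer(rng, n_input, n_hidden),
        "h2": init_layer(rng, n_hidden, n_hidden),
        "h3": init_layer(rng, n_hidden, n_hidden),
        "out": init_layer(rng, n_hidden, n_classes),
    }


def forward(net, features):
    """Return class probabilities for one feature vector."""
    h1 = tanh(dense(features, *net["h1"]))  # assumed activation placement
    h2 = relu(dense(h1, *net["h2"]))
    h3 = relu(dense(h2, *net["h3"]))
    # Residual layer with identity shortcut: add h1 back onto h3 (assumed wiring).
    residual = [a + b for a, b in zip(h3, h1)]
    return softmax(dense(residual, *net["out"]))


def cross_entropy(probabilities, label):
    """Multi-class cross entropy for a single sample with an integer label."""
    return -math.log(probabilities[label])
```

A call such as `forward(build_network(random.Random(0)), vector)` with a 1235-element `vector` returns two probabilities that sum to 1, one per output class.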

Page 10

Prediction & Evaluation

Production deployment
• Extract apk features on the phone (Perf: 140 ms/apk, Traffic: 1 kB/apk)
• Send features to the cloud
• Predict in the cloud with the models
• Return prediction to the phone

Model lifetime

[Chart: The lifetime of model trained on Jan 2016. Y axis: Recall, 0.88 to 0.97. X axis: Jan 2016, Mar 2016, May 2016, Jul 2016.]

The model is trained on Jan 2016 and tested against AV-TEST samples from Jan, Mar, May and July. The recall rate dropped by 7.6% in 6 months.

Detection performance

[Chart: Detection performance as ROC curve. Y axis: True Positive Rate, 0.95 to 0.995. X axis: False Positive Rate, 0 to 0.1.]

The ROC curve is tested against AV-TEST July's samples: 7613 Android malware, 3020 legitimate Android apps, total 10633.
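The ROC curve and the recall figures on this slide come from comparing model scores with ground-truth labels. The sketch below shows one common way to compute ROC points and to read off the recall at a given false positive rate. The function names are hypothetical, the evaluation code used for the talk is not shown in the slides, and the scores in the usage note are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per distinct score.

    labels use 1 for malware and 0 for legitimate apps; both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample that shares this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def recall_at_fpr(points, max_fpr):
    """Best true positive rate (recall) achievable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)
```

For example, `recall_at_fpr(roc_points([0.9, 0.8, 0.7, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0, 0]), 0.0)` gives the recall reachable with no false positives on that toy data. Repeating the computation on test sets from later months is how a model lifetime curve like the one above can be produced.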

Page 11

Limitations
• Can't provide explanations for its detection results.
• Can't understand code meaning.
• Built on static analysis; lacks dynamic inspection.
• Can't self-learn; needs continuous training with labeled data.

Advantages
• More difficult to evade
• Fixed-size

Page 12

Conclusion
• Feature extraction is the key step.
  •  Virus analyst experience can help to find valuable features.
  •  An AutoEncoder neural network can be used to extract the most valuable features from a large number of features.
• This system is designed to detect Android malware, but these methods can also be used to detect malware on other platforms.
• Our system learns in an image-recognition way. It is effective only in detecting malware variants.

Page 13

Thank you
• You are welcome to contact me
  •  Twitter: @thomaslwang
  •  Email: [email protected]
• Cooperation and partnership with us are welcome
• Acknowledgement
  •  Baidu IDL: Lyv Qin, Xiao Zhou, Jie Zhou, Errui Ding, Yuanqing Lin, Andrew Ng
  •  Partner: Liuping Hou, Jinke Liu, Zhijun Jia, Yanyan Ji
  •  PaddlePaddle platform http://paddlepaddle.org