Detection of Malware Downloads via Graph Mining (AsiaCCS '16)

Real-Time Detection of Malware Downloads via Large-Scale URL→File→Machine Graph Mining

Babak Rahbarinia ; Marco Balduzzi ; Roberto Perdisci
AsiaCCS 2016, June 02, Xi’an, China

Introduction

Traditional AV is dead?
● Signature-based vs. statistical-based

Traditional AV is ineffective (it doesn’t work!)
● Polymorphism, code obfuscation, packers, ...

URL blacklisting
● Static, lags behind
● Time-consuming analysis of individual URLs

Local vs. global
● Local: looks at one potential malware sample at a time
● Global: leverages global situational awareness

Introduction

Large-scale analysis of behavioral patterns
● “Who - where - what” relationship
● Global situational awareness
● Graph-based machine learning

Combination of system- and network-level information

Mastino:
● Real-time and concurrent detection of download events
● Real-world deployment on millions of machines (Internet-scale)

Approach

[Figure: overview of the approach (two diagram-only slides)]

(Quick) Related Work

Static + dynamic detection [Many]

Graph mining detection: Polonium [KDD10]
● Offline approach vs. real-time
● File classification only vs. files + URLs (download events)
● Bipartite vs. tripartite graph
● Proprietary reputation function vs. open

AMICO [Esorics13]
● HTTP-centric vs. protocol-independent
● Only works in LANs vs. “moves across networks”

Google’s CAMP [NDSS13]
● Browser-centric vs. system-centric

Download Graph

[Figure: tripartite graph with three node layers: URLs, Files, Machines]
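As a minimal sketch of the graph on this slide, assuming each download event is a (URL, file hash, machine ID) triple, the three layers can be held in plain adjacency sets. The class name `DownloadGraph`, its fields, and the sample values are hypothetical and are not taken from Mastino.

```python
from collections import defaultdict


class DownloadGraph:
    """Tripartite URL -> file -> machine graph built from download events."""

    def __init__(self):
        # Adjacency sets; all names here are illustrative assumptions.
        self.url_to_files = defaultdict(set)
        self.file_to_urls = defaultdict(set)
        self.file_to_machines = defaultdict(set)
        self.machine_to_files = defaultdict(set)

    def add_event(self, url, file_hash, machine_id):
        """Record one download event (URL, file, machine)."""
        self.url_to_files[url].add(file_hash)
        self.file_to_urls[file_hash].add(url)
        self.file_to_machines[file_hash].add(machine_id)
        self.machine_to_files[machine_id].add(file_hash)

    def file_neighbors(self, file_hash):
        """Return (URLs that served the file, machines that downloaded it)."""
        return (set(self.file_to_urls.get(file_hash, ())),
                set(self.file_to_machines.get(file_hash, ())))


graph = DownloadGraph()
graph.add_event("http://a.example/x.exe", "sha1-f1", "machine1")
graph.add_event("http://b.example/y.exe", "sha1-f1", "machine2")
urls, machines = graph.file_neighbors("sha1-f1")
```

Keeping both directions of every edge makes neighbor lookups cheap in either direction, at the cost of extra memory.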

Annotations

URLs
● Age of URL, domain, path, IP

Files
● Size
● Lifetime, prevalence
● Packed, signed

Machines
● Download behavior
● Client processes

Labeling

URLs
● B (benign): Alexa (-hosting)
● M (malicious): GSB + WRS

Files
● B (benign): GRID + VT
● M (malicious): VT

Machines
● Reputations based on their download/activity history

Features and classifier

File features (file f, linked to url1 ... url4 and machine1 ... machine3)

f behavior-based features = {URL stats, machine stats}
● URL stats: compute min, max, med, avg, and std of the URL’s R + R of [FQD, e2LD, path, path pattern, query string, query pattern]
● Machine stats: compute min, max, med, avg, and std of the machine’s R

f intrinsic features = {file size, prevalence, packed, signed, ...}

URL features (URL u, linked to file1 ... file4 and machine1 ... machine3)

u behavior-based features = {file stats, machine stats}, computed over u + {all URLs sharing a component with u}
● File stats: compute min, max, med, avg, and std of the file’s R
● Machine stats: compute min, max, med, avg, and std of the machine’s R

u intrinsic features = {URL, FQD, e2LD recency}
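The "compute min, max, med, avg, and std" step can be illustrated with a short sketch. It assumes each neighbor already has a numeric reputation R, and it summarizes only the URLs' own R, not the R of their components. The function names, the all-zero fallback for empty inputs, and the choice of population standard deviation are assumptions; the slides do not specify them.

```python
import statistics


def reputation_stats(scores):
    """Return [min, max, med, avg, std] for a list of neighbor reputations."""
    if not scores:
        # Assumption: a node with no scored neighbors gets all-zero stats.
        return [0.0] * 5
    return [
        min(scores),
        max(scores),
        statistics.median(scores),
        statistics.fmean(scores),
        statistics.pstdev(scores),  # population std; an assumed choice
    ]


def file_behavior_features(url_scores, machine_scores):
    """Concatenate URL stats and machine stats into one feature vector."""
    return reputation_stats(url_scores) + reputation_stats(machine_scores)


# Hypothetical reputations for three URLs and two machines linked to a file.
features = file_behavior_features([0.1, 0.9, 0.5], [0.2, 0.4])
```

The same helper would apply to URL features by passing file and machine reputations instead.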

Example #1

[Figure: download graph with nodes U1, U2, F1, F2, F3, G1, G2 across the URLs, Files, and Machines layers]

What could be said about F1 and F2?

Example #2

[Figure: download graph with URL u and file F1 across the URLs, Files, and Machines layers]

What could be said about F1? All neighbors are unknown.

Consider all URLs that share the same components as u (e.g., FQD, path).

All URL components:
● FQD
● e2LD
● Path
● Path pattern
● Query string
● Query string pattern
● IP
● IP/24
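A rough sketch of the "URLs sharing a component with u" idea, using only Python's standard library. It covers just four of the listed components (FQD, path, query string, IP/24). e2LD extraction would need a public suffix list and is left out, as are the pattern components. The function names and sample URLs are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def url_components(url, ip=None):
    """Split a URL into a few of the components listed on the slide."""
    parts = urlsplit(url)
    comps = {
        "fqd": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    if ip is not None:
        comps["ip"] = ip
        # The /24 network that contains the given IPv4 address.
        comps["ip24"] = str(ipaddress.ip_network(ip + "/24", strict=False))
    return comps


def related_urls(target, known_urls):
    """Return known URLs sharing at least one non-empty component with target."""
    wanted = {(k, v) for k, v in url_components(target).items() if v}
    return [
        u for u in known_urls
        if u != target and wanted & set(url_components(u).items())
    ]


known = [
    "http://dl.example.com/a/setup.exe",
    "http://cdn.example.net/b/setup.exe",
    "http://other.example.org/c/tool.exe",
]
shared = related_urls("http://dl.example.com/z/new.exe", known)
```

Here only the first known URL is returned, because it is the only one with the same FQD as the target.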

Deployment

[Figure: timeline of Day 1, Day 2, ..., Yesterday, Today, with a time window of 10 days]

● Trained classifiers: URL classifier and SHA1 classifier
● Real-time classification of URLs & SHA1s
● Detection of malicious download events
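The 10-day time window could be applied as in the sketch below. It assumes, based only on the timeline labels, that the window covers the 10 days ending yesterday and that today's events are the ones being classified. The function name and the tuple layout of an event are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 10  # time window of 10 days, as stated on the slide


def events_in_window(events, today, window_days=WINDOW_DAYS):
    """Keep events dated within the window_days days before `today`.

    Each event is assumed to be a tuple whose first item is a datetime.date.
    """
    start = today - timedelta(days=window_days)
    return [e for e in events if start <= e[0] < today]


today = date(2014, 3, 11)
events = [
    (date(2014, 2, 28), "too old"),
    (date(2014, 3, 1), "oldest day in window"),
    (date(2014, 3, 10), "yesterday"),
    (date(2014, 3, 11), "today, to be classified"),
]
training_events = events_in_window(events, today)
```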

Data Collection

7 months of data (Jan to Aug 2014)
● Download event d = (u, f, m)
● Hundreds of thousands of machines, files, URLs
● Millions of nodes

Labeling:
● Files: VirusTotal, GRID [Trend]
● URLs: Alexa, Google Safe Browsing, WRS [Trend]

Annotations:
● File census and GUID census [Trend]
● VirusTotal (signed, ...)

Train & test for new download events

Detection results for new download events over 7 periods of 5 days (35 days in total)

[Figure: detection results, with panels for Files and URLs]

Combined detection of download events

(u = m) ∨ (f = m) -> d = m, where m here stands for "malicious"
1-day experiment (5 months)

Efficiency: requests are served in ~0.16 sec
84% of detections: 0-days (unknown)
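The combination rule on this slide reads as a logical OR over the two classifier verdicts. A minimal sketch, assuming each classifier has already produced a boolean verdict (the function name is hypothetical):

```python
def is_malicious_download(url_is_malicious, file_is_malicious):
    """Flag a download event if its URL or its file is flagged as malicious."""
    return bool(url_is_malicious or file_is_malicious)


# Truth table over the four verdict combinations.
verdicts = {
    (u, f): is_malicious_download(u, f)
    for u in (False, True)
    for f in (False, True)
}
```

With an OR rule, a download is caught if either classifier flags it, even when the other side (for example, a never-seen file) has no verdict yet.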

Case Study #1

Wuachos.A Dropper
● Filename: file_saw.exe
● URLs with _no_ reputation
● Low prevalence
● Invalid signature
● Path pattern with R of 0.72 (malicious) [*]

1,445 URLs serving 182 polymorphic malware samples

[*] /f/1392240240/1255385580/2 , /f/1392240120/4165299987/2 -> /H1/I10/I10/I1
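The footnote's mapping from concrete paths to /H1/I10/I10/I1 suggests that each path segment is replaced by a type letter and its length. The sketch below reproduces that one example under the assumption that "I" means an all-digit segment and "H" a hexadecimal one. The "S" fallback for other segments is hypothetical and does not appear on the slide.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def segment_pattern(segment):
    """Map one path segment to a type letter plus the segment length."""
    if segment.isdigit():
        kind = "I"  # assumed: integer, as in I10 and I1 on the slide
    elif all(c in HEX_DIGITS for c in segment):
        kind = "H"  # assumed: hexadecimal, as in H1 for "f"
    else:
        kind = "S"  # hypothetical fallback for any other segment
    return f"{kind}{len(segment)}"


def path_pattern(path):
    """Abstract a URL path into a pattern of typed, length-tagged segments."""
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join(segment_pattern(s) for s in segments)


pattern_a = path_pattern("/f/1392240240/1255385580/2")
pattern_b = path_pattern("/f/1392240120/4165299987/2")
```

Both paths from the footnote collapse to the same pattern, which is what lets many polymorphic URLs share a single reputation.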

Case Study #2

Somoto Adware
● Filename: FreeZipSetup-[\d].exe
● Packed, short lifetime, prevalence = 0
● 1 related machine downloaded 1 known sample during our time window T = 10 days

Detected a campaign of 695 samples
● 616 were unknown to VirusTotal
● 61 still unknown after 6 months

Case Study #3

TTAWinCDM Spyware
● Machine and URL with _no_ reputation
● Low lifetime, prevalence, and number of countries
● Mismatch on downloading process: Acrobat process vs. unauthoritative domain
● Flash 0-day (+2 months)

Bonus #1

Analysis of Window T

Bonus #2

Features Analysis

[Figure panels: Files analysis, URLs analysis]

Conclusions

Mastino: real-time detection of malware downloads by passive client monitoring

● Content-agnostic, behavioral analysis
● Real-world deployment at large scale
● Over 95% TP / 0.5% FP
● 0-days

Thank you!

@embyte
http://www.madlab.it

Babak Rahbarinia ; Marco Balduzzi ; Roberto Perdisci

Questions?
