Real-Time Detection of Malware Downloads via Large-Scale URL→File→Machine Graph Mining
Babak Rahbarinia; Marco Balduzzi; Roberto Perdisci
AsiaCCS 2016, June 02, Xi’an, China
Introduction
Traditional AV is dead?
● Signature-based vs. statistical-based
Traditional AV inefficiency (they don’t work!)
● Polymorphism, code obfuscation, packers, ...
URL blacklisting
● Static, lags behind
● Time-consuming analysis of individual URLs
Local vs. global
● Local: looks at one potential malware sample at a time
● Global: leverages global situational awareness
Introduction
Large-scale analysis of behavioral patterns
● “Who - where - what” relationship
● Global situational awareness
● Graph-based machine learning
● Combination of system- and network-level info
Mastino:
● Real-time and concurrent detection of download events
● Real-world deployment on millions of machines (Internet-scale)
Approach
(Quick) Related Work
Static + dynamic detection [Many]
Graph mining detection: Polonium [KDD10]
● Offline approach vs. real-time
● Files-only classification vs. files + URLs (download event)
● Bipartite vs. tripartite graph
● Proprietary reputation function vs. open
AMICO [Esorics13]
● HTTP-centric vs. protocol-independent
● Only works in LANs vs. “move across networks”
Google’s CAMP [NDSS13]
● Browser-centric vs. system-centric
Download Graph
Node types: URLs, Files, Machines
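A minimal sketch of how such a tripartite download graph could be assembled from download events d = (u, f, m). The class name `DownloadGraph`, its methods, and the choice to link all three node pairs of an event are assumptions made for illustration; the slides do not specify the data structure.

```python
from collections import defaultdict


class DownloadGraph:
    """Tripartite graph over URLs, files, and machines (hypothetical sketch)."""

    def __init__(self):
        # Adjacency map: (node_type, node_id) -> set of neighboring (node_type, node_id)
        self.adj = defaultdict(set)

    def add_event(self, url, file_hash, machine):
        """Add one download event d = (u, f, m).

        Assumption: every pair among URL, file, and machine is linked,
        so each node can reach its neighbors of the other two types.
        """
        u, f, m = ("url", url), ("file", file_hash), ("machine", machine)
        for a, b in ((u, f), (f, m), (u, m)):
            self.adj[a].add(b)
            self.adj[b].add(a)

    def neighbors(self, node_type, node_id, of_type):
        """Return the sorted neighbor ids of the requested type."""
        return sorted(n for t, n in self.adj[(node_type, node_id)] if t == of_type)
```

With this layout, the URLs that served a file and the machines that downloaded it can be read off with `neighbors("file", sha1, "url")` and `neighbors("file", sha1, "machine")`.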
Annotations
Node types: URLs, Files, Machines
● Age of URL, domain, path, IP
● Size
● Lifetime, prevalence
● Packed, signed
● Download behavior
● Client processes
Labeling
URLs
● B (benign): Alexa (-hosting)
● M (malicious): GSB + WRS
Files
● B (benign): GRID + VT
● M (malicious): VT
Machines
● Machines’ reputations are based on their download/activity history
Features and classifier
File features
f behavior-based features = {URL stats, machine stats}
● URL stats: over f’s neighbor URLs (url1 ... url4), compute min, max, med, avg, and std of each URL’s R + the R of its [FQD, e2LD, path, path pattern, query string, query pattern]
● Machine stats: over f’s neighbor machines (machine1 ... machine3), compute min, max, med, avg, and std of each machine’s R
+ f intrinsic features = {file size, prevalence, packed, signed, ...}
URL features
Computed over u + {all URLs sharing a component with u}
u behavior-based features = {file stats, machine stats}
● File stats: over the neighbor files (file1 ... file4), compute min, max, med, avg, and std of each file’s R
● Machine stats: over the neighbor machines (machine1 ... machine3), compute min, max, med, avg, and std of each machine’s R
+ u intrinsic features = {URL, FQD, e2LD recency}
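The behavior-based features above reduce the reputations R of a node's neighbors to five summary statistics per neighbor type. A minimal sketch of that step follows; the function names, the use of the population standard deviation, and the all-zero placeholder for nodes without neighbors are assumptions, not details given on the slides.

```python
import statistics


def reputation_stats(reputations):
    """Return [min, max, med, avg, std] of a list of reputation scores R."""
    if not reputations:
        # Assumption: use an all-zero placeholder when a node has no neighbors
        return [0.0] * 5
    return [
        min(reputations),
        max(reputations),
        statistics.median(reputations),
        statistics.fmean(reputations),
        # Assumption: population standard deviation (defined even for one value)
        statistics.pstdev(reputations),
    ]


def file_behavior_features(url_reputations, machine_reputations):
    """Concatenate URL stats and machine stats into one 10-value feature vector."""
    return reputation_stats(url_reputations) + reputation_stats(machine_reputations)
```

The same helper could serve the URL side by passing file and machine reputations instead. The intrinsic features (file size, prevalence, and so on) would be appended to this vector before classification.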
Example #1
[Figure: download graph of URLs, files, and machines; labels U1, U2, F1, F2, F3, G1, G2]
What could be said about F1 and F2?
Example #2
[Figure: download graph with URL u, file F1, and machines]
What could be said about F1? All neighbors are unknown.
Use all URLs that share the same components as u (in the figure: FQD, Path)
All URL components:
* FQD
* e2LD
* Path
* Path pattern
* Query string
* Query string pattern
* IP
* IP/24
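A rough sketch of how some of the components listed above could be extracted from a URL so that URLs sharing a component with u can be found. The function names are hypothetical, the e2LD is approximated naively by the last two domain labels (a real system would consult the Public Suffix List), the IP is assumed to be IPv4 and supplied by the caller, and the path and query patterns are left out.

```python
import ipaddress
from urllib.parse import urlsplit


def url_components(url, ip=None):
    """Split a URL into a dict of comparable components (illustrative only)."""
    parts = urlsplit(url)
    fqd = parts.hostname or ""
    labels = fqd.split(".")
    comps = {
        "fqd": fqd,
        # Naive e2LD approximation: the last two labels of the domain name
        "e2ld": ".".join(labels[-2:]) if len(labels) >= 2 else fqd,
        "path": parts.path,
        "query": parts.query,
    }
    if ip is not None:
        comps["ip"] = ip
        # /24 network containing the server's IPv4 address
        comps["ip24"] = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    return comps


def shares_component(a, b):
    """True if two component dicts agree on at least one non-empty component."""
    return any(a[k] and a[k] == b.get(k) for k in a)
```

In Example #2, where every direct neighbor of F1 is unknown, a check like `shares_component` is what would let reputation flow in from related URLs on the same FQD, e2LD, or path.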
Deployment
Timeline: Day 1, Day 2, ..., Yesterday, Today
● Time window of 10 days
● Trained classifiers: URL classifier, SHA1 classifier
● Real-time classification of URLs & SHA1s
● Detection of malicious download events
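The sliding training window can be sketched as follows. The window length of 10 days comes from the slides; the assumptions are that the window ends yesterday, that today's events are the ones classified in real time, and that each event is a dict with a `day` field. All names here are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 10  # time window of 10 days, as stated on the slides


def training_window(today, window_days=WINDOW_DAYS):
    """Return (first_day, last_day) of the window, assumed to end yesterday."""
    last_day = today - timedelta(days=1)
    first_day = last_day - timedelta(days=window_days - 1)
    return first_day, last_day


def select_training_events(events, today):
    """Keep only events whose 'day' falls inside the training window."""
    first_day, last_day = training_window(today)
    return [e for e in events if first_day <= e["day"] <= last_day]
```

Re-running `select_training_events` each day moves the window forward by one day, so the URL and SHA1 classifiers would always be retrained on the most recent 10 days of download events.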
Data Collection
7 months of data (Jan to Aug 2014)
● Download event d = (u, f, m)
● Hundreds of thousands of machines, files, and URLs
● Millions of nodes
Labeling:
● Files: VirusTotal, GRID [Trend]
● URLs: Alexa, Google Safe Browsing, WRS [Trend]
Annotations:
● File census and GUID census [Trend]
● VirusTotal (signed, ...)
Train & test for new download events
Detection results for new download events over 7 periods of 5 days (35 days total)
[Figure: detection results for files and for URLs]
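Combined detection of download events
(u = m) ∨ (f = m) -> d = m, with m = malicious: a download event d is flagged as malicious if its URL u or its file f is classified as malicious
● 1-day experiment (5 months)
● Efficiency: requests are served in ~0.16 sec
● 84% of detections: 0-days (unknown)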
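The combination rule is a plain logical OR over the two classifier verdicts. A minimal sketch, assuming each classifier has already produced a boolean verdict; the function names and the event dict layout are hypothetical.

```python
def is_malicious_download(url_is_malicious, file_is_malicious):
    """Flag event d when its URL u or its file f is classified as malicious."""
    return url_is_malicious or file_is_malicious


def flag_events(events):
    """Return the ids of events flagged by the combined rule.

    Each event is assumed to be a dict with boolean 'url_m' and 'file_m'
    verdicts coming from the URL and file classifiers.
    """
    return [e["id"] for e in events if is_malicious_download(e["url_m"], e["file_m"])]
```

One consequence of the OR is that a never-seen file can still be caught through a URL with a bad verdict, and vice versa.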
Case Study #1
Wuachos.A Dropper
● Filename: file_saw.exe
● URLs with _no_ reputation
● Low prevalence
● Invalid signature
● Path pattern with R of 0.72 (malicious) [*]
1,445 URLs serving 182 polymorphic malware samples
[*] /f/1392240240/1255385580/2 , /f/1392240120/4165299987/2 -> /H1/I10/I10/I1
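The footnote shows two concrete paths collapsing into one path pattern. The sketch below reproduces that mapping under inferred rules: a segment of decimal digits becomes I plus its length, another hexadecimal segment becomes H plus its length, and anything else becomes S plus its length. The I and H readings are inferred from the single example on the slide, and the S fallback is purely an assumption.

```python
import re


def segment_pattern(segment):
    """Map one path segment to a type letter followed by its length."""
    n = len(segment)
    if re.fullmatch(r"[0-9]+", segment):
        return f"I{n}"  # inferred: integer segment
    if re.fullmatch(r"[0-9a-fA-F]+", segment):
        return f"H{n}"  # inferred: hexadecimal segment
    return f"S{n}"  # assumption: generic string fallback, not shown on the slide


def path_pattern(path):
    """Collapse a URL path into its pattern, segment by segment."""
    return "/" + "/".join(segment_pattern(s) for s in path.split("/") if s)


print(path_pattern("/f/1392240240/1255385580/2"))  # /H1/I10/I10/I1
```

Because both paths from the footnote map to the same pattern, reputation learned on one of the 1,445 URLs can carry over to the others even though each URL has no reputation of its own.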
Case Study #2
Somoto Adware
● Filename: FreeZipSetup-[\d].exe
● Packed, short lifetime, prevalence = 0
● 1 related machine downloaded 1 known sample during our time window T = 10 days
Detected a campaign of 695 samples
● 616 were unknown to VirusTotal
● 61 still unknown 6 months later
Case Study #3
TTAWinCDM Spyware
● Machine and URL with _no_ reputation
● Low lifetime, prevalence, and countries
● Mismatch on downloading process: Acrobat process vs. unauthoritative domain
● Flash 0-day (+2 months)
Bonus #1
Analysis of Window T
Bonus #2
Features Analysis
● Files analysis
● URLs analysis
Conclusions
Mastino: real-time detection of malware downloads through passive client monitoring
● Content-agnostic, behavioral analysis
● Real-world deployment at large scale
● Over 95% TP / 0.5% FP
● 0-days
Thank you!
@embyte
http://www.madlab.it
Babak Rahbarinia; Marco Balduzzi; Roberto Perdisci
Questions?