© 2013 Applied Communication Sciences. A Business of the SI
All Rights Reserved. Registered Trademark of TT Government Solutions, Inc.
Detection of Metamorphic Malware Variants Using Global Control Flow Analysis
Shane R Snyder, CERDEC, US Army, Aberdeen Proving Ground, MD
Hira Agrawal, Lisa Bahler, Mike Little, Josephine Micallef, Systems & Security Research, Applied Communication Sciences, Basking Ridge, NJ
The Problem
• Detect zero-day vulnerabilities arising from new variants of old malware
− A vast majority of existing malware is made up of variations of old malware
− There is a substantial lag between the time a new variant is discovered and the time its signature is added to the local AV signature database
− Machines remain vulnerable during this time window, which may often be long
Current Solutions are Inadequate
• Current techniques derive syntactic malware signatures, based on control structures such as specific byte sequences, flow graphs, and call graphs
• These signatures are easily defeated using automated program diversification techniques such as those employed by new metamorphic transformation engines
− Adding spurious subgraphs within flow graphs
− Adding spurious functions and calls to those functions
− Inlining and outlining code into and out of existing functions
Approach
Malware families and their variants:
Malware Family A: Variant A1, Variant A2, ..., Variant Aa
Malware Family B: Variant B1, Variant B2, ..., Variant Bb
...
Malware Family X: Variant X1, Variant X2, ..., Variant Xx
Current Approaches (one signature per variant):
Signature A1, Signature A2, ..., Signature Aa
Signature B1, Signature B2, ..., Signature Bb
...
Signature X1, Signature X2, ..., Signature Xx
Proposed Approach (one abstract signature per family):
Abstract Signature A
Abstract Signature B
...
Abstract Signature X
Even though the variants differ in their control and data flow, they share a common set of core elements!
Abstract, Semantic Signatures
Original Version:
    if x
        S1; S2;
    else
        S3; S4;
    S5;
[Figure: Flow Graph of the Original Version (nodes a, b, c, d)]
[Figure: Signature of the Original Version: a tree with root S5 (node d) and children "s1; s2" (node b) and "s3; s4" (node c)]
A Sample Variant:
    main():
        if x
            f1();
        else
            f3();
        S5;
    f1(): S1; f2();
    f2(): S2;
    f3(): S3; f4();
    f4(): S4;
[Figure: Flow Graph of the Sample Variant: flow graphs of main(), f1(), f2(), f3(), and f4() (nodes l through y)]
[Figure: Signature of the Sample Variant: a tree with root S5 (node q) and children "s1; s2" (node r) and "s3; s4" (node u)]
Construction of Local Signatures
Key: Predicate Node, Non-Predicate Node
(i) Malware (nodes a, b, c, d):
    if x
        S1; S2;
    else
        S3; S4;
    S5;
(ii) Flow Graph: nodes a, b, c, d
(iii) Pre Dominator Tree: rooted at a
(iv) Post-Dominator Tree: rooted at d
(v) Merged-Dominator Graph: nodes a, b, c, d
(vi) Super-block Dominator Graph and Tree: super-block "a, d" (x, s5) with children b (s1; s2) and c (s3; s4)
(vii) Malware Signature (Super-block Dominator Tree Projected Over Non-Predicate Nodes): d (s5) with children b (s1; s2) and c (s3; s4)
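The pre- and post-dominator trees in panels (iii) and (iv) can be reproduced for the four-node example with a minimal sketch like the one below. It uses the textbook iterative dominator computation, which is an assumption here (the slides do not say which algorithm is used), and the function names are hypothetical. The last expression pairs nodes n and m where n dominates m and m post-dominates n; reading that as the rule behind the "a, d" super-block is also an assumption drawn from the figure.

```python
def dominators(graph, entry):
    """Return {node: set of nodes that dominate it} by iterating to a fixpoint."""
    nodes = set(graph)
    preds = {n: set() for n in nodes}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            incoming = [dom[p] for p in preds[n]]
            new = (set.intersection(*incoming) if incoming else set()) | {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def immediate(dom, entry):
    """Map each node to its closest strict dominator (its parent in the tree)."""
    tree = {}
    for n, ds in dom.items():
        strict = ds - {n}
        if n != entry and strict:
            # Strict dominators form a chain; the closest has the largest set.
            tree[n] = max(strict, key=lambda d: len(dom[d]))
    return tree


def reverse(graph):
    """Flip every edge, so dominators of the result are post-dominators."""
    rev = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            rev[s].append(n)
    return rev


# Flow graph of the example: a is the predicate "if x", d is S5.
flow = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
dom = dominators(flow, "a")
pdom = dominators(reverse(flow), "d")
pre_tree = immediate(dom, "a")
post_tree = immediate(pdom, "d")
super_blocks = [(n, m) for n in sorted(flow) for m in sorted(flow)
                if n != m and n in dom[m] and m in pdom[n]]
print(pre_tree, post_tree, super_blocks)
```

On this example every node's parent is a in the pre-dominator tree and d in the post-dominator tree, and the only merged pair is (a, d).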
Construction of Global Signatures
(i) Malware Variant:
    main: if x f1(); else f3(); S5;
    f1() S1; f2();
    f2() S2;
    f3() S3; f4();
    f4() S4;
(ii) Flow Graphs: main, f1, f2, f3, f4 (nodes l through y)
(iv) Call Graph: main calls f1 and f3; f1 calls f2; f3 calls f4
(v) Intermediate Dominator Graph: blocks q; m,n; o,p; r,s,t; x; u,v,w; y
(vi) Mega Block Dominator Graph and Tree: blocks q; m,n,r,s,t,x; o,p,u,v,w,y
(vii) Variant Signature (Projected Mega Block Dominator Tree): q (S5) with children r,x (s1; s2) and u,y (s3; s4)
Flow Graph Projection
KEY: System or library call node; Entry or exit node; Other nodes
Original Flow Graph: nodes a, b, c, d, e, f, g, h, i, j, r
Projected Flow Graph: nodes a, b, g, j, r
Flow Graph Projection Steps
1. Identify and mark all nodes that represent system or library calls.
   Nodes: a, b, c, d, e, f, g, h, i, j, r
2. Mark the entry and exit nodes, if not already marked, as relevant.
   Nodes: a, b, c, d, e, f, g, h, i, j, r
3. Merge all cycles made up entirely of irrelevant nodes.
   Nodes: "a,d,e", b, c, f, g, h, i, j, r
4. Merge any irrelevant node that has a single predecessor node with the latter node.
   Nodes: "a,d,e,f", "b,c", "g,h", i, j, r
5. Merge any irrelevant node that has a single successor node with that node.
   Nodes: "a,d,e,f", "b,c", "g,h", "j,i", r
6. Assign the label of the first instruction of each node as the label of that node.
   Nodes: a, b, g, j, r
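The merge steps of the projection can be sketched as graph contractions. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a dict from node to successor set, the helper names (`contract`, `find_merge`, `project`) are hypothetical, self-loops are dropped on contraction, and the example graph and its edges are invented because the slides do not list the edges of their figure.

```python
def predecessors(graph, node):
    return {p for p, succs in graph.items() if node in succs}


def contract(graph, node, target):
    """Merge `node` into `target`, redirecting edges and dropping self-loops."""
    succs = graph.pop(node)
    graph[target] |= succs
    for other_succs in graph.values():
        if node in other_succs:
            other_succs.discard(node)
            other_succs.add(target)
    graph[target].discard(target)
    graph[target].discard(node)


def reachable_within(graph, start, allowed):
    """Nodes reachable from `start` by one or more edges, staying in `allowed`."""
    seen, stack = set(), [start]
    while stack:
        for s in graph[stack.pop()]:
            if s in allowed and s not in seen:
                seen.add(s)
                stack.append(s)
    return seen


def find_merge(graph, relevant, step):
    """Return a (node, target) pair to contract for this step, or None."""
    irrelevant = set(graph) - relevant
    for n in sorted(irrelevant):
        if step == "cycle":  # n and m lie on a cycle of irrelevant nodes
            for m in sorted(reachable_within(graph, n, irrelevant) - {n}):
                if n in reachable_within(graph, m, irrelevant):
                    return m, n
        elif step == "single_pred":
            preds = predecessors(graph, n) - {n}
            if len(preds) == 1:
                return n, preds.pop()
        else:  # "single_succ"
            succs = graph[n] - {n}
            if len(succs) == 1:
                return n, succs.pop()
    return None


def project(graph, relevant):
    """Apply the three merge steps in order, each until nothing changes."""
    g = {n: set(s) for n, s in graph.items()}
    for step in ("cycle", "single_pred", "single_succ"):
        while (merge := find_merge(g, relevant, step)) is not None:
            contract(g, *merge)
    return g


# Hypothetical flow graph: "call" is a system-call node; w, x, y, z are irrelevant.
flow = {
    "entry": {"x", "call", "w"},
    "x": {"y"}, "y": {"x", "exit"},
    "call": {"z", "w"}, "z": {"exit"},
    "w": {"exit"}, "exit": set(),
}
projected = project(flow, relevant={"entry", "call", "exit"})
print(projected)
```

Here the x/y cycle collapses and folds into `entry`, z folds into its single predecessor `call`, and w (two predecessors, one successor) folds into `exit`, leaving only the three relevant nodes.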
Call Graph Projection
KEY: Relevant function, which contains a system/library call; Function that directly or indirectly calls a relevant function; Other functions
Original Call Graph: f1, f2, f3, f4, f5, f6
Projected Call Graph: f1, f4, f5
Call Graph Projection Steps
1. Mark the root node of the call graph, if not already marked, as relevant.
   Nodes: f1, f2, f3, f4, f5, f6
2. Identify and mark any node that represents a function containing a relevant node as relevant.
   Nodes: f1, f2, f3, f4, f5, f6
3. Mark all predecessors of all marked nodes, as well as all their call sites, as relevant.
   Nodes: f1, f2, f3, f4, f5, f6
4. Merge any irrelevant node that has a single predecessor node with that node.
   Nodes: f1, f2, f3, f4, "f5,f6"
5. Remove all irrelevant nodes from which no relevant nodes may be reached.
   Nodes: f1, f4, "f5,f6"
6. Make the label of the first instruction in each node the label of that node.
   Nodes: f1, f4, f5
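A minimal sketch of these steps follows, assuming a call graph given as a dict from function name to callee list. The function name `project_call_graph`, the edges of the example, and the choice of f5 as the only function containing a system call are all assumptions made for illustration; the slides show only the node names. The sketch folds an irrelevant function into its caller only when that single caller is relevant, which is a simplification.

```python
def project_call_graph(calls, root, has_syscall):
    """Return (projected call graph, functions merged into each kept node)."""
    callers = {f: set() for f in calls}
    for f, callees in calls.items():
        for c in callees:
            callers[c].add(f)

    # Steps 1-3: the root, functions with system/library calls, and all
    # of their direct or indirect callers are relevant.
    relevant = {root} | set(has_syscall)
    work = list(relevant)
    while work:
        for p in callers[work.pop()]:
            if p not in relevant:
                relevant.add(p)
                work.append(p)

    # Merge step: fold an irrelevant function with exactly one caller
    # into that caller (only relevant callers are handled in this sketch).
    merged = {f: [] for f in relevant}
    for f in sorted(set(calls) - relevant):
        if len(callers[f]) == 1:
            (caller,) = callers[f]
            if caller in relevant:
                merged[caller].append(f)

    # Removal step: every remaining irrelevant function cannot reach a
    # relevant one (otherwise it would have been marked), so drop it.
    projected = {f: sorted(c for c in calls[f] if c in relevant)
                 for f in sorted(relevant)}
    return projected, merged


# Hypothetical edges chosen so the result matches the slide's node sets.
calls = {
    "f1": ["f2", "f3", "f4"],
    "f2": ["f3"], "f3": ["f2"],
    "f4": ["f5"], "f5": ["f6"], "f6": [],
}
projected, merged = project_call_graph(calls, root="f1", has_syscall={"f5"})
print(projected)  # {'f1': ['f4'], 'f4': ['f5'], 'f5': []}
print(merged)     # {'f1': [], 'f4': [], 'f5': ['f6']}
```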
Concept of Operations
1. Offline Analysis
   Malware binary (instance of known malware binary)
   → (a) Analyze and transform → Graph-based representation
   → (b) Generate abstract signatures → Abstract signatures (graph-based signatures in the malware library)
2. Online Analysis
   New, unknown binary instance
   → Graph-based representation
   → Abstract signature of the new binary
   → (c) Compare signatures: find the nearest neighbor of the new binary in the malware library and compute the distance between them
   → (d) distance < neighbor's threshold?
      yes: Malware instance!
      no: Benign instance!
For efficiency, the chosen "distance" measure must satisfy the triangle inequality!
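The online decision in steps (c) and (d) can be sketched as below. This is an illustration under assumptions: the library is a list of hypothetical (name, signature, threshold) entries, the signatures are plain integers standing in for graph-based signatures, and the nearest neighbor is found by a naive linear scan rather than a pruned search.

```python
def classify(query_sig, library, distance):
    """Return (verdict, nearest family name, distance to nearest neighbor)."""
    # (c) Compare signatures: nearest neighbor by exhaustive scan.
    name, sig, threshold = min(library, key=lambda e: distance(query_sig, e[1]))
    d = distance(query_sig, sig)
    # (d) Flag as malware only if within the neighbor's own threshold.
    verdict = "malware" if d < threshold else "benign"
    return verdict, name, d


# Toy stand-ins: integers as "signatures", absolute difference as distance.
library = [("FamilyA", 100, 5), ("FamilyB", 200, 5)]
toy_distance = lambda a, b: abs(a - b)

print(classify(102, library, toy_distance))  # ('malware', 'FamilyA', 2)
print(classify(160, library, toy_distance))  # ('benign', 'FamilyB', 40)
```

Storing the threshold with each library entry mirrors the "neighbor's threshold" wording of step (d).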
Tree Edit Distance
Trees, written as parent(children):
T1: u(v(w), x(y))
T2: u(v(w(z)), x(y))
T3: u(w(z), x(y))
T1 to T2: insert (z,1,w)
T2 to T3: delete (v,1,u)
Edit Distance, De (Ti,Tj) = the length of the shortest Edit Script (Ti,Tj)
Edit Script (T1,T3): (i) insert z as the first child of w (ii) delete v, the 1st child of u
or, (i) delete v, the 1st child of u (ii) insert z as the first child of w
or, (i) relabel v to w (ii) relabel w to z
De(T1,T3) = 2
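The distances for T1, T2, and T3 can be checked with a small sketch of ordered tree edit distance with unit costs for insert, delete, and relabel. It uses the standard rightmost-root forest recursion with memoization, which is simple but slow; the slides do not say which algorithm the authors use, so this choice is an assumption. Trees are nested tuples `(label, children)`.

```python
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Edit distance between two ordered forests, unit cost per operation."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f: its children take its place.
    delete_v = forest_dist(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g.
    insert_w = forest_dist(f, g[:-1] + w_kids) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    match = (forest_dist(v_kids, w_kids)
             + forest_dist(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete_v, insert_w, match)


def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))


def leaf(label):
    return (label, ())


T1 = ("u", (("v", (leaf("w"),)), ("x", (leaf("y"),))))
T2 = ("u", (("v", (("w", (leaf("z"),)),)), ("x", (leaf("y"),))))
T3 = ("u", (("w", (leaf("z"),)), ("x", (leaf("y"),))))

print(tree_edit_distance(T1, T2))  # 1
print(tree_edit_distance(T2, T3))  # 1
print(tree_edit_distance(T1, T3))  # 2
```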
Tree Edit Distance (cont’d)
[Figure: trees T1, T2, and T3 as on the previous slide]
De(T1,T2) = 1
De(T2,T3) = 1
De(T1,T3) = 2
De(T1,T3) ≤ De(T1,T2) + De(T2,T3)
De satisfies the triangle inequality!
Tree Edit Distance (cont’d)
[Figure: trees T1 (5 nodes) and T2 (6 nodes), which differ by the node z; trees T3 (105 nodes) and T4 (106 nodes), each with 100 additional nodes]
De(T1,T2) = 1
De(T3,T4) = 1 !!!
Normalized Tree Edit Distance
[Figure: trees T1 (5 nodes), T2 (6 nodes), T3 (105 nodes), and T4 (106 nodes)]
Dn(Ti,Tj) = De(Ti,Tj) / (|Ti| + |Tj|)
Dn(T1,T2) = 1/11
Dn(T3,T4) = 1/211 « 1/11
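As a quick arithmetic check, the normalized distance can be evaluated with exact fractions. The function name `normalized_distance` is hypothetical; the inputs are the edit distances and node counts stated on the slide.

```python
from fractions import Fraction


def normalized_distance(edit_distance, size_a, size_b):
    """Dn = De / (|Ta| + |Tb|), kept exact with Fraction."""
    return Fraction(edit_distance, size_a + size_b)


small = normalized_distance(1, 5, 6)      # T1 vs T2
large = normalized_distance(1, 105, 106)  # T3 vs T4
print(small, large)  # 1/11 1/211
```

The same single edit weighs far less between the two large trees, which is the point of normalizing.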
Normalized Tree Edit Distance (cont’d)
[Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes)]
Dn(T1,T2) = 1/11 = 10/110
Dn(T2,T3) = 1/11 = 10/110
Dn(T1,T3) = 2/10 = 22/110 > 20/110
Dn(T1,T3) > Dn(T1,T2) + Dn(T2,T3)
Dn does NOT satisfy the triangle inequality!
Normalized, Metric Tree Edit Distance
[Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes)]
Dnm(Ti,Tj) = 2*De(Ti,Tj) / (|Ti| + |Tj| + De(Ti,Tj))
Dnm(T1,T2) = 2*1/(11+1) = 2/12
Dnm(T2,T3) = 2*1/(11+1) = 2/12
Dnm(T1,T3) = 2*2/(10+2) = 4/12
Dnm(T1,T3) ≤ Dnm(T1,T2) + Dnm(T2,T3)
Dnm satisfies the triangle inequality!
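The contrast between the two normalizations on this example can be checked with exact fractions. This is only a check of the three trees above, not a proof that Dnm is a metric; the function names `dn` and `dnm` are hypothetical.

```python
from fractions import Fraction


def dn(de, size_a, size_b):
    """Plain normalization: De / (|Ta| + |Tb|)."""
    return Fraction(de, size_a + size_b)


def dnm(de, size_a, size_b):
    """Metric normalization: 2*De / (|Ta| + |Tb| + De)."""
    return Fraction(2 * de, size_a + size_b + de)


# Sizes: T1 = 5, T2 = 6, T3 = 5; De(T1,T2) = De(T2,T3) = 1, De(T1,T3) = 2.
dn_violates = dn(2, 5, 5) > dn(1, 5, 6) + dn(1, 6, 5)
dnm_holds = dnm(2, 5, 5) <= dnm(1, 5, 6) + dnm(1, 6, 5)
print(dn_violates, dnm_holds)  # True True
```

On these trees Dnm meets the triangle inequality with equality: 4/12 = 2/12 + 2/12.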
Detecting Variants Missed by AV
• We found a malware family with five variants
− Virus.Win32.Thorin
− Virus.Win32.Thorin.b
− Virus.Win32.Thorin.c
− Virus.Win32.Thorin.d
− Virus.Win32.Thorin.e
• A major AV product failed to detect the last variant, Virus.Win32.Thorin.e
• MAA correctly flags it as malware, based on the abstract signature it derives from the first variant
• It detects the other three variants as well, without requiring a separate, dedicated signature for each of them, as AV products often do
Checking for False Positives & Negatives
Platform   Number of Families   Number of Malware   Largest Family Size   Smallest Family Size
DOS        2                    2                   1                     1
FreeBSD    2                    7                   6                     1
id1242     1                    1                   1                     1
IIS        1                    4                   4                     4
IRC        2                    2                   1                     1
Linux      48                   67                  5                     1
MSIL       2                    2                   1                     1
MSWord     3                    3                   1                     1
Multi      11                   21                  4                     1
QNX        1                    2                   2                     2
Svat       1                    1                   1                     1
VBS        2                    2                   1                     1
Win32      822                  2742                89                    1
Win9x      133                  473                 62                    1
WinHLP     3                    5                   3                     1
Unknown    39                   39                  1                     1
Total      1034                 3373                89                    1
We chose ~500 Win32 samples from families of size 10 or more.
Fivefold Cross Validation Test
Test Run   # of Test Binaries   True Positives (Count)   True Positives (%)   False Negatives (Count)   False Negatives (%)
1          104                  103                      99%                  1                         1%
2          109                  103                      94%                  6                         6%
3          104                  94                       90%                  10                        10%
4          106                  99                       93%                  7                         7%
5          105                  105                      100%                 0                         0%
Total      528                  504                      95%                  24                        5%
Checking Against Benign Samples
• We randomly picked a Windows 7 system folder containing over 400 executables, ranging in size from 8 KB to over 70 KB.
• The false-positive rate decreases predictably as the distance threshold is reduced.
Distance Threshold   Misclassifications (Count)   Misclassifications (%)
0.20                 103                          < 25%
0.19                 63                           < 16%
0.17                 28                           < 7%
0.15                 15                           < 4%
0.10                 4                            < 1%
0.05                 1                            < 0.25%
The Need for Identifying Sub Families
[Figure: Malware Library Distance Matrix, covering the families Bagle, Klez, Mydoom, Mimail, Netsky, and Roron]
Request Latency
[Figure: latency per request (horizontal axis: Request #, 1 to 103; vertical axis: 0 to 20,000), with series Total, Profiling, Uncompression, Unpacking, Disassembly, SigGeneration, and SigMatching]
Request Latency (cont’d)
Latency Time(s) Fraction
Total 5,427 100%
Disassembly 3,425 63%
Profiling 594 11%
Unpacking 240 4%
Signature Matching 205 4%
Signature Generation 185 3%
Uncompression 26 0.5%
System Overhead
[Figure: CPU Utilization and Memory Utilization (%) over time (horizontal axis: Measurement Point in Time, 1 to 267; vertical axis: 0 to 100)]
Average Utilization   Idle   Active
CPU                   2%     54%
Memory                30%    35%
Next Steps
• Linear AESA for Faster Signature Library Generation
• Identification of Malware Sub Families
• Automated Removal of “Redundant” Variants
• Family Specific Distance Thresholds
• Malware Fragment Matching
• Smart Label Matching
Powerhouse Research. Practical Solutions.
Applied Communication Sciences and design logo is a registered trademark of TT Government Solutions, Inc.
Backup Slides
Incorporate Data Flow in Signatures
• Some malware obfuscations inject redundant, unrelated instructions, which do not affect the malware's external behavior but change its abstract signature
• Data flow analysis can help detect and eliminate such instructions
− It can help determine the direct and indirect data dependencies of relevant instructions, i.e., system/library calls
− Programs make such calls to affect their environment: files, network, registry, other processes, etc.
− Malware must make such calls to accomplish its goals
− Nodes that do not have an influence on any system/library call can then be removed from abstract signatures
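The idea of keeping only statements that influence a system/library call can be sketched as a backward closure over data dependencies. The sketch below is a deliberately coarse, flow-insensitive approximation that ignores control dependencies (the predicates are left out); the statement encoding and the function name `relevant_statements` are assumptions, and the statements are taken from the obfuscated example in the backup slides.

```python
def relevant_statements(statements, syscalls):
    """Return labels of statements that may influence a system/library call.

    statements: list of (label, defined variables, used variables).
    syscalls: labels of statements that are system/library calls.
    """
    relevant = set(syscalls)
    needed = set()
    for label, _, uses in statements:
        if label in relevant:
            needed |= uses
    changed = True
    while changed:
        changed = False
        for label, defs, uses in statements:
            # A statement matters if it defines a variable that is needed.
            if label not in relevant and defs & needed:
                relevant.add(label)
                needed |= uses
                changed = True
    return relevant


obfuscated = [
    ("u = ...; v = ...", {"u", "v"}, set()),
    ("y = 1024", {"y"}, set()),
    ("u = -u; v = -v", {"u", "v"}, {"u", "v"}),
    ("y -= u*v", {"y"}, {"y", "u", "v"}),
    ("write(u*v)", set(), {"u", "v"}),
]
kept = relevant_statements(obfuscated, syscalls={"write(u*v)"})
print(sorted(kept))
```

The two injected statements that only touch y are dropped, because y never reaches `write`.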
The Need for Data Flow Analysis
Original Malware (nodes a, b, c, d):
    if x
        u = ...; v = ...;
    else
        u = -u; v = -v;
    write(u*v);
Control Flow Based Abstract Signature of Original Malware: d with children b and c
Obfuscated Code (nodes a, b, c, d, m, n, p, q):
    if x
        u = ...; v = ...;
    y = 1024;
    if !x
        u = -u; v = -v;
    while (y > 0)
        y -= u*v;
    write(u*v);
Control Flow Based Abstract Signature of Malware Variant: "m,d" with children b, c, and q
The Need for Data Flow Analysis (cont’d)
Obfuscated Code (nodes a, b, c, d, m, n, p, q):
    if x
        u = ...; v = ...;
    y = 1024;
    if !x
        u = -u; v = -v;
    while (y > 0)
        y -= u*v;
    write(u*v);
Control Flow Based Abstract Signature of Malware Variant: "m,d" with children b, c, and q
Control & Data Flow Based Abstract Signature of Original Malware: d with children b and c
Control & Data Flow Based Abstract Signature of Malware Variant: d with children b and c
Determining the Nearest Neighbor
Goal: Given (1) a database, D, of malware binaries, whose distances from one another are known, and (2) a new, query binary, q, which is not in D, find the nearest neighbor, n, of q in D.
Naïve Method: (1) Compute q’s distance from every malware in D. (2) The one with the shortest distance is the nearest neighbor of q.
Problem: Computing edit distances between graphical structures, including trees, is an expensive operation.
Solution: Exploit the triangle inequality to avoid many distance computations.
Determining the Nearest Neighbor (cont’d)
Key:
− New, query binary
− Malware binary whose distance from the query binary has been computed
− Malware binary whose distance from the query binary is, currently, unknown
− Currently known nearest malware neighbor of the query binary
− Malware binary that has been removed from further consideration as it cannot possibly be the nearest neighbor of the query binary in D
− Malware binary whose distance from the query binary is to be computed next
Exploiting Triangle Inequality
Each side of a triangle is ≤ the sum of the other two sides!
Each side of a triangle is ≥ the difference between the other two sides!
Take a triangle with sides a, b, and c. Pick any two sides, say a & b.
Suppose b ≥ a: b ≤ a + c ⇒ b – a ≤ c ⇒ c ≥ b – a ⇒ c ≥ | b – a |
Otherwise (b < a): a ≤ b + c ⇒ a – b ≤ c ⇒ c ≥ a – b ⇒ c ≥ | b – a |
Now take the query q, a candidate r, the points p1, p2, ..., pi whose distances to q are known, and the current nearest neighbor n:
qr ≥ | q pi – pi r | for all pi ⇒ qr ≥ max | q pi – pi r | over all pi = lowerbound (qr)
If lowerbound (qr) ≥ qn, then qr ≥ qn ⇒ There is no need to compute qr!
Exploiting Triangle Inequality (cont’d)
Associate a LowerBound with each pi, and initialize it to zero.
Every time pi q is computed:
− Update nearest neighbor, n, and the corresponding d
− Update LowerBound(pj q) for all j as: LowerBound(pj q) = max(LowerBound(pj q), | pj n – n q | )
− If LowerBound(pj q) > d, then remove pj from further consideration!
[Figure: the query q, its current nearest neighbor n at distance d, and candidates pi, pj, pk]
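The pruning loop can be sketched as below, in the style of AESA. It is a sketch under assumptions, not the authors' code: the function name `nearest_neighbor` is hypothetical, the library is a list with a precomputed pairwise distance matrix, the next candidate is the one with the smallest lower bound, and each bound is tightened with the just-computed point pi, i.e. | pj pi – pi q |, which coincides with | pj n – n q | whenever pi becomes the new nearest neighbor. The toy data are numbers on a line, an invented stand-in for signatures.

```python
def nearest_neighbor(query, library, pairwise, distance):
    """Return (index of nearest item, its distance, number of distance calls).

    pairwise[i][j] is the precomputed distance between library[i] and
    library[j]; distance(query, item) is the expensive metric.
    """
    lower = [0.0] * len(library)       # LowerBound for each candidate
    alive = set(range(len(library)))   # candidates still under consideration
    best, best_d, computed = None, float("inf"), 0
    while alive:
        # Examine the candidate with the smallest lower bound next.
        i = min(sorted(alive), key=lambda k: lower[k])
        alive.remove(i)
        d_qi = distance(query, library[i])
        computed += 1
        if d_qi < best_d:
            best, best_d = i, d_qi
        for j in sorted(alive):
            # Triangle inequality: d(q, pj) >= |d(pj, pi) - d(pi, q)|.
            lower[j] = max(lower[j], abs(pairwise[j][i] - d_qi))
            if lower[j] > best_d:
                alive.remove(j)        # cannot be the nearest neighbor
    return best, best_d, computed


# Toy metric space: points on a line with absolute difference as distance.
library = [0, 10, 20, 30, 40]
pairwise = [[abs(a - b) for b in library] for a in library]
result = nearest_neighbor(12, library, pairwise, lambda a, b: abs(a - b))
print(result)  # (1, 2, 2)
```

In this toy run only two of the five distances to the query are computed; the other three candidates are pruned by their lower bounds.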