Silvio Cesare and Yang Xiang, School of Management and Information Systems, Centre for Intelligent and Networked Systems, Central Queensland University

Malware Classification Using Structured Control Flow



Presented at AusPDC 2010 as part of ACSW. Full paper available on home page.


Page 1: Malware Classification Using Structured Control Flow

Silvio Cesare and Yang Xiang

School of Management and Information Systems, Centre for Intelligent and Networked Systems

Central Queensland University

Page 2: Malware Classification Using Structured Control Flow

Motivation

Malware is hostile, intrusive, or annoying software or program code.

Malware is a pervasive problem in distributed and networked computing.

Detection of malware is necessary for a secure environment.

Detecting variants of known malware provides great benefit for early detection of new strains.

Page 3: Malware Classification Using Structured Control Flow

Introduction

A variety of schemes exist to statically classify malware, for example n-grams, edit distances, and control flow.

Control flow can be identified as an invariant characteristic across strains in a family of malware.

Control flow analysis is hindered when malware hides its real code and contents using the ‘code packing transformation’.

Page 4: Malware Classification Using Structured Control Flow

Introduction to Code Packing

Code packing hides the malware’s real contents using encryption and compression.

Some legitimate software is packed.

79% of malware in one month during 2007 was packed [1].

50% of malware in 2006 were repacked versions of existing malware [2].

Typical behaviour of a packed program: at runtime, the hidden code is dynamically generated and then executed (self decompressing).

Automated unpacking extracts the hidden code by simulating the malware until the hidden content is revealed.

1. Panda Research, “Mal(ware)formation statistics - panda research blog,” 2007; http://research.pandasecurity.com/archive/Mal_2800_ware_2900_formation-statistics.aspx

2. A. Stepan, “Improving proactive detection of packed malware,” Virus Bulletin Conference, 2006.

Page 5: Malware Classification Using Structured Control Flow

Our Contribution

A novel system for approximate identification of control flow (flowgraph) signatures using the decompilation technique of structuring, and then using those signatures to classify a query program against a malware database.

A fast application level emulator that provides automated unpacking and is capable of real-time desktop use.

A novel algorithm to determine when to stop emulation, using entropy analysis.

We implement and evaluate our ideas in a prototype system that performs automated unpacking and malware classification.

Page 6: Malware Classification Using Structured Control Flow

Related Work

Automated unpacking

Whole system emulation: Pandora’s Bochs, Renovo

Dynamic binary instrumentation: Saffron

Native execution: OmniUnpack, Saffron

Virtualization: Ether

Malware classification

N-grams, n-perms of raw contents

Edit distance between basic blocks, inverted index and bloom filters.

Flowgraphs, exact and approximate. Call graphs and control flow graphs. ‘A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on the Endhost’.

Page 7: Malware Classification Using Structured Control Flow

Problem Statement

A database exists containing malware signatures.

Given a query program, the goal is to determine whether it is malicious.

Find the similarity between the query program and each of the malware in the database.

Similarity is a real number between 0 and 1.

Similarity is based on shared and invariant characteristics or features.

If the similarity exceeds a threshold, declare the program a malicious variant.

Page 8: Malware Classification Using Structured Control Flow

Our Approach

Identify code packing using entropy analysis.

Unpack the program using application level emulation, using entropy analysis to detect when unpacking is complete.

Identify characteristics (the control flow graph of each procedure) and generate signatures using ‘structuring’. Structuring decompiles the procedure into source code like control flow. The result is a string.

Use the string edit distance and an approximate dictionary search to measure the dissimilarity (and thus similarity) of each procedure to database signatures.

Accumulate the similarities of signatures for a final result. A similarity equal to or greater than 0.6 indicates a variant.

Page 9: Malware Classification Using Structured Control Flow

[Figure: system overview flowchart. Node labels: Win32 Executable; Packed? (Yes/No); Dynamic Analysis; Emulate; End of Unpacking? (Yes/No); Structure; Classify; Malware Database; New Signature; Malicious; Non Malicious.]

Page 10: Malware Classification Using Structured Control Flow

Identifying Packed Binaries

Entropy analysis measures the amount of ‘information’ in a text.

Compressed and encrypted content has high entropy.

Packed malware contains compressed or encrypted content.

A binary is identified as packed when it contains a sequence of high entropy blocks of data.
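
The entropy test on this slide can be illustrated with a minimal Python sketch. It computes Shannon entropy per fixed-size block and flags a buffer when several consecutive blocks are high entropy. The block size, the threshold of 7.0 bits per byte, and the run length are assumptions chosen for illustration, not values taken from the presentation.

```python
import math
import random
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(block).values())


def looks_packed(data: bytes, block_size: int = 1024,
                 threshold: float = 7.0, min_run: int = 2) -> bool:
    """Return True if `min_run` consecutive blocks reach the entropy threshold.

    block_size, threshold and min_run are illustrative assumptions.
    """
    run = 0
    for start in range(0, len(data) - block_size + 1, block_size):
        if shannon_entropy(data[start:start + block_size]) >= threshold:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


# Repetitive text stands in for ordinary code; seeded random bytes stand in
# for compressed or encrypted content.
plain = b"push ebp; mov ebp, esp; " * 200
rng = random.Random(0)
packed_like = bytes(rng.getrandbits(8) for _ in range(8192))

print(looks_packed(plain))        # low entropy throughout
print(looks_packed(packed_like))  # long run of high-entropy blocks
```

Requiring a run of blocks rather than a single block reduces the chance that a small embedded resource, such as a compressed icon, marks an otherwise ordinary binary as packed.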

Page 11: Malware Classification Using Structured Control Flow

Unpacking - Application Level Emulation

A more efficient approach than the whole system emulation employed by existing automated unpackers.

Implemented using interpretation.

Emulates:

The non privileged x86 Instruction Set Architecture.

Virtual memory, including segmentation.

Windows Structured Exception Handling.

The most common functions in the Windows API.

Linking and Loading.

Thread and Process management.

OS specific structures.

Page 12: Malware Classification Using Structured Control Flow

Verifying Emulation

Automate testing the correctness of emulation.

Emulate the malware in parallel to running the malware in a debugger.

Verify program state is the same between emulator and debugger.

Some instructions and APIs behave differently when debugged.

The debugger can rewrite these instructions on the fly to maintain correctness.

Page 13: Malware Classification Using Structured Control Flow

Detecting Completion of Hidden Code Extraction

Need to detect when the hidden code is revealed and emulation should stop. This point is known as the Original Entry Point (OEP).

Existing literature identifies execution of dynamically generated content by tracing writes to and execution of memory. But multiple layers of dynamically generated code exist. How do we know when to stop?

Our solution: use entropy analysis to identify packed data that hasn’t been accessed during execution, meaning unpacking hasn’t processed the packed data and is therefore not complete.

Emulation also stops if an unimplemented API is executed.
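
One possible reading of this stopping rule is sketched below in Python. It assumes the emulator records the set of addresses it has read (`read_addrs`), that "accessed" means at least one byte of a block was read, and that entropy is measured on the original image in fixed-size blocks. All function names, the block size, and the threshold are hypothetical and are not taken from the prototype.

```python
import math
import random
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(block).values())


def unprocessed_packed_blocks(image, read_addrs, block_size=1024, threshold=7.0):
    """Offsets of high-entropy blocks that no emulated instruction has read."""
    pending = []
    for start in range(0, len(image) - block_size + 1, block_size):
        if shannon_entropy(image[start:start + block_size]) < threshold:
            continue  # not packed data, so irrelevant to the decision
        if not any(a in read_addrs for a in range(start, start + block_size)):
            pending.append(start)
    return pending


def should_stop(image, read_addrs, executing_generated_code, hit_unimplemented_api):
    """Decide whether emulation should stop at the current instruction."""
    if hit_unimplemented_api:
        return True   # the emulator cannot continue faithfully
    if not executing_generated_code:
        return False  # still running the original unpacking stub
    # Written-then-executed code is running: stop only when no packed data
    # remains unread, otherwise another unpacking layer is likely to follow.
    return not unprocessed_packed_blocks(image, read_addrs)


# Toy image: one low-entropy block followed by one high-entropy block.
rng = random.Random(1)
image = bytes(1024) + bytes(rng.getrandbits(8) for _ in range(1024))

print(should_stop(image, set(), True, False))                    # packed block unread
print(should_stop(image, set(range(1024, 2048)), True, False))   # packed block consumed
```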

Page 14: Malware Classification Using Structured Control Flow

Flowgraph Based Signature Generation

Once unpacked:

1. Disassemble image.
2. Identify procedures.
3. Translate to an intermediate representation.
4. Build control flow graphs.
5. Transform control flow graphs to strings using structuring.
6. Calculate the weight of each string as the ratio of its size to the sum of all string sizes.
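
The weight calculation in step 6 is simple enough to show directly. The sketch below assumes that the "size" of a string is its length in characters; the function name and the example signatures are illustrative only.

```python
def signature_weights(signatures):
    """Weight of each signature: its length divided by the total length."""
    total = sum(len(s) for s in signatures)
    if total == 0:
        return [0.0 for _ in signatures]
    return [len(s) / total for s in signatures]


# Three hypothetical procedure signatures with lengths 2, 7 and 1.
weights = signature_weights(["BR", "BW{B}BR", "B"])
print(weights)
```

Because the weights sum to one, larger procedures contribute proportionally more to the final program similarity than small ones.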

Page 15: Malware Classification Using Structured Control Flow

Signatures Using Structuring

Transformation of a cfg to a string uses a variation of the structuring algorithm used in the DCC decompiler.

When a cfg can’t be structured, a goto is generated.

The source code like output is transformed to a smaller but semantically equivalent string of tokens representing control flow constructs like if() or while().

Similar control flow graphs have similar string signatures.

String signatures are amenable to string algorithms such as the edit distance.

Page 16: Malware Classification Using Structured Control Flow

[Figure: control flow graph with nodes L_0 to L_7; branch edges are labelled true.]

Signature:

BW|{BI{B}E{B}B}BR

Structured form:

proc() {
L_0: while (v1 || v2) {
L_1:     if (v3) {
L_2:     } else {
L_4:     }
L_5: }
L_7: return;
}

The relationship between a control flow graph, a high level structured graph, and a signature.
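
The mapping from structured code to a signature string can be sketched in Python. The token meanings used here (B for a basic block, W for while, | for an "or" in the loop condition, I for if, E for else, R for return) are inferred from the single example on this slide and are assumptions, as is the tuple-based tree representation.

```python
def signature(node) -> str:
    """Emit a token string for a structured control flow tree.

    Nodes are tuples: ("block",), ("return",), ("seq", [children]),
    ("while", condition_tokens, body), ("if", then_body, else_body_or_None).
    """
    kind = node[0]
    if kind == "block":
        return "B"
    if kind == "return":
        return "R"
    if kind == "seq":
        return "".join(signature(child) for child in node[1])
    if kind == "while":
        _, condition_tokens, body = node
        return "W" + condition_tokens + "{" + signature(body) + "}"
    if kind == "if":
        _, then_body, else_body = node
        out = "I{" + signature(then_body) + "}"
        if else_body is not None:
            out += "E{" + signature(else_body) + "}"
        return out
    raise ValueError(f"unknown node kind: {kind}")


# Tree for the proc() example: a block, a while loop with an "or" condition
# containing a block, an if/else and another block, then a block and a return.
proc = ("seq", [
    ("block",),
    ("while", "|", ("seq", [
        ("block",),
        ("if", ("seq", [("block",)]), ("seq", [("block",)])),
        ("block",),
    ])),
    ("block",),
    ("return",),
])

print(signature(proc))  # BW|{BI{B}E{B}B}BR
```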

Page 17: Malware Classification Using Structured Control Flow

Malware Classification

Finding similar signatures, or strings, is done by searching the malware database using an approximate dictionary search.

The similarity ratio, $w_{ed}$, is a measure of similarity between two signatures and is calculated from the distance between strings using the Levenshtein (edit) distance.

Levenshtein distance is the number of insertions, deletions and substitutions needed to transform one string into the other.

Using a similarity ratio of $s = 0.9$, we calculate the number of errors, $E$, or distance, allowed in the dictionary search.

$$w_{ed} = 1 - \frac{\mathrm{ed}(x, y)}{\max(\mathrm{len}(x), \mathrm{len}(y))}$$

$$E = \mathrm{len}(x)\,(1 - s)$$
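
The similarity ratio and the error bound can be made concrete with a short Python sketch. The Levenshtein routine is the standard dynamic programming algorithm. Rounding the error bound down to an integer is an assumption, and the linear scan in `best_match` is only a stand-in for an approximate dictionary search, which would normally use an index rather than comparing against every entry.

```python
import math


def levenshtein(x: str, y: str) -> int:
    """Number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # substitute or match
        prev = cur
    return prev[-1]


def similarity_ratio(x: str, y: str) -> float:
    """w_ed = 1 - ed(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longest


def allowed_errors(x: str, s: float = 0.9) -> int:
    """E = len(x) * (1 - s), rounded down (the rounding is an assumption)."""
    # round() guards against results such as 1.9999999999999996.
    return math.floor(round(len(x) * (1.0 - s), 9))


def best_match(query: str, database, s: float = 0.9):
    """Closest database signature within E errors of the query, or None."""
    limit = allowed_errors(query, s)
    best = None
    for entry in database:
        if levenshtein(query, entry) <= limit:
            ratio = similarity_ratio(query, entry)
            if best is None or ratio > best[1]:
                best = (entry, ratio)
    return best


query = "BW|{BI{B}E{B}B}BR"
database = ["BR", "BW{BI{B}E{B}B}BR"]
print(allowed_errors(query))
print(best_match(query, database))
```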

Page 18: Malware Classification Using Structured Control Flow

Malware Classification Algorithm

Similarity ratios for each control flow graph in the query binary are found based on the best approximate match in the malware database.

The asymmetric similarity is the sum of the weighted similarity ratios.

Two weights are possible for each matching flowgraph – the weight from the malware database, and the weight from the query binary – resulting in two asymmetric similarities.

The final result, program similarity, is the product of the asymmetric similarities.

$$S_x = \sum_i \begin{cases} w_{ed_i} \cdot \mathrm{weight}_{x_i}, & w_{ed_i} \ge t \\ 0, & w_{ed_i} < t \end{cases}$$
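
The accumulation step can be sketched as follows. Each match carries the similarity ratio of one query flowgraph to its best database match, together with that flowgraph's weight in the query binary and the matching flowgraph's weight in the database entry. The cut-off of 0.9 for counting a match and the example numbers are assumptions for illustration; the 0.6 decision threshold is the one stated earlier in the presentation.

```python
def asymmetric_similarity(pairs, t=0.9):
    """Sum of w_ed * weight over matches whose ratio w_ed is at least t."""
    return sum(w_ed * weight for w_ed, weight in pairs if w_ed >= t)


def program_similarity(matches, t=0.9):
    """Product of the two asymmetric similarities.

    matches: list of (w_ed, weight_in_query, weight_in_database) tuples.
    """
    s_query = asymmetric_similarity([(w, wq) for w, wq, _ in matches], t)
    s_database = asymmetric_similarity([(w, wd) for w, _, wd in matches], t)
    return s_query * s_database


# Hypothetical matches for a query binary with three procedures.
matches = [
    (1.00, 0.5, 0.4),  # identical flowgraph
    (0.95, 0.3, 0.4),  # near match
    (0.50, 0.2, 0.2),  # below the cut-off, contributes nothing
]
similarity = program_similarity(matches)
print(similarity, similarity >= 0.6)
```

Taking the product means both programs must be largely covered by matching flowgraphs: a small query that matches a fragment of a large malware sample scores low on the database side, and so scores low overall.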

Page 19: Malware Classification Using Structured Control Flow

Evaluation

Unpacking Synthetic Samples

Tested the prototype against 14 public packing tools, packing the Windows programs hostname.exe (shown) and calc.exe.

Results indicate accurate detection of the original entry point, and a speed suitable for adoption in real-time desktop Antivirus.

Name               Time (s)   Num. Instr.
mew                0.13       56042
fsg                0.13       58138
upx                0.11       61654
packman            0.13       123959
npack              0.14       129021
aspack             0.15       161183
pe compact         0.14       179664
expressor          0.20       620932
winupack           0.20       632056
yoda’s protector   0.15       659401
rlpack             0.18       916590
telock             0.20       1304163
acprotect          0.67       3347105
pespin             0.64       10482466

Name               Revealed code and data   Number of stages to real OEP   Stages unpacked   % of instr. to real OEP unpacked
upx                13107                    1                              1                 100.00
rlpack             6947                     1                              1                 100.00
mew                4808                     1                              1                 100.00
fsg                12348                    1                              1                 100.00
npack              10890                    1                              1                 100.00
expressor          59212                    1                              1                 100.00
packman            10313                    2                              1                 99.99
pe compact         18039                    4                              3                 99.98
acprotect          99900                    46                             39                98.81
winupack           41250                    2                              1                 98.80
telock             3177                     19                             15                93.45
yoda’s protector   3492                     6                              2                 85.81
aspack             2453                     6                              1                 43.41
pespin             err                      23                             err               err

Page 20: Malware Classification Using Structured Control Flow

Evaluation of Flowgraph Based Classification

Tested classifying the Klez (first table), Netsky (second table), and Roron families of malware.

Results show high similarities between malware variants.

Klez variants:

      a      b      c      d      g      h
a     -      0.84   1.00   0.76   0.47   0.47
b     0.84   -      0.84   0.87   0.46   0.46
c     1.00   0.84   -      0.76   0.47   0.47
d     0.76   0.87   0.76   -      0.46   0.45
g     0.47   0.46   0.47   0.46   -      0.83
h     0.47   0.46   0.47   0.45   0.83   -

Netsky variants:

      aa     ac     f      j      p      t      x      y
aa    -      0.78   0.61   0.70   0.47   0.67   0.44   0.81
ac    0.78   -      0.66   0.75   0.41   0.53   0.35   0.64
f     0.61   0.66   -      0.86   0.46   0.59   0.39   0.72
j     0.70   0.75   0.86   -      0.52   0.67   0.44   0.83
p     0.47   0.41   0.46   0.52   -      0.61   0.79   0.56
t     0.67   0.53   0.59   0.67   0.61   -      0.61   0.79
x     0.44   0.35   0.39   0.44   0.79   0.61   -      0.49
y     0.81   0.64   0.72   0.83   0.56   0.79   0.49   -

Page 21: Malware Classification Using Structured Control Flow

Evaluation of Flowgraph Based Classification (cont)

Examined similarities between unrelated malware and programs (first table).

Evaluated likely occurrence of false positives by calculating the similarities between the set of Windows Vista system programs, which are mostly not similar to each other (second table).

Most programs showed a low similarity to others.

            cmd.exe   calc.exe   netsky.aa   klez.a   roron.ao
cmd.exe     -         0.00       0.00        0.00     0.00
calc.exe    0.00      -          0.00        0.00     0.00
netsky.aa   0.00      0.00       -           0.19     0.08
klez.a      0.00      0.00       0.19        -        0.15
roron.ao    0.00      0.00       0.08        0.15     -

Similarity   Matches
0.0          105497
0.1          2268
0.2          637
0.3          342
0.4          199
0.5          121
0.6          44
0.7          72
0.8          24
0.9          20
1.0          6

Page 22: Malware Classification Using Structured Control Flow

Conclusion

Malware can be classified according to similarity between flowgraphs.

We proposed algorithms to perform fast unpacking. We also proposed algorithms to classify malware.

Automated unpacking was demonstrated to be effective on synthetically packed samples, and fast enough for desktop Antivirus.

Finally, we demonstrated that by using our classification system, real malware variants could be identified.