1 Copyright 2011 Trend Micro Inc.

MALWARE CLASSIFICATION

Amber Zhang, Bharath Chandrasekhar

July, 2016

Solution summary

• Model selection: XGBoost
– Performance evaluation:
  • ROC score (Trend testing set): 0.9975
  • Accuracy (5-fold cross-validation): 0.987
• Attribute selection: 504 attributes
• Training time and resource usage
– 2 hours and 4 GB of memory to train the model
– Predicting 30,000 testing files in ~10 minutes
– 727 GB of raw data and generated features on disk

Problem statement

• Binary classification
– Training data comes in [data, label] pairs
– Each label is either “malicious” or “not malicious”
• 50,000 training samples were given
• 30,000 testing samples were given
• Goal: develop a robust and re-trainable model to predict whether a PE program is malicious or not malicious
– Continue training the existing model to further boost model performance
– Time cost for predicting: a lightweight model is preferred at equal model performance
– More features do not mean a better model

Data Format

• Portable executable files are broken into:
– Assembly source code (.asm file)
– Imports file: a list of all function names imported by the program
– Sections file: information on the assembly code sections of each program
– Info file: general information about the program
– Strings file: a list of all literal strings in the program

Project Flow and Resource Usage

Flow:
• Feed nGram opcode features to construct the initial model
• Extract and add new features
• Re-train model
• Evaluate model (cross-validation)
• Dimension reduction: select top features
• Compare, train and tune the model on the full dataset

Data used (as labeled on the diagram): randomly sampled 10,000 data; full dataset; full dataset

Resource usage:
• ~12 hours to scan through all training data
• ~2 hours to run random forest, ~7 GB of RAM
• ~4 hours to run random forest, ~4 GB of RAM
• ~40 GB+ of extracted files on disk, ~1 hour to train the model

Initial Model

290 Features: nGram opcode counts

1. Scan through the .asm assembly source code file and extract all 1- to 4-gram opcode sequences

2. Select frequent opcode patterns
– For one-gram opcodes: if the opcode is in a loop, count it 10 times more
– A pattern is frequent if it appears in the file more than 100 times

3. Run random forest to select the opcode features with top importance (based on reduced info gain):
– Top 200 one-gram, top 30 two-gram, top 30 three-gram, top 30 four-gram
– In total, 290 nGram opcode count features were selected

Examples:
• One-gram opcode: mov, mul, shr, lea…
• Two-gram opcode: mov_mul, mul_shr, shr_lea…
• Three, four, five gram

Loop pattern in assembly source code (anti-debug example):
    mov eax, fs:[30h]
    mov eax, byte [eax+2]
    test eax, eax
    jne @DebuggerDetected
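
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming the opcodes have already been parsed out of the .asm file into a list. The function names and the `_` separator are illustrative choices; only the "more than 100 times" cutoff comes from the slide. The 10x loop weighting and the random forest ranking are left out.

```python
from collections import Counter


def opcode_ngrams(opcodes, n_max=4):
    """Count every 1- to n_max-gram of consecutive opcodes."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(opcodes) - n + 1):
            # Join consecutive opcodes, e.g. ["mov", "mul"] -> "mov_mul"
            counts["_".join(opcodes[i:i + n])] += 1
    return counts


def frequent_patterns(counts, min_count=100):
    """Keep patterns that appear more than min_count times in the file."""
    return {pattern: c for pattern, c in counts.items() if c > min_count}


counts = opcode_ngrams(["mov", "mul", "shr", "lea"])
print(counts["mov_mul"], counts["mul_shr_lea"])
```

Each surviving pattern count then becomes one column of the feature matrix passed to the feature-ranking step.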

Feature importance: loss in info gain
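
The plot for this slide did not survive extraction. As a rough illustration of what "info gain" measures, the sketch below computes the information gain of a single threshold split on one feature. It is not the authors' code: a random forest aggregates gains like this over many splits and trees, and the function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# A feature that separates the classes perfectly yields the maximum gain.
print(info_gain([1, 2, 10, 11], [0, 0, 1, 1], threshold=5))
```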

Adding more features

8 Features: memory declaration count

• Count of memory declaration keywords: DD, DW, DB…
• Example: 7 DD declarations = 7 × 4 bytes
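
A minimal sketch of how such counts could be collected from the .asm text. The regular expression and function names are assumptions for illustration, and only the three keywords named on the slide are covered. Each keyword occurrence counts as one declaration, regardless of how many values follow it.

```python
import re
from collections import Counter

# Sizes in bytes of the declaration keywords named on the slide.
DECL_SIZES = {"db": 1, "dw": 2, "dd": 4}
DECL_RE = re.compile(r"\b(db|dw|dd)\b", re.IGNORECASE)


def declaration_counts(asm_text):
    """Count occurrences of each memory declaration keyword."""
    return Counter(m.group(1).lower() for m in DECL_RE.finditer(asm_text))


def declared_bytes(counts):
    """Bytes implied by the counts, assuming one value per declaration."""
    return sum(DECL_SIZES[kw] * n for kw, n in counts.items())


sample = "var_a dd 0\nvar_b dd 1\nvar_c db 41h\n"
counts = declaration_counts(sample)
print(counts["dd"], counts["db"], declared_bytes(counts))
```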

127 Features: section hex dump single byte count

• Single byte counts
• Count the occurrences of each single byte value in the section hex dump files:
– e.g. 128 × cc, 200 × ff, 568 × 6a, etc.
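
The counting itself is simple. The sketch below assumes the hex dump is available as text with whitespace-separated two-digit hex tokens, which is an assumption about the file layout rather than something the slides specify.

```python
from collections import Counter

HEX_DIGITS = set("0123456789abcdef")


def byte_counts(hex_dump):
    """Count each two-digit hex byte token in a whitespace-separated dump."""
    tokens = hex_dump.lower().split()
    return Counter(t for t in tokens
                   if len(t) == 2 and set(t) <= HEX_DIGITS)


counts = byte_counts("CC cc ff 6a 6A 6a")
print(counts["cc"], counts["ff"], counts["6a"])
```

Tokens that are not exactly two hex digits, such as addresses, are ignored in this sketch.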

58 other features: info, DLL imports, sections

10 Info fields

• File Entropy
• Size of Stack Reserve
• File Size
• Size of Image
• Loop Count
• Size of Code
• Size of Initialized Data
• Number of Sections
• Size of Header
• String Count

5 Derived fields (newly created)

• Loop Count to File Size
• Code Size to File Size
• String Count to File Size
• Loop Count to Code Size
• String Count to Code Size

26 DLL import counts

• KERNEL32.dll
• USER32.dll
• ADVAPI32.dll
• GDI32.dll
• MSVBVM60.DLL
• SHELL32.dll
• ntdll.dll
• ole32.dll
• COMCTL32.dll
• OLEAUT32.dll
• …

17 Section fields

• .text_entropy
• .rsrc_entropy
• .rsrc_vSize
• .data_entropy
• .text_rSize
• .text_vSize
• .rsrc_rSize
• .data_vSize
• .data_rSize
• .rdata_entropy
• …
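
The five derived fields are plain ratios of existing info fields. A minimal sketch follows, with hypothetical dictionary keys and a zero fallback for empty denominators, which is an assumption since the slides do not say how that case was handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derived_fields(info):
    """Build the five ratio features from raw info fields (keys are illustrative)."""
    return {
        "loop_count_to_file_size": safe_ratio(info["loop_count"], info["file_size"]),
        "code_size_to_file_size": safe_ratio(info["code_size"], info["file_size"]),
        "string_count_to_file_size": safe_ratio(info["string_count"], info["file_size"]),
        "loop_count_to_code_size": safe_ratio(info["loop_count"], info["code_size"]),
        "string_count_to_code_size": safe_ratio(info["string_count"], info["code_size"]),
    }


example = {"loop_count": 10, "string_count": 50, "code_size": 200, "file_size": 1000}
print(derived_fields(example))
```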

21 Features: .asm image

[Figure: .asm files rendered as images. Panels: Normal, Malicious, Malicious]

Pixel density changes across the file

• Chunk each file into 10 parts (10% of the file each) to abstract density changes

21 Features: .asm image

• Read .asm assembly source code in binary mode

• Extract
– First 800 bytes
– Average pixel density in each region
– Standard deviation of pixel density in each region
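
One way the per-region statistics could be computed is sketched below. It assumes that "pixel density" means the raw byte value (0 to 255) and that the regions are ten equal-sized chunks of the file. Both are interpretations of the slides, and the function name is hypothetical.

```python
import statistics


def region_density_features(data, regions=10):
    """Mean and standard deviation of byte values in each equal-sized region."""
    # Integer boundaries that split the data into `regions` contiguous chunks.
    bounds = [len(data) * i // regions for i in range(regions + 1)]
    means, stds = [], []
    for start, end in zip(bounds, bounds[1:]):
        chunk = data[start:end]
        means.append(statistics.fmean(chunk) if chunk else 0.0)
        stds.append(statistics.pstdev(chunk) if chunk else 0.0)
    return means + stds


features = region_density_features(bytes(range(100)))
print(len(features), features[0])
```

In practice `data` would come from `open(path, "rb").read()`, matching the "binary mode" read described on the slide.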

Accuracy improved as new features were extracted

• Randomly sampled 10,000 data
• Extract new features
• Select attributes
• 10-fold cross-validation
• Keep predictive attributes
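
For readers unfamiliar with k-fold cross-validation, the sketch below shows how the sample indices can be split into 10 folds, each used once as the held-out set. It is a generic illustration with hypothetical names, not the tooling the authors used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10000, k=10):
    print(len(train), len(test))
    break
```

The reported accuracy is then the average of the per-fold accuracies.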

Intuition: the more the better

• High training error: underfit
– Not the case here: training error reaches 0.0000 using only 1,000 training samples!

• High testing error: overfit
– Avoiding overfitting became the main challenge

=> Reduce the dimension of the feature set as long as the model does not underfit the data

Selecting and tuning the model

Some models have inherent biases

XGBoost: a gradient boosting algorithm

• Objective function: minimize L + Ω
• Use gradient descent to find the optimum
• Parameters (30+ parameters available)
– max_depth = 3
– min_child_weight = 4
– n_estimators = 3000
– learning_rate = 0.1
– subsample = 0.8
– gamma = 0
– colsample_bytree = 0.8
• A subset of the data and of the attribute columns is resampled each time a new tree is built
• Reasons for choosing it: parallel processing, regularization to avoid overfitting, tree pruning, handling of missing values
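
The parameter values above can be written down as a configuration like the following. This is a sketch, assuming the scikit-learn style wrapper `XGBClassifier` from the `xgboost` package; the slides do not say which interface the authors used. The import sits inside the helper so the parameter dictionary can be inspected without the package installed.

```python
# Values taken from the slide; key names follow xgboost's scikit-learn API.
XGB_PARAMS = {
    "max_depth": 3,
    "min_child_weight": 4,
    "n_estimators": 3000,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "gamma": 0,
    "colsample_bytree": 0.8,
}


def build_model(params=XGB_PARAMS):
    """Create an untrained classifier (requires the xgboost package)."""
    from xgboost import XGBClassifier
    return XGBClassifier(**params)


print(sorted(XGB_PARAMS))
```

Calling `build_model().fit(X, y)` on a feature matrix and label vector would then train the model.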

XGBoost: tuning intuition vs. computation complexity

Objective function: minimize L + Ω

• L: controls predictive power
• Ω: controls simplicity of the model

[Formula annotations: each tree; number of tree leaves; score on the j-th leaf]
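
The formula image for this slide was lost in extraction. The annotations match the regularized objective given in the XGBoost documentation, so, assuming that is what the slide showed, it can be written as:

```latex
\mathrm{Obj} \;=\; \underbrace{\sum_{i} l\bigl(y_i, \hat{y}_i\bigr)}_{L}
\;+\; \sum_{k} \Omega(f_k),
\qquad
\Omega(f) \;=\; \gamma\, T \;+\; \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```

Here $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w_j$ is the score on the $j$-th leaf. Larger $\gamma$ or $\lambda$ penalizes complex trees more heavily.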

XGBoost: tuning intuition vs. computation complexity

Objective function: minimize L + Ω

• L: controls predictive power
• Ω: controls simplicity of the model (gamma, lambda)

Parameters:

• n_estimators: number of trees
• max_depth: maximum depth of a tree
• min_child_weight: minimum sum of instance weight required in a child in order to split the node
• subsample: randomly sample a subset of the data
• colsample_bytree: randomly sample a subset of the features

ROC score: Trend 30,000 testing set

A note on evaluation metric: ROC

• What if the training data contains 98% non-malicious files?
• ROC: you can specify a threshold to catch almost all TP
– The ROC score (area under the curve) lies in [0, 1]
– 0.5 is a random guess
– Evaluates the model at all thresholds (the accuracy metric corresponds to one specific probability cutoff)
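
The 98% example can be made concrete. The sketch below computes the ROC score as the probability that a randomly chosen malicious file is scored above a randomly chosen non-malicious one, which is equivalent to the area under the ROC curve. The function names are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions at one probability cutoff."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# A model that always answers "not malicious" on a 98% non-malicious set:
labels = [0] * 98 + [1] * 2
scores = [0.0] * 100
print(accuracy(labels, scores), roc_auc(labels, scores))
```

The constant model looks strong on accuracy but scores no better than a random guess on ROC, which is why the ROC score is the more informative metric on imbalanced data.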

Potential further analysis: meta learner

[Diagram: sub-learners (Logistic Regression, Neural Nets, SVM) feed a meta learner (voting, geometric mean, weighted average, etc.)]

• Each sub-learner outputs probabilities for binary classification
– Different models
– Different subsets of features

• Meta learner aggregates the results
– Voting
– Weighted average
– Geometric mean, etc.

• Reduce bias!
– Strong feature
– Model bias
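
The three aggregation rules named above can be sketched as follows, taking one predicted probability per sub-learner for a single file. This is an illustration of the proposed idea with hypothetical function names, since the slides describe it only as potential further work.

```python
import math


def majority_vote(probs, threshold=0.5):
    """Return 1 if more than half of the sub-learners predict malicious."""
    votes = sum(p >= threshold for p in probs)
    return 1 if votes * 2 > len(probs) else 0


def weighted_average(probs, weights=None):
    """Weighted mean of the probabilities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(probs)
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def geometric_mean(probs):
    """Geometric mean, which is pulled down strongly by any low probability."""
    return math.prod(probs) ** (1.0 / len(probs))


# Outputs of three hypothetical sub-learners for one file.
probs = [0.9, 0.8, 0.4]
print(majority_vote(probs), weighted_average(probs), geometric_mean(probs))
```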

Solution summary

• Model selection: XGBoost
– Performance evaluation:
  • ROC score (Trend testing set): 0.9975
  • Accuracy (5-fold cross-validation): 0.987
– Parameters:
  • max_depth = 3, min_child_weight = 4, n_estimators = 3000, learning_rate = 0.1, subsample = 0.8, gamma = 0, colsample_bytree = 0.8
• Attribute selection: 504 attributes
– Top 290 nGram opcode counts, top 127 single byte counts, top 8 memory declaration keyword counts, top 15 info fields, top 26 DLL import function counts, top 17 section info fields, top 11 asm image density features, top 10 asm image statistics
• Training time and resource usage
– 2 hours and 4 GB of memory to train the model
– Predicting 30,000 testing files in ~10 minutes
– 727 GB of raw data and generated features on disk