Malware Classification Project

Copyright 2011 Trend Micro Inc.

MALWARE CLASSIFICATION

Amber Zhang, Bharath Chandrasekhar

July, 2016


Solution summary

• Model selection: Xgboost
  – Performance evaluation:
    • ROC score (Trend testing set): 0.9975
    • Accuracy (5 fold cross validation): 0.987
• Attribute selection: 504 attributes
• Training time and resource usage
  – 2 hours, 4G memory to train the model
  – Predicting 30,000 testing files in ~10 minutes
  – 727G raw data and generated features on disk


Problem statement

• Binary classification
  – Training data comes in [data, label] pairs
  – A label is either “malicious” or “not malicious”
• 50,000 training samples were given
• 30,000 testing samples were given
• Goal: develop a robust and re-trainable model to predict whether a PE program is malicious or not malicious
  – Continue training the existing model to further boost model performance
  – Time cost for predicting: a lightweight model is preferred at the same model performance
  – More features do not mean a better model


Data Format

• Portable executable files broken into:
  – Assembly source code (.asm file)
  – Imports file: a list of all function names imported by the program
  – Sections file: assembly code section information of each program
  – Info file: general information about the program
  – Strings file: a list of all literal strings in the program


Project Flow and Resource Usage

[Flow diagram]

• Stages:
  – Feed nGram opcode features to construct initial model
  – Extract and add new features, re-train model, evaluate model (cross validation)
    • Dimension reduction: select top features
  – Compare, train and tune the model on full dataset
• Data labels in the diagram: randomly sampled 10,000 data; full dataset; full dataset
• Resource labels in the diagram:
  – ~12 hours to scan through all training data
  – ~2 hours to run random forest, ~7 G of RAM
  – ~40G+ extracted files on the disk
  – ~1h to train the model


Initial Model








290 Features: nGram opcode counts

1. Scan through the .asm assembly source code file and extract all 1-4 gram opcodes

2. Select frequent opcode patterns
  – For one gram opcodes: if in a loop, count 10 times more
  – A pattern is frequent if it appears in the file more than 100 times

3. Run random forest to select the opcode features with top importance (based on reduced info gain):
  – Top 200 one gram, top 30 two gram, top 30 three gram, top 30 four gram
  – In total, 290 nGram opcode count features were selected

Loop pattern in assembly source code

• One gram opcode: mov, mul, shr, lea…
• Two gram opcode: mov_mul, mul_shr, shr_lea…
• Three, four, five gram

Anti Debug

    mov eax, fs:[30h]
    mov eax, byte [eax+2]
    test eax, eax
    jne @DebuggerDetected
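The nGram counting and the "more than 100 times" filter can be sketched in a few lines of Python. This is a minimal illustration, not the project's extraction code. The function names are hypothetical, the input is assumed to be an already-parsed list of opcode mnemonics, and the 10x loop weighting is left out because the slides do not say how loops were detected.

```python
from collections import Counter

def ngram_counts(opcodes, max_n=4):
    """Count every 1..max_n gram over a sequence of opcode mnemonics."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(opcodes) - n + 1):
            # Join with "_" to match the slide's naming, e.g. "mov_mul".
            counts["_".join(opcodes[i:i + n])] += 1
    return counts

def frequent_patterns(counts, min_count=100):
    """Keep patterns that appear more than min_count times in the file."""
    return {p: c for p, c in counts.items() if c > min_count}

# Toy input (hypothetical): a real file would yield thousands of opcodes.
opcodes = ["mov", "mul", "shr", "lea"]
counts = ngram_counts(opcodes)
```

Each key of `counts` would become one candidate feature column, with the per-file count as its value.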


Feature importance: loss in info gain
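To make "info gain" concrete, the sketch below computes the entropy reduction of one threshold split on one feature. It assumes the usual Shannon-entropy definition. A random forest aggregates such reductions over many splits and trees, and the exact importance measure used in the project is not stated in the slides. The data and names are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

def info_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: one opcode count per file, label 1 = malicious.
mov_counts = [5, 8, 120, 150]
labels = [0, 0, 1, 1]
gain = info_gain(mov_counts, labels, threshold=50)
```

A split that separates the classes perfectly recovers the full entropy of the labels, while a split that leaves every sample on one side gains nothing.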


Adding more features








8 Features: memory declaration count

• DD, DW, DB…
• Example: 7 DD → 7*4 bytes
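A rough sketch of counting declaration keywords follows. The keyword set and byte sizes are assumptions, since the slide names only DD, DW and DB and does not list all eight features. The byte total also assumes one element per declaration.

```python
import re
from collections import Counter

# Assumed directive sizes in bytes; only DD, DW, DB are named on the slide.
DECL_SIZES = {"db": 1, "dw": 2, "dd": 4, "dq": 8}

def declaration_counts(asm_text):
    """Count data-declaration keywords that appear as whole tokens."""
    counts = Counter()
    for token in re.findall(r"\b(db|dw|dd|dq)\b", asm_text.lower()):
        counts[token] += 1
    return counts

def declared_bytes(counts):
    """Total declared size, assuming one element per declaration."""
    return sum(DECL_SIZES[k] * n for k, n in counts.items())

# Seven DD declarations, as in the slide's example (variable names are made up).
sample = "\n".join("var_%d dd 0" % i for i in range(7))
counts = declaration_counts(sample)
```

With this input, `declared_bytes(counts)` gives 7*4 = 28 bytes, which matches the slide's example.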


127 Features: section hex dump single byte count

• Single byte counts
• Count the number of occurrences of each single byte in the section hex dump files:
  – 128 cc, 200 ff, 568 6a etc.
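Single byte counting reduces to a histogram over byte values. The sketch below is illustrative only. The function name is hypothetical, and the input is assumed to be raw bytes, or hex text converted with `bytes.fromhex`.

```python
from collections import Counter

def single_byte_counts(data):
    """Map each byte value (two lowercase hex digits) to its count."""
    return Counter(format(b, "02x") for b in data)

# Toy section content (hypothetical).
section = bytes([0xCC] * 3 + [0xFF] * 2 + [0x6A])
counts = single_byte_counts(section)

# The same function works on a textual hex dump once it is parsed.
from_text = single_byte_counts(bytes.fromhex("cc ff 6a cc"))
```

Byte values that never occur simply read as 0 from the `Counter`, so every file can be mapped onto the same fixed set of columns.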


58 other features: info, DLL imports, sections

10 Info fields

• File Entropy
• Size of Stack Reserve
• File Size
• Size of Image
• Loop Count
• Size of Code
• Size Of Initialized Data
• Number of sections
• Size of Header
• String Count

5 Derived fields (newly created)

• Loop Count to File Size

• Code Size to File Size

• String Count to File Size

• Loop Count to Code Size

• String Count to Code Size
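The five derived fields are plain ratios of existing info fields. A minimal sketch, with hypothetical dictionary keys and a zero-denominator guard added as an assumption:

```python
def ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def derived_fields(info):
    """Build the five ratio features from the raw info fields."""
    return {
        "loop_count_to_file_size": ratio(info["loop_count"], info["file_size"]),
        "code_size_to_file_size": ratio(info["code_size"], info["file_size"]),
        "string_count_to_file_size": ratio(info["string_count"], info["file_size"]),
        "loop_count_to_code_size": ratio(info["loop_count"], info["code_size"]),
        "string_count_to_code_size": ratio(info["string_count"], info["code_size"]),
    }

# Hypothetical values for one file.
info = {"file_size": 1000, "code_size": 400, "loop_count": 20, "string_count": 50}
fields = derived_fields(info)
```

Normalizing by file or code size lets files of very different sizes be compared on the same scale.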

26 DLL import counts

• KERNEL32.dll
• USER32.dll
• ADVAPI32.dll
• GDI32.dll
• MSVBVM60.DLL
• SHELL32.dll
• ntdll.dll
• ole32.dll
• COMCTL32.dll
• OLEAUT32.dll
• …

17 Section fields

• .text_entropy
• .rsrc_entropy
• .rsrc_vSize
• .data_entropy
• .text_rSize
• .text_vSize
• .rsrc_rSize
• .data_vSize
• .data_rSize
• .rdata_entropy
• …


21 Features: .asm image

[Figure: .asm files rendered as images; panels: Normal, Malicious, Malicious]

Pixel density changes across the file

• Chunk each file into 10 parts (10% of the file each) to abstract density changes


21 Features: .asm image

• Read .asm assembly source code in binary mode

• Extract
  – First 800 bytes
  – Average pixel density in each region
  – Standard deviation of pixel density in each region
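The per-region statistics can be sketched as follows, assuming that a "pixel" is one byte of the file read as a grayscale value from 0 to 255 and that regions are ten equal chunks. Both are assumptions, since the slides do not define pixel density precisely. The first-800-bytes feature is left out because its encoding is not described.

```python
import statistics

def region_density_features(data, regions=10):
    """Mean and population std of byte values in each of `regions` chunks."""
    size = max(1, len(data) // regions)
    means, stds = [], []
    for r in range(regions):
        # The last region absorbs any remainder bytes.
        chunk = data[r * size:(r + 1) * size] if r < regions - 1 else data[r * size:]
        values = list(chunk) or [0]  # empty chunk (tiny file) counts as zero density
        means.append(statistics.fmean(values))
        stds.append(statistics.pstdev(values))
    return means, stds

# Toy "file": first half dark, second half bright (hypothetical data).
data = bytes([0] * 50 + [255] * 50)
means, stds = region_density_features(data)
```

For a real file the bytes would come from something like `open(path, "rb").read()`, where `path` is a placeholder.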


Accuracy improved as new features were extracted

• Flow: randomly sampled 10,000 data → extract new features → select attributes → 10-fold cross validation → keep predictive attributes


Intuition: the more the better

• High training error: underfit
  – Training error reaches 0.0000 using only 1000 training samples!
• High testing error: overfit
  – Avoiding overfitting became the main challenge

=> Reduce the dimension of the feature set as long as the model does not underfit the data


Selecting and tuning the model








Some models have inherent biases


Xgboost: a gradient boosting algorithm

• Objective Function: minimize L+Ω
• Uses gradient descent to find the optimum
• Parameters (30+ parameters)
  – max_depth = 3
  – min_child_weight = 4
  – n_estimators = 3000
  – learning_rate = 0.1
  – subsample = 0.8
  – gamma = 0
  – colsample_bytree = 0.8
• Resamples a subset of the data and attribute columns each time a new tree is built
• Reasons for choosing it: parallel processing, regularization to avoid overfitting, tree pruning, handling of missing values
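The listed settings could be passed to xgboost's scikit-learn wrapper roughly as below. This is a sketch, not the project's training script. The keys follow the wrapper's spelling (`min_child_weight`, `n_estimators`), which is assumed to be what the slide's parameter names refer to, and `build_model` is a hypothetical helper.

```python
# Hyperparameters from the slide, spelled as in xgboost's scikit-learn wrapper.
XGB_PARAMS = {
    "max_depth": 3,
    "min_child_weight": 4,
    "n_estimators": 3000,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "gamma": 0,
    "colsample_bytree": 0.8,
}

def build_model():
    """Create an untrained classifier; requires the xgboost package."""
    # Imported lazily so the parameter dict can be used without xgboost installed.
    from xgboost import XGBClassifier
    return XGBClassifier(**XGB_PARAMS)
```

Typical use would be `model = build_model()`, then `model.fit(X_train, y_train)` and `model.predict_proba(X_test)[:, 1]` for the malicious-class probability, where `X_train`, `y_train` and `X_test` are placeholders for the 504-attribute feature matrices and labels.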


Xgboost: tuning intuition vs computation complexity

Objective Function: minimize L+Ω

• L: controls predictive power
• Ω: controls simplicity of the model, defined for each tree
• Quantities labelled in Ω: number of tree leaves; score on the j-th leaf


Xgboost: tuning intuition vs computation complexity

Objective Function: minimize L+Ω

• L: controls predictive power
• Ω: controls simplicity of the model
  – Gamma, Lambda
• n_estimators: number of trees
• max_depth: maximum depth of a tree
• min_child_weight: minimum score of a child in order to split the node
• subsample: randomly sample a subset of the data
• colsample_bytree: randomly sample a subset of the features
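For reference, the regularized objective is commonly written in the XGBoost formulation as shown below. The slide's own equation did not survive extraction, so this is the standard form and not a copy of the slide. Here T is the number of tree leaves, w_j is the score on the j-th leaf, γ and λ are the Gamma and Lambda parameters, and f_k is the k-th tree.

```latex
\mathrm{Obj} = \underbrace{\sum_{i} l\left(y_i, \hat{y}_i\right)}_{L}
             + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

A larger γ penalizes each additional leaf and a larger λ shrinks the leaf scores, so both push toward simpler trees at some cost in training fit.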


ROC score: Trend 30,000 testing set


A note on evaluation metric: ROC

• What if the training data contains 98% non-malicious files?
• ROC: you can specify a threshold to catch almost all TP
  – Score range: [0,1]
  – 0.5 is random guess
  – Evaluates the model at all thresholds (the accuracy metric is one specific probability cut)
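The 98% case can be illustrated with a small sketch that computes the area under the ROC curve as the fraction of (malicious, non-malicious) pairs ranked correctly. The pairwise formulation and the toy data are illustrative choices, not the evaluation code used in the project.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Accuracy at one specific probability cut."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical imbalanced set: 98 non-malicious files, 2 malicious.
labels = [0] * 98 + [1] * 2
constant = [0.0] * 100  # a "model" that always answers "not malicious"
```

On this set the constant model scores 0.98 accuracy but only 0.5 on the ROC metric, which is why accuracy alone can mislead when classes are imbalanced.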


Potential further analysis: meta learner

[Diagram: three sub-learners (Logistic Regression, Neural Nets, SVM) feed a meta learner (voting, geometric mean, weighted average etc.)]

• Each sub-learner outputs probabilities for binary classification
  – Different models
  – Different subsets of features
• Meta learner aggregates the results
  – Voting
  – Weighted average
  – Geometric mean etc.
• Reduce bias!
  – Strong feature
  – Model bias
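The three aggregation rules named above can be sketched as follows. The sub-learner outputs are hypothetical, and the slides present the meta learner as potential further analysis, not as something that was built.

```python
import math

def vote(probs, threshold=0.5):
    """Majority vote over sub-learner decisions; returns 1 (malicious) or 0."""
    votes = sum(1 for p in probs if p >= threshold)
    return 1 if votes * 2 > len(probs) else 0

def weighted_average(probs, weights):
    """Average of probabilities, weighted by trust in each sub-learner."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def geometric_mean(probs):
    """Geometric mean; a single low probability pulls the result down strongly."""
    return math.prod(probs) ** (1.0 / len(probs))

# Hypothetical malicious-class probabilities from three sub-learners for one file.
probs = [0.9, 0.6, 0.4]
decision = vote(probs)
blended = weighted_average(probs, [2, 1, 1])
```

Voting discards how confident each sub-learner is, while the two averaging rules keep that information, so the choice depends on how well calibrated the sub-learner probabilities are.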


Solution summary

• Model selection: Xgboost
  – Performance evaluation:
    • ROC score (Trend testing set): 0.9975
    • Accuracy (5 fold cross validation): 0.987
  – Parameters:
    • max_depth = 3, min_child_weight = 4, n_estimators = 3000, learning_rate = 0.1, subsample = 0.8, gamma = 0, colsample_bytree = 0.8
• Attribute selection: 504 attributes
  – Top 290 nGram opcode counts, top 127 single byte counts, top 8 memory declaration keyword counts, top 15 info fields, top 26 DLL import function counts, top 17 section info fields, top 11 asm image density features, top 10 asm image statistics
• Training time and resource usage
  – 2 hours, 4G memory to train the model
  – Predicting 30,000 testing files in ~10 minutes
  – 727G raw data and generated features on disk
