Machine Learning for Malware Classification and Clustering

Phil Roth, Data Scientist


Page 1: Machine Learning for Malware Classification and Clustering

Machine Learning for Malware Classification and Clustering

Phil Roth, Data Scientist


Page 2: Machine Learning for Malware Classification and Clustering

• PhD in particle astrophysics

• Switched to making images from radar data

• Switched to solving security problems with data

Phil Roth, Data Scientist

Page 3: Machine Learning for Malware Classification and Clustering

Outline

• Malware Detection

• Boosted Decision Trees

• Malware Features

• Evaluating Performance

• Bringing a Human into the Loop

Page 4: Machine Learning for Malware Classification and Clustering

The Problem: Antivirus

The security industry has declared antivirus dead, but there is no widely accepted replacement.

Machine Learning can be that replacement.

Page 5: Machine Learning for Malware Classification and Clustering

The Problem: Antivirus

• Antivirus uses signatures, heuristics, and hand-crafted rules that do not scale well

• Using polymorphism and obfuscation, malware authors can circumvent rule-based detection techniques
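
The brittleness of exact-match rules can be illustrated with a minimal sketch. It assumes the simplest possible "signature", a SHA-256 hash blocklist, which is an assumption made for illustration and not a description of how any particular antivirus product works. The payload bytes and the names `sha256_signature`, `blocklist`, and `is_flagged` are hypothetical.

```python
import hashlib


def sha256_signature(data):
    """Return the hex SHA-256 digest used here as a toy 'signature'."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload bytes; not a real sample.
original = b"MZ\x90\x00" + b"payload-bytes" * 4

# A toy blocklist holding the signature of the known sample.
blocklist = {sha256_signature(original)}


def is_flagged(data):
    """Exact-match detection: flag only if the hash is on the blocklist."""
    return sha256_signature(data) in blocklist


# A trivially "mutated" variant: one padding byte appended.
variant = original + b"\x00"

print(is_flagged(original))  # the known sample matches its own signature
print(is_flagged(variant))   # the one-byte variant has a different hash
```

Real signatures are usually more flexible than a whole-file hash, but the same cat-and-mouse dynamic applies: each rule covers only the variations its author anticipated.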


Page 6: Machine Learning for Malware Classification and Clustering

The Solution: Machine Learning

Machine Learning uses statistical techniques to learn patterns from large datasets

Two Steps:

• Feature Extraction

• Boundary Learning

Page 7: Machine Learning for Malware Classification and Clustering

Machine Learning Advantages

• Automation

• Deep Insights

• Scalability

• Generalization

Page 8: Machine Learning for Malware Classification and Clustering

Machine Learning Challenges

• Requires labels

• Requires large data sets

• The security field has very low tolerance for errors

Page 9: Machine Learning for Malware Classification and Clustering

Boosted Decision Trees

Basically, it’s a game of 20 questions

Source: https://en.wikipedia.org/wiki/Decision_tree_learning

A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf.


Page 10: Machine Learning for Malware Classification and Clustering

Boosted Decision Trees

• The trees are built by choosing “questions” that maximize the discrimination between two classes

• The model is called “boosted” because misclassified samples are given higher weight in future tree building
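
The two bullets above can be sketched in a few lines of plain Python. This is a minimal AdaBoost-style illustration, assuming one-question trees ("stumps") and labels of +1/-1; the slides do not say which boosting variant or library was used, so the function names (`best_stump`, `adaboost`, `predict`) and the toy data are hypothetical.

```python
import math


def stump_predict(stump, row):
    """Answer one 'question': is feature f greater than threshold t?"""
    _, f, t, polarity = stump
    return polarity if row[f] > t else -polarity


def best_stump(X, y, w):
    """Choose the question with the lowest weighted misclassification."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                stump = (0.0, f, t, polarity)
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def adaboost(X, y, rounds=3):
    """Fit `rounds` stumps, reweighting the samples after each one."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote of this stump
        model.append((alpha, stump))
        # Misclassified samples are scaled up, correct ones scaled down.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Weighted vote over all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score > 0 else -1


# Toy one-feature data that no single question can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

On this toy data the best single stump still misclassifies a third of the samples, while the weighted vote of three stumps separates the classes, because each round concentrates on what the previous ones got wrong.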


Page 11: Machine Learning for Malware Classification and Clustering

Why Boosted Decision Trees?

Proven results in security and physics

References:

https://www.kaggle.com/c/malware-classification/

http://arxiv.org/pdf/1511.04317.pdf

http://jmlr.org/proceedings/papers/v42/chen14.pdf

Page 12: Machine Learning for Malware Classification and Clustering

Malware Features

The extracted features determine your model’s performance, but there is a tradeoff

Complicated vs. Explainable

Page 13: Machine Learning for Malware Classification and Clustering

Complicated Features

Byte frequency and byte entropy features form a binary fingerprint that informs the model
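
As a rough sketch of what such features could look like, the snippet below computes a normalized 256-bin byte histogram and the Shannon entropy of a whole byte string. The slides do not specify the exact feature definitions (for example, whether entropy is computed over sliding windows), so whole-file entropy and the function names `byte_histogram` and `byte_entropy` are assumptions made for illustration.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Return 256 values: the fraction of the input taken by each byte."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    total = len(data)
    return [counts.get(b, 0) / total for b in range(256)]


def byte_entropy(data):
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


# Hypothetical inputs: constant padding versus every byte value once.
padding = b"\x00" * 64
uniform = bytes(range(256))

print(byte_entropy(padding))  # lowest possible: a single repeated byte
print(byte_entropy(uniform))  # highest possible: all 256 values equally likely
```

Packed or encrypted sections tend toward the high end of this range, which is one reason entropy is a popular input feature even though it is hard for an analyst to interpret directly.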


Page 14: Machine Learning for Malware Classification and Clustering

Explainable Features

Lists of capabilities don’t greatly help the model classify a sample, but they can provide more insight to an analyst.

This sample can:

• Record keystrokes

• Send/receive network traffic

• Modify registry

Page 15: Machine Learning for Malware Classification and Clustering

Evaluating Performance

We must be careful not to learn from “future” information:

[Figure: timelines marking the Train Data, the Test Data, and the model train times, annotated “Patterns learned here ... should not inform classifications here”]
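
One way to respect this constraint is to split the data by time rather than at random. The sketch below assumes each sample carries a first-seen date and that a single cutoff separates training from testing; the tuple layout, the dates, and the name `temporal_split` are hypothetical.

```python
from datetime import date

# Hypothetical samples: (first-seen date, sample id, label). Values are made up.
samples = [
    (date(2015, 1, 10), "a", 0),
    (date(2015, 3, 2), "b", 1),
    (date(2015, 6, 21), "c", 1),
    (date(2015, 9, 5), "d", 0),
    (date(2015, 11, 30), "e", 1),
]


def temporal_split(samples, train_time):
    """Train on samples seen before train_time, test on the rest."""
    train = [s for s in samples if s[0] < train_time]
    test = [s for s in samples if s[0] >= train_time]
    return train, test


train, test = temporal_split(samples, date(2015, 7, 1))
print([s[1] for s in train])
print([s[1] for s in test])
```

A random shuffle would mix later samples into the training set and make the measured performance look better than what the model would achieve on files it has not yet seen.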


Page 16: Machine Learning for Malware Classification and Clustering

Bringing Humans in the Loop

Amazon built an entire tool (Mechanical Turk) to cheaply generate labels from human intuition:

Are these products related?


Page 17: Machine Learning for Malware Classification and Clustering

Bringing Humans in the Loop

Our labels are more expensive to obtain, so choosing which samples to label is even more important.

Is this binary malicious?

Active Learning can help!

Page 18: Machine Learning for Malware Classification and Clustering

Bringing Humans in the Loop

When new data arrives, Active Learning tells analysts which labels would be most helpful.
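
A minimal sketch of one common active learning strategy, uncertainty sampling, is shown below. The slides do not say which strategy is used, so this choice is an assumption; the file names, the scores, and the function name `most_uncertain` are hypothetical.

```python
def most_uncertain(scores, k):
    """Return the k sample ids whose P(malicious) is closest to 0.5."""
    return sorted(scores, key=lambda sid: abs(scores[sid] - 0.5))[:k]


# Hypothetical model outputs for newly arrived, unlabeled samples.
scores = {
    "a.exe": 0.98,
    "b.exe": 0.52,
    "c.exe": 0.03,
    "d.exe": 0.41,
    "e.exe": 0.75,
}

# Ask the analyst to label the two samples the model is least sure about.
print(most_uncertain(scores, 2))  # ['b.exe', 'd.exe']
```

Samples scored near 0 or 1 are ones the model already handles confidently, so an analyst's time is usually better spent on those near the decision boundary.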


Page 19: Machine Learning for Malware Classification and Clustering

Integration

• Our malware classifier model has been integrated into our stealthy sensor and Hunt Platform

• Ask the other friendly Endgamers here for a demo!

Page 20: Machine Learning for Malware Classification and Clustering

[email protected]

@mrphilroth
