Using Data Science Techniques to Detect Malicious Behavior


Using Data Science Techniques to Help Detect Malicious Behavior

Phil Roth, Data Scientist

• An introduction to key data science concepts

• Challenges in applying those concepts to security data

• Why focusing on aiding a human security analyst can lead to better machine learning tools

• How Endgame’s enterprise product benefits from that focus

Key Takeaways

Data Science Process

Gather Raw Data

Process and Clean Data

Explore the Data

Apply a Model

Communicate the Result


Gather Raw Data

Data can come from many disparate sources.

Process and Clean Data

Raw data must be cleaned and features extracted.

Explore the Data

Finding relationships in the data provides hints about what features and models will be useful.

Apply a Model

Models exploit features and relationships in the data to make a statement.

Communicate the Result

The output of a data product is useless without effective and actionable communication.

Introduction to Machine Learning Models

In supervised learning, input data is labeled. An algorithm attempts to reproduce those labels on new unlabeled data.

input data          label
-3  -4   1   0      1
-4  -3   1   1      1
-4  -4   0   0      1
+4  +3   1   0      0
+3  +4   0   1      0
+3  +3   1   0      0

new data            label
-3  -4   1   1      ???

Supervised learning

A Support Vector Machine [1] finds the best separating boundary between two classes in space.

Supervised learning example

[1] http://scikit-learn.org/stable/modules/svm.html
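As a rough illustration of the supervised setting above, the sketch below fits a linear soft-margin SVM to the six labeled rows from the slide and then labels the new row. It is a minimal standard-library approximation (subgradient descent on the hinge loss), not the scikit-learn implementation referenced in the footnote. The function names and the values of `lr`, `lam`, and `epochs` are assumptions chosen for this toy data.

```python
# Toy data from the slide: four features per row, label 1 or 0.
X = [(-3, -4, 1, 0), (-4, -3, 1, 1), (-4, -4, 0, 0),
     (4, 3, 1, 0), (3, 4, 0, 1), (3, 3, 1, 0)]
y = [1, 1, 1, 0, 0, 0]


def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularized hinge loss (a simple linear soft-margin SVM)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t = 1 if yi == 1 else -1  # map labels {1, 0} to {+1, -1}
            margin = t * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks w a little on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # The point is misclassified or inside the margin: push the
                # boundary away from it.
                w = [wj + lr * t * xj for wj, xj in zip(w, xi)]
                b += lr * t
    return w, b


def predict(w, b, x):
    """Return 1 or 0 depending on which side of the boundary x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else 0


w, b = train_linear_svm(X, y)
new_row = (-3, -4, 1, 1)  # the unlabeled row from the slide
print("predicted label:", predict(w, b, new_row))
```

Because the first two features separate the two classes cleanly, the learned boundary places the new row with the label-1 examples. On real security data the classes are rarely this well separated.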

In unsupervised learning, input data is unlabeled. An algorithm attempts to find hidden structure in that data.

input data
-3  -4   1   0
-4  -3   1   1
-4  -4   0   0
+4  +3   1   0
+3  +4   0   1
+3  +3   1   0

group 1

group 2

Unsupervised learning

Unsupervised learning example

[Figure: successive k-means iterations (step 1, step 2, etc.)]

k-means clustering iteratively improves the location of cluster centers by moving them closer to cluster means
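The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration on the six unlabeled rows from the earlier slide, assuming k = 2 and a naive initialization that uses the first k points as starting centers. The function names are hypothetical, and a production implementation would use a better initialization and a library routine.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


data = [(-3, -4, 1, 0), (-4, -3, 1, 1), (-4, -4, 0, 0),
        (4, 3, 1, 0), (3, 4, 0, 1), (3, 3, 1, 0)]
labels, centers = kmeans(data, k=2)
print(labels)
```

With this starting point the first assignment mixes the two groups, and the next update of the centers separates the rows with negative leading features from those with positive ones.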

Challenges with Security Data

Recommendation Systems

Character Recognition: MNIST Database of Handwritten Digits

Security lacks open datasets

The DARPA Intrusion Detection Evaluation dataset is 15 years old and simulated, and techniques trained on it were never actionable.

Sharing data in the security industry will always be a challenge, one that even President Obama is attempting to address.


Labeling is an expensive process that requires expertise.

Are these products related?

vs.

Is this binary malicious?

Is this traffic an intrusion?

Security lacks easy labels

False positives lead to expensive analyst investigations and alert fatigue.

False negatives get CEOs fired.

Security lacks tolerance for errors
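A back-of-the-envelope calculation shows why a small false positive rate still produces alert fatigue when malicious events are rare. Every number in the sketch below (event volume, base rate, detection rate, false positive rate) is a hypothetical value chosen for illustration, not a figure from the talk, and `alert_stats` is a made-up helper name.

```python
def alert_stats(n_events, base_rate, tpr, fpr):
    """Expected alert counts for a detector applied to n_events events.

    base_rate: fraction of events that are actually malicious
    tpr: true positive rate (fraction of malicious events flagged)
    fpr: false positive rate (fraction of benign events flagged)
    """
    malicious = n_events * base_rate
    benign = n_events - malicious
    true_pos = malicious * tpr
    false_neg = malicious - true_pos
    false_pos = benign * fpr
    alerts = true_pos + false_pos
    precision = true_pos / alerts if alerts else 0.0
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "precision": precision,
    }


# Hypothetical scenario: one million events, 1 in 10,000 malicious,
# a detector that catches 99% of them and misfires on 1% of benign events.
stats = alert_stats(n_events=1_000_000, base_rate=0.0001, tpr=0.99, fpr=0.01)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed numbers, fewer than one alert in a hundred points at a real incident, so analysts spend nearly all of their time on false positives, and a malicious event can still slip through.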

Machine learning in security could benefit from focusing on “human in the loop” products over “the algorithm does it all” products.

Chess Analogy

1997: IBM’s supercomputer Deep Blue vs. Garry Kasparov

2005: Team ZackS vs. multiple Grandmasters in Freestyle Chess [2]

Human/Machine teams retained an edge over machines for decades

[2] Cowen, Tyler. Average Is Over. Chapter 5. 2013

Using the Human/Machine Model

Cloud-deployed virtual machines are clustered based on their behavior. The results are communicated to analysts and used to improve the detection of malicious behavior.

Endgame Implementation

Package, process, and user information is collected from the machines.

DBSCAN, a clustering algorithm, groups the machines based on that information.

Endgame implementation
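To make the clustering step concrete, here is a minimal standard-library sketch of DBSCAN applied to machines described by sets of attributes. The machine names, package lists, the Jaccard distance, and the `eps` and `min_samples` values are all assumptions made for illustration. The slides do not say which features, distance metric, or parameters Endgame's product uses.

```python
from collections import deque


def jaccard_distance(a, b):
    """1 minus the overlap between two sets (0 = identical, 1 = disjoint)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def dbscan(items, eps, min_samples, dist):
    """Return one label per item: a cluster id >= 0, or -1 for noise."""
    n = len(items)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All items within eps of item i (including i itself).
        return [j for j in range(n) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        near = neighbors(i)
        if len(near) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(j for j in near if j != i)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            near_j = neighbors(j)
            if len(near_j) >= min_samples:  # j is a core point: keep growing
                queue.extend(near_j)
    return labels


# Hypothetical machines, each described by the packages installed on it.
machines = {
    "web-1": {"nginx", "openssl", "python"},
    "web-2": {"nginx", "openssl", "python", "curl"},
    "web-3": {"nginx", "openssl", "curl"},
    "db-1": {"postgres", "openssl", "cron"},
    "db-2": {"postgres", "openssl", "cron", "rsync"},
    "db-3": {"postgres", "cron", "rsync"},
    "odd-1": {"netcat", "nmap", "tor"},
}
names = list(machines)
labels = dbscan([machines[m] for m in names], eps=0.6, min_samples=3,
                dist=jaccard_distance)
for name, label in zip(names, labels):
    print(name, "noise" if label == -1 else f"cluster {label}")
```

Unlike k-means, DBSCAN does not need the number of clusters in advance and leaves outliers unassigned. In this toy data the machine that resembles no other is marked as noise, which is the kind of result an analyst could be asked to review.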

• An introduction to key data science concepts

• Challenges in applying those concepts to security data

• Why focusing on aiding a human security analyst can lead to better machine learning tools

• How Endgame’s enterprise product benefits from that focus

Key Takeaways

For more information contact: egs-info@endgame.com
