Using Data Science Techniques to Help Detect Malicious Behavior
Phil Roth, Data Scientist
Key Takeaways
• An introduction to key data science concepts
• Challenges in applying those concepts to security data
• Why focusing on aiding a human security analyst can lead to better machine learning tools
• How Endgame’s enterprise product benefits from that focus
Data Science Process
Gather Raw Data
Process and Clean Data
Explore the Data
Apply a Model
Communicate the Result
Gather Raw Data: Data can come from many disparate sources.
Process and Clean Data: Raw data must be cleaned, and features must be extracted from it.
Explore the Data: Finding relationships in the data provides hints about which features and models will be useful.
Apply a Model: Models exploit features and relationships in the data to make a statement about it, such as a prediction or a grouping.
Communicate the Result: The output of a data product is useless without effective and actionable communication.
Introduction to Machine Learning Models
Supervised learning
In supervised learning, input data is labeled. An algorithm attempts to reproduce those labels on new, unlabeled data.
input data        label
-3  -4   1   0    1
-4  -3   1   1    1
-4  -4   0   0    1
+4  +3   1   0    0
+3  +4   0   1    0
+3  +3   1   0    0
new data          label
-3  -4   1   1    ???
Supervised learning example
A Support Vector Machine [1] finds the separating boundary with the widest margin between two classes in feature space.
[1] http://scikit-learn.org/stable/modules/svm.html
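As a minimal sketch of the supervised example, the six labeled rows and the one unlabeled row shown earlier can be fed to scikit-learn's `SVC`. The linear kernel is an assumption made here for illustration; the slides do not say which kernel or settings were used.

```python
from sklearn.svm import SVC

# Labeled input data: four features per row, taken from the slide.
X_train = [
    [-3, -4, 1, 0],
    [-4, -3, 1, 1],
    [-4, -4, 0, 0],
    [+4, +3, 1, 0],
    [+3, +4, 0, 1],
    [+3, +3, 1, 0],
]
y_train = [1, 1, 1, 0, 0, 0]

# Assumption: a linear kernel; the slide does not name one.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# The new, unlabeled row from the slide.
X_new = [[-3, -4, 1, 1]]
predicted = model.predict(X_new)
print("predicted label:", predicted[0])
```

The new row sits next to the rows labeled 1, so the fitted boundary places it on that side.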
Unsupervised learning
In unsupervised learning, input data is unlabeled. An algorithm attempts to find hidden structure in that data.
input data        group found
-3  -4   1   0    group 1
-4  -3   1   1    group 1
-4  -4   0   0    group 1
+4  +3   1   0    group 2
+3  +4   0   1    group 2
+3  +3   1   0    group 2
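Unsupervised learning example
k-means clustering iteratively improves the locations of the cluster centers. In each step, every point is assigned to its nearest center, and each center is then moved to the mean of the points assigned to it.
[Figure: cluster centers and assignments at step 1, step 2, etc.]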
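The iteration can be sketched in plain Python on the six unlabeled rows from the unsupervised learning slide. The function names and the choice of starting centers are illustrative assumptions, not part of the original talk; the starting centers are deliberately poor so that the improvement over several steps is visible.

```python
import math

def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]

def update(points, assignments, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for c, old_center in enumerate(centers):
        members = [p for p, a in zip(points, assignments) if a == c]
        if not members:
            new_centers.append(old_center)  # leave an empty cluster's center in place
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

def kmeans(points, centers, max_steps=100):
    """Alternate assignment and update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_steps):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [
    (-3, -4, 1, 0), (-4, -3, 1, 1), (-4, -4, 0, 0),
    (+4, +3, 1, 0), (+3, +4, 0, 1), (+3, +3, 1, 0),
]
# Assumption: start both centers on two neighboring points.
centers, assignments = kmeans(points, [points[0], points[1]])
print(assignments)
```

In practice a library implementation with a smarter initialization would normally be preferred; the sketch only makes the two alternating steps explicit.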
Challenges with Security Data
Security lacks open datasets
Other fields have well-known open datasets:
• Recommendation Systems
• Character Recognition: MNIST Database of Handwritten Digits
The DARPA Intrusion Detection Evaluation dataset is 15 years old and simulated, and techniques trained on it were never actionable.
Sharing data in the security industry will always be a challenge, one that even President Obama is attempting to address.
Security lacks easy labels
Labeling is an expensive process that requires expertise.
Is this binary malicious? Is this traffic an intrusion?
vs.
Are these products related?
Security lacks tolerance for errors
False positives lead to expensive analyst investigations and alert fatigue.
False negatives get CEOs fired.
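A rough arithmetic sketch shows why even a small false positive rate is costly when malicious events are rare. Every number below (event volume, number of malicious events, and both rates) is a hypothetical value chosen for illustration, not a measurement from the talk.

```python
def alert_breakdown(n_events, n_malicious, false_positive_rate, true_positive_rate):
    """Estimate daily alert counts for a detector with the given error rates."""
    n_benign = n_events - n_malicious
    false_alerts = n_benign * false_positive_rate
    true_alerts = n_malicious * true_positive_rate
    return {
        "false_alerts": false_alerts,
        "true_alerts": true_alerts,
        "missed": n_malicious - true_alerts,
        # Fraction of alerts that are actually malicious.
        "precision": true_alerts / (true_alerts + false_alerts),
    }

# Hypothetical scenario: 1,000,000 events per day, 100 of them malicious,
# and a detector with a 1% false positive rate and a 99% true positive rate.
result = alert_breakdown(1_000_000, 100, 0.01, 0.99)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed numbers, false alerts outnumber true alerts by roughly a hundred to one, which is the alert fatigue problem, while each missed event is a potential breach.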
Machine learning in security could benefit from focusing on “human in the loop” products over “the algorithm does it all” products.
Chess Analogy
1997: IBM’s supercomputer Deep Blue vs. Garry Kasparov
2005: Team ZackS vs. multiple Grandmasters in Freestyle Chess [2]
Human/machine teams retained an edge over machines for decades.
[2] Cowen, Tyler. Average Is Over. Chapter 5. 2013.
Using the Human/Machine Model
Endgame Implementation
Cloud-deployed virtual machines are clustered based on their behavior. The results are communicated to analysts and used to improve the detection of malicious behavior.
Package, process, and user information is collected from the machines.
DBSCAN, a clustering algorithm, groups the machines based on that information.
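A minimal sketch of this kind of clustering, using scikit-learn's `DBSCAN`, might look like the following. The machine names, the three numeric features, and the `eps` and `min_samples` values are all hypothetical; the slides do not describe how Endgame encodes package, process, and user information or how the algorithm is tuned.

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN

# Hypothetical features per machine:
# [packages installed, running processes, user accounts]
machines = {
    "web-1": [120, 35, 3],
    "web-2": [121, 36, 3],
    "web-3": [119, 34, 3],
    "db-1": [80, 20, 2],
    "db-2": [81, 21, 2],
    "db-3": [80, 19, 2],
    "odd-1": [200, 90, 9],
}
names = list(machines)
features = [machines[name] for name in names]

# Assumed parameters: points within distance 5 of each other are neighbors,
# and two neighboring points are enough to form a cluster.
labels = DBSCAN(eps=5.0, min_samples=2).fit_predict(features)

# Group machine names by cluster; DBSCAN marks outliers with the label -1.
groups = defaultdict(list)
for name, label in zip(names, labels):
    groups[int(label)].append(name)

for label, members in sorted(groups.items()):
    title = "outliers" if label == -1 else f"group {label}"
    print(f"{title}: {', '.join(members)}")
```

A machine that fits no group is reported separately, which gives an analyst a short list to look at first. Real features would likely need scaling or a different distance measure before clustering.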
Key Takeaways
• An introduction to key data science concepts
• Challenges in applying those concepts to security data
• Why focusing on aiding a human security analyst can lead to better machine learning tools
• How Endgame’s enterprise product benefits from that focus
For more information contact: [email protected]