CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Satyajeet
Dept of Computer & Information Sciences
University of Delaware
Automatic Analysis of Malware Behavior using Machine Learning
Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz
Abstract & Introduction
• Malware
• Poses a major threat to the security of computer systems.
• Very diverse: viruses, Internet worms, Trojan horses.
• Amount of malware: millions of hosts are infected.
• Obfuscation and polymorphism impede detection at the file level.
• Dynamic analysis helps in characterizing malware and defending against it.
Abstract & Introduction Contd..
• Framework for the automatic analysis of malware behavior using machine learning.
• Clustering: the framework automatically identifies novel classes of malware with similar behavior.
• Classification: unknown malware is assigned to these discovered classes.
• An incremental approach to behavior-based analysis combines both.
Automatic analysis of Malware Behavior
• Framework steps and procedure
• Malware binaries are executed and monitored in a sandbox environment. A report is generated on their system calls and arguments.
• The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
• ML techniques are then applied to the embedded reports to identify and classify malware.
• Incremental analysis progresses by alternating between clustering and classification.
Report representation
• Can be textual or XML
• Human-readable and suitable for computing general statistics
• But not efficient for automatic analysis
• Hence MIST (Malware Instruction Set)
• Inspired by instruction sets used in processor design.
MIST
• Category of system calls
• Operation - Reflects a particular system call
• Arguments as argblocks.
Sandbox and MIST representation
Representation
• These sequential reports capture typical malware behavior, such as changing registry keys or modifying system files.
• But they are still not suitable for efficient analysis techniques. Hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses similarity of behavior geometrically, as a distance between vectors.
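The q-gram embedding and the distance computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the instruction names are made up, each q-gram is recorded by presence only, and vectors are scaled to unit length. The presence-only weighting and the normalization are assumptions, since the slides do not specify them.

```python
from math import sqrt

def qgram_embed(instructions, q=2):
    """Embed an instruction sequence as a sparse, unit-length q-gram vector.

    Assumption: a q-gram contributes 1 if it occurs at all (presence only),
    and the vector is then normalized to length 1.
    """
    grams = {tuple(instructions[i:i + q])
             for i in range(len(instructions) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / sqrt(len(grams))
    return {gram: weight for gram in grams}

def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical instruction names, used only for illustration.
report_a = ["open_file", "write_file", "set_registry", "connect"]
report_b = ["load_dll", "create_process", "exit"]

vec_a = qgram_embed(report_a)
vec_b = qgram_embed(report_b)
```

With unit-length vectors, identical behavior gives distance 0, and reports that share no q-grams are at distance sqrt(2), so distances are comparable across reports of different lengths.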
Clustering and Classification
• Reports are embedded in a vector space, so they are ready for applying ML techniques.
• Clustering of behavior: classes of malware with similar behavior are identified.
• Classification of behavior: malware is assigned to known classes of behavior.
• What allows us to do this?
• Malware binaries often belong to families of similar variants with similar behavior patterns!
Algorithms
• Prototype extraction
• Iterative algorithm
• Extracts a small set of prototypes from the set of reports. The first one is chosen at random.
• Clustering using Prototypes
• At the beginning, each prototype is its own cluster
• The algorithm repeatedly finds and merges the nearest pair of clusters
• Classification using Prototypes
• Learns to discriminate between classes of malware.
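Prototype extraction and prototype-based clustering can be sketched as below. The slides only say that extraction is iterative and starts from a random report. The farthest-first selection rule, the `max_dist` stopping threshold, complete linkage, and the `min_dist` merge threshold are all assumptions made for this illustration, and the function and parameter names are hypothetical.

```python
import random
from math import dist

def extract_prototypes(points, max_dist, seed=0):
    """Pick prototype indices until every point is within max_dist of one.

    Assumption: after a random first pick, the next prototype is always the
    point farthest from the prototypes chosen so far (farthest-first).
    """
    rng = random.Random(seed)
    prototypes = [rng.randrange(len(points))]
    nearest = [dist(p, points[prototypes[0]]) for p in points]
    while max(nearest) > max_dist and len(prototypes) < len(points):
        far = max(range(len(points)), key=nearest.__getitem__)
        prototypes.append(far)
        nearest = [min(d, dist(p, points[far]))
                   for d, p in zip(nearest, points)]
    return prototypes

def cluster_prototypes(points, prototypes, min_dist):
    """Agglomerative clustering: start with one cluster per prototype and
    merge the nearest pair until no pair is closer than min_dist."""
    clusters = [[i] for i in prototypes]

    def linkage(a, b):
        # Complete linkage (an assumption): distance of the farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, x, y = min((linkage(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d >= min_dist:
            break
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

# Two tight groups of toy 2-D "reports".
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5)]
protos = extract_prototypes(points, max_dist=0.5)
clusters = cluster_prototypes(points, protos, min_dist=1.0)
```

Clustering only the prototypes rather than all reports is what keeps the quadratic merge step affordable on large corpora.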
Algorithms Contd..
• For each report, the algorithm finds the nearest cluster prototype in the training data; if the report lies within a given radius, it is assigned to that cluster.
• Otherwise the report is rejected and held back for later incremental analysis.
• Incremental analysis
• Reports to be analyzed are received from a source.
• They are initially classified using prototypes of known clusters.
• Variants of known malware are thereby identified for further analysis.
• Prototypes are extracted from the remaining reports and clustered again.
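The classification-with-rejection step and the incremental loop can be sketched together as below. This is a simplified stand-in, not the paper's procedure: rejected reports are turned into new prototypes by a single greedy pass, and each new prototype forms its own cluster with no merge step. The names `classify`, `incremental_analysis`, and `radius` are hypothetical.

```python
from math import dist

def classify(report, prototypes, radius):
    """Return the cluster id of the nearest prototype, or None if the
    report is farther than radius from every prototype (rejected)."""
    if not prototypes:
        return None
    vector, cluster_id = min(prototypes, key=lambda p: dist(report, p[0]))
    return cluster_id if dist(report, vector) <= radius else None

def incremental_analysis(batches, radius):
    """Process batches of reports, alternating classification against
    known prototypes with discovery of new prototypes."""
    prototypes = []   # list of (vector, cluster_id)
    assignments = []  # list of (report, cluster_id)
    for batch in batches:
        # Step 1: classify against prototypes of known clusters.
        rejected = []
        for report in batch:
            cluster_id = classify(report, prototypes, radius)
            if cluster_id is None:
                rejected.append(report)
            else:
                assignments.append((report, cluster_id))
        # Step 2 (simplified): derive new prototypes from rejected reports.
        new_prototypes = []
        for report in rejected:
            cluster_id = classify(report, new_prototypes, radius)
            if cluster_id is None:
                cluster_id = len(prototypes) + len(new_prototypes)
                new_prototypes.append((report, cluster_id))
            assignments.append((report, cluster_id))
        prototypes.extend(new_prototypes)
    return prototypes, assignments

batches = [[(0, 0), (0.1, 0), (5, 5)], [(0, 0.1), (9, 9)]]
prototypes, assignments = incremental_analysis(batches, radius=1.0)
```

In the second batch, `(0, 0.1)` is recognized as a variant of a known cluster, while `(9, 9)` is rejected and seeds a new one.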
Experiments and Results
Evaluating components
• Prototype extraction
• Evaluated using precision, recall, and compression.
• Precision: 0.99 at corpus compression of 2.9% and 7%
• Clustering
• Evaluated using F-measure
• F-measure: 0.93 for MIST 1 and 0.95 for MIST 2, better than the 0.881 of previous related work
• Classification
• F-measure: 0.96 for MIST 1 and 0.99 for MIST 2
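For reference, the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below computes all three for a clustering. It assumes purity-style definitions: precision counts, per cluster, the reports of its dominant class; recall counts, per class, the reports in the cluster that holds most of that class. The slides do not spell these definitions out, so they are an assumption, and the toy labels are made up.

```python
from collections import Counter

def clustering_scores(clusters, classes):
    """Return (precision, recall, f_measure) for a clustering.

    clusters[i] is the cluster id of report i, classes[i] its true class.
    """
    n = len(clusters)
    pair_counts = Counter(zip(clusters, classes))
    best_in_cluster = {}
    best_in_class = {}
    for (cluster, label), count in pair_counts.items():
        best_in_cluster[cluster] = max(best_in_cluster.get(cluster, 0), count)
        best_in_class[label] = max(best_in_class.get(label, 0), count)
    precision = sum(best_in_cluster.values()) / n
    recall = sum(best_in_class.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy example: six reports, three clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 2]
classes = ["a", "a", "b", "b", "b", "b"]
precision, recall, f_measure = clustering_scores(clusters, classes)
```

Under these definitions, splitting one class across several clusters lowers recall, and mixing classes inside one cluster lowers precision.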
Conclusion
• A new framework is introduced that overcomes several deficiencies of previous work.
• The framework is learning-based.
• The framework can be implemented in practice.
• Steps: collect malware, monitor it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).
• The process is efficient and learns automatically after the initial setup and run.
Thank you !