Areej Al-Bataineh
Data Mining Basics
▪ Definition
▪ Some techniques: Association Rules, Classification, Clustering
Data mining meets Intrusion Detection
▪ Detection Approaches
▪ Data mining use in IDS
Case Study
▪ Behavioral Feature for Network Anomaly Detection
Conclusions
04/10/23Data Mining in Intrusion Detection 2
Knowledge Discovery in Databases (KDD)
▪ “Process of extracting useful information from large databases”
KDD basic steps
1. Understanding the application domain
2. Data integration and selection
3. Data mining
4. Pattern evaluation
5. Knowledge representation
Related fields
▪ Machine learning, statistics, others
Data mining
▪ “concerned with uncovering patterns, associations, changes, anomalies, and statistically significant structures and events in data”
Why data mining?
▪ Understand existing data
▪ Predict new data
Components
▪ Representation: decide what model to build; a model is a compact summary of the examples
▪ Learning element: builds a model from a set of examples
▪ Performance element: applies the model to new observations
Techniques well known and used in intrusion detection
▪ Association Rules [Descriptive]
▪ Classification [Predictive]
▪ Clustering [Descriptive]
Preliminary step
▪ Raw data -> database table (training set)
▪ Columns = attributes
▪ Rows = records
Association Rules
▪ Motivated by market-basket analysis
▪ Generate rules that capture implications between attribute values
Rule example
▪ Lettuce & Tomato -> Salad Dressing [0.4, 0.9]
Parameters [s, c]
▪ Support (s) = % of records that satisfy both LHS and RHS
▪ Confidence (c) = P(satisfies RHS | satisfies LHS)
Mining problem
▪ “Find all association rules that have support and confidence > user-defined minimum values”
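The support and confidence parameters defined above can be computed directly from a table of records. The following is a minimal sketch, assuming each record is a set of items; the basket data and the function name `support_and_confidence` are made up for illustration, so the resulting numbers differ from the [0.4, 0.9] example on the slide.

```python
from typing import FrozenSet, List, Tuple


def support_and_confidence(
    records: List[FrozenSet[str]],
    lhs: FrozenSet[str],
    rhs: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule lhs -> rhs."""
    # Records that satisfy the left-hand side of the rule.
    n_lhs = sum(1 for r in records if lhs <= r)
    # Records that satisfy both sides of the rule.
    n_both = sum(1 for r in records if (lhs | rhs) <= r)
    support = n_both / len(records)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical market-basket data (not from the slides).
baskets = [
    frozenset({"lettuce", "tomato", "salad dressing"}),
    frozenset({"lettuce", "tomato", "salad dressing", "bread"}),
    frozenset({"lettuce", "tomato"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
]

s, c = support_and_confidence(
    baskets,
    lhs=frozenset({"lettuce", "tomato"}),
    rhs=frozenset({"salad dressing"}),
)
print(f"support={s:.2f} confidence={c:.2f}")
```

A rule miner would evaluate many candidate rules this way and keep only those whose support and confidence both exceed the user-defined minimums.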
Classification
▪ Predefined set of classes
▪ Training set has the class as one of its attributes (supervised learning)
Mining problem
▪ “Find a model for class attribute as a function of the values of other attributes”
▪ Use the model to predict the class of new records
Classifier representation
▪ If-then rules
▪ Decision trees
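To make the classification problem concrete, here is a small sketch of a classifier represented as if-then rules. It learns one rule per value of a single attribute by majority vote, which is a deliberately simple stand-in and not a method named in the slides. The attribute names (`service`, `flag`) and the training records are hypothetical.

```python
from collections import Counter, defaultdict


def learn_rules(records, attribute, class_attr="class"):
    """Learn one if-then rule per attribute value: value -> majority class."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[attribute]][rec[class_attr]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(rules, record, attribute, default):
    """Apply the learned rules to a new record; fall back to a default class."""
    return rules.get(record[attribute], default)


# Hypothetical training set: the class is one of the attributes.
training = [
    {"service": "http", "flag": "SF", "class": "normal"},
    {"service": "http", "flag": "S0", "class": "attack"},
    {"service": "smtp", "flag": "SF", "class": "normal"},
    {"service": "ftp", "flag": "S0", "class": "attack"},
    {"service": "ftp", "flag": "SF", "class": "normal"},
]

rules = learn_rules(training, attribute="flag")
new_record = {"service": "dns", "flag": "S0"}
print(rules, predict(rules, new_record, "flag", default="normal"))
```

A decision tree generalizes this idea by testing several attributes in sequence instead of one.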
Clustering
▪ Given a data set and a similarity measure (unsupervised learning)
Mining problem
▪ “Group records into clusters such that records within a cluster are more similar to one another, and records in separate clusters are less similar to one another”
Similarity measures
▪ Euclidean distance if attributes are continuous
▪ Other problem-specific measures
Clustering methods
▪ Partitioning: divide data into disjoint partitions
▪ Hierarchical: root is the complete data set, leaves are individual records, and intermediate layers are partitions
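As an illustration of a partitioning method with Euclidean distance, the sketch below uses k-means. The slides do not name a specific algorithm, so k-means is an assumption chosen for brevity. The sample points are invented, and the first k points serve as initial centroids to keep the run deterministic.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Partition points into k disjoint clusters; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # simple deterministic start
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: euclidean(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-attribute records forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```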
Detection approach
▪ Misuse detection: based on known malicious patterns (signatures)
▪ Anomaly detection: based on deviations from established normal patterns (profiles)
Data source
▪ Network-based (NIDS): network traffic
▪ Host-based (HIDS): audit trails
Data mining use in IDS
▪ Signature extraction
▪ Rule matching
▪ Alarm data analysis: reduce false alarms, eliminate redundant alarms
▪ Feature selection
▪ Training data cleaning
Case study: Behavioral Feature for Network Anomaly Detection
▪ Training set = normal network traffic
▪ A feature provides the semantics of the values of data
▪ Feature selection is important
Proposed method
▪ Feature extraction based on protocol behavior
▪ Many attacks use the protocol improperly: Ping of Death, SYN Flood, Teardrop
Attributes
▪ Packet header fields
Feature
▪ Single or multiple attributes
Protocol specifications
▪ Policy for interaction
▪ Define attributes and the range of values
Flow
▪ Collection of packets exchanged between entities engaged in a protocol
▪ Client/server flows
Inter-Flow vs Intra-Flow Analysis (IVIA)
First step
▪ Identify the attributes used to partition traffic data into flows -> Src/Dst ports
▪ Result: HTTP flows, DNS flows, etc.
Next step
▪ Examine changes in attribute values between flows (inter-flow) and within a flow (intra-flow)
Results
▪ Operationally Variable Attributes (OVAs)
▪ Flow Descriptors
▪ Operationally Invariant Attributes
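IP header attributes by inter-flow and intra-flow changes

                     Intra-flow changes: Yes               Intra-flow changes: No
Inter-flow changes:  IHL, Service Type, Total Length,      Source Address,
Yes                  Identification, Flags_DF, Flags_MF,   Destination Address,
                     Fragment Offset, Time to Live,        Protocol
                     Options
Inter-flow changes:  (none)                                Version, Flags_reserved
No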
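The inter-flow versus intra-flow analysis can be sketched in a few lines. This is a simplified illustration under stated assumptions: packets are dictionaries of header fields, flows are keyed only on source and destination port, and the field names and sample packets are hypothetical.

```python
from collections import defaultdict


def classify_attributes(packets, flow_key=("sport", "dport")):
    """Label each non-key attribute by how its value changes across packets."""
    # Step 1: partition the traffic into flows.
    flows = defaultdict(list)
    for pkt in packets:
        flows[tuple(pkt[k] for k in flow_key)].append(pkt)

    # Step 2: check for value changes within flows and across the data set.
    result = {}
    for attr in (a for a in packets[0] if a not in flow_key):
        intra = any(len({p[attr] for p in pkts}) > 1 for pkts in flows.values())
        inter = len({p[attr] for p in packets}) > 1
        if intra:
            result[attr] = "operationally variable"
        elif inter:
            result[attr] = "flow descriptor"
        else:
            result[attr] = "operationally invariant"
    return result


# Hypothetical packets from two flows (an HTTP flow and a DNS flow).
packets = [
    {"sport": 1025, "dport": 80, "src": "10.0.0.1", "version": 4, "ttl": 64, "id": 100},
    {"sport": 1025, "dport": 80, "src": "10.0.0.1", "version": 4, "ttl": 64, "id": 101},
    {"sport": 1026, "dport": 53, "src": "10.0.0.2", "version": 4, "ttl": 64, "id": 7},
    {"sport": 1026, "dport": 53, "src": "10.0.0.2", "version": 4, "ttl": 63, "id": 8},
]

categories = classify_attributes(packets)
print(categories)
```

With so few packets the labels depend heavily on the sample; a real analysis would run over a large trace before fixing the categories.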
▪ Uses the 1999 DARPA IDS Evaluation data set
▪ Builds association rules for IP fragments using OVAs
▪ Result: top 8 ranked rules
Rule                                          Support  Strength
ipFlagsMF = 1 & ipTTL = 63 -> ipTLen = 28     0.526    0.981
ipID < 2817 & ipFlagsMF = 1 -> ipTLen > 28    0.309    0.968
ipID < 2817 & ipTTL > 63 -> ipTLen > 28       0.299    1.000
ipTLen > 28 -> ipID < 2817                    0.309    1.000
ipID < 2817 -> ipTLen > 28                    0.309    0.927
ipTTL > 63 -> ipTLen > 28                     0.299    0.988
ipTLen > 28 -> ipTTL > 63                     0.299    0.967
ipTLen > 28 & ipOffset > 118 -> ipTTL > 63    0.291    1.000
Transform OVAs into features that capture the protocol behavior
Behavior features
▪ Attribute observed over time/events
For an attribute, observe
▪ Entropy
▪ Mean and standard deviation
▪ Percentage of events with a given value
▪ Percentage of events that are monotonic
▪ Step size in attribute value
Training data requirements are reduced
▪ Normal = acceptable uses of the protocol
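The behavior features listed above can be computed from the sequence of values an attribute takes over successive events. The sketch below is one possible reading: "monotonic" is interpreted as the share of steps that increase, and "step size" as the mean difference between consecutive values. Both interpretations, the function name, and the sample IP identification values are assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def behavior_features(values, value_of_interest):
    """Summarize one attribute observed over a sequence of events."""
    n = len(values)
    counts = Counter(values)
    # Shannon entropy (in bits) of the observed value distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Differences between consecutive observations.
    steps = [b - a for a, b in zip(values, values[1:])]
    increasing = sum(1 for s in steps if s > 0)
    return {
        "entropy": entropy,
        "mean": mean(values),
        "std": pstdev(values),
        "pct_value": 100.0 * counts[value_of_interest] / n,
        "pct_increasing": 100.0 * increasing / len(steps) if steps else 0.0,
        "mean_step": mean(steps) if steps else 0.0,
    }


# Hypothetical IP identification values seen in one flow.
ip_ids = [100, 101, 102, 103, 104]
features = behavior_features(ip_ids, value_of_interest=100)
print(features)
```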
Uses aggregate attribute values for some window of packets
▪ Window size = 10
▪ Examples: TcpPerFIN = % of packets with FIN set; meanIAT = mean inter-arrival time
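The two example features can be computed over windows of 10 packets as follows. This is a sketch under assumptions: windows are taken as non-overlapping, each packet is a dictionary with a `time` stamp in seconds and a boolean `fin` flag, and the sample packets are synthetic.

```python
def window_features(packets, window_size=10):
    """Compute TcpPerFIN and meanIAT for each full window of packets."""
    features = []
    for start in range(0, len(packets) - window_size + 1, window_size):
        window = packets[start:start + window_size]
        # Percentage of packets in the window with the FIN flag set.
        fin_pct = 100.0 * sum(1 for p in window if p["fin"]) / window_size
        # Inter-arrival times between consecutive packets in the window.
        gaps = [b["time"] - a["time"] for a, b in zip(window, window[1:])]
        mean_iat = sum(gaps) / len(gaps) if gaps else 0.0
        features.append({"TcpPerFIN": fin_pct, "meanIAT": mean_iat})
    return features


# Synthetic flow: one packet every 0.5 s, FIN set only on the last packet.
packets = [{"time": i * 0.5, "fin": i == 9} for i in range(10)]
feats = window_features(packets)
print(feats)
```

Each resulting feature vector would be one training or test example for the classifier.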
50 flows for each protocol = 250 flows
▪ Number of packets per flow: 5 to 37,000
Use a decision tree classifier (C5)
▪ FTP, SSH, Telnet, SMTP, HTTP
Classifier tested on the DARPA data set
FTP    SSH    Telnet  SMTP  WWW
100%   100%   100%    82%   98%
Real network traffic: 85% - 100%
▪ Kazaa: 100%
[Figure: decision tree; only the split thresholds survive: 0.01, 0.4, 0.79, 546773, 0.03, 73]
Conclusions
Behavioral Features for Network Anomaly Detection
▪ Attribute values cannot be used as features
▪ Interpretation of protocol specifications
▪ Transform attributes into behavior features: aggregation of the attribute values
Data mining challenges
▪ Self-tuning data mining techniques
▪ Pattern-finding and prior knowledge
▪ Modeling of temporal data
▪ Scalability
▪ Incremental mining
Tools
▪ KDnuggets: web portal, http://www.kdnuggets.com
▪ WEKA: most comprehensive and free collection of tools, http://www.cs.waikato.ac.nz/ml/weka
Data sets
▪ Machine Learning Database Repository
▪ Knowledge Discovery in Databases Archive: http://kdd.ics.uci.edu
▪ MIT Lincoln Labs: http://www.ll.mit.edu/IST/ideval
Old
▪ Represent rules as a decision tree in memory
▪ Very inefficient
▪ Speed is linear in the number of rules
▪ Rules growing fast
New
▪ Multi-pattern search algorithm
▪ Apply multiple rules in parallel
▪ Set-wise methodology
▪ Fire the rule with the longest match
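The set-wise idea of matching all rule patterns in one pass and firing the longest match can be illustrated with a trie. This is a naive sketch, not the algorithm any particular IDS uses: it restarts the trie walk at every payload position, whereas production multi-pattern matchers avoid that rescanning. The rule identifiers, patterns, and payload are hypothetical.

```python
def build_trie(patterns):
    """Build a character trie from {rule_id: pattern}."""
    root = {}
    for rule_id, pattern in patterns.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        # Multi-character key, so it cannot collide with a single-character edge.
        node["$rule"] = rule_id
    return root


def longest_matches(trie, payload):
    """For each start offset, report the rule with the longest matching pattern."""
    hits = []
    for start in range(len(payload)):
        node, best = trie, None
        for end in range(start, len(payload)):
            node = node.get(payload[end])
            if node is None:
                break
            if "$rule" in node:
                best = (start, node["$rule"])  # a longer match overrides a shorter one
        if best:
            hits.append(best)
    return hits


# Hypothetical content rules; r1 is a prefix of r2.
rules = {"r1": "/etc", "r2": "/etc/passwd", "r3": "cmd.exe"}
trie = build_trie(rules)
hits = longest_matches(trie, "GET /etc/passwd HTTP/1.0")
print(hits)
```

All rules are checked in a single walk over the shared trie, so adding rules does not add a separate scan per rule.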