Learning Rules and Clusters for Network Anomaly Detection
Philip Chan, Matt Mahoney, Muhammad Arshad
Florida Institute of Technology
Outline
• Related work in anomaly detection
• Rule learning algorithm: LERAD
• Cluster learning algorithm: CLAD
• Summary and ongoing work
Related Work in Anomaly Detection
• Host-based
  – STIDE (Forrest et al., 96): system calls, instance-based
  – (Lane & Brodley, 99): user commands, instance-based
  – ADMIT (Sequeira & Zaki, 02): user commands, clustering
• Network-based
  – NIDES (SRI, 95): addresses and ports, probabilistic
  – SPADE (Silicon Defense, 01): addresses and ports, probabilistic
  – ADAM (Barbara et al., 01): hybrid anomaly-misuse detection
LERAD: Learning Rules for Anomaly Detection
(ICDM 03)
Probabilistic Models
• Anomaly detection: P(x | D with no attacks)
  – Given training data with no attacks, estimate the probability of seeing event x
  – Easier if event x was observed during training
    • actually, since x is normal, we aren't interested in its likelihood
  – Harder if event x was not observed (zero-frequency problem)
    • we are interested in the likelihood of anomalies
Estimating Probability with Zero Frequency
• r = number of unique values in an attribute in the training data
• n = number of instances with the attribute in the training data
• Likelihood of observing a novel value in an attribute is estimated by:
p = r / n
(Witten and Bell, 1991)
Anomaly Score
• Likelihood of a novel event = p
• During detection, if a novel event (unobserved during training) actually occurs:
  – anomaly score = 1/p [surprise factor]
Example
• Training Sequence 1 = a, b, c, d, e, b, f, g, c, h
  – P(NovelLetter) = 8/10
  – Z is observed during detection; anomaly score = 10/8
• Training Sequence 2 = a, a, b, b, b, a, b, b, a, a
  – P(NovelLetter) = 2/10
  – Z is observed during detection; anomaly score = 10/2
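The Witten-Bell estimate and the two sequences above can be sketched in a few lines of Python (illustrative, not the authors' code):

```python
def novelty_rate(training):
    """Witten-Bell estimate p = r / n: r = number of unique values
    observed, n = number of training instances."""
    r = len(set(training))
    n = len(training)
    return r / n

seq1 = list("abcdebfgch")   # Training Sequence 1
seq2 = list("aabbbabbaa")   # Training Sequence 2
p1, p2 = novelty_rate(seq1), novelty_rate(seq2)
print(p1, 1 / p1)   # 0.8, anomaly score 1.25 (= 10/8)
print(p2, 1 / p2)   # 0.2, anomaly score 5.0  (= 10/2)
```

A sequence full of novel letters (Sequence 1) makes a further novel letter unsurprising; a repetitive sequence (Sequence 2) makes it highly anomalous.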
Nonstationary Model
• More likely to see a novel value if novel values were seen recently (e.g., during an attack)
• During detection, record when the last novel value was observed
• ti = number of seconds since the last novel value in attribute Ai
• Anomaly score for Ai: Scorei = ti / pi
• Anomaly score for an instance = Σi Scorei
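The nonstationary score ti/pi for a single attribute can be sketched as follows (class and method names are illustrative; how the very first novel value is timed is an assumption):

```python
class AttributeScorer:
    """Sketch of the nonstationary model for one attribute:
    a novel value scores t / p, where t is the number of seconds
    since the last novel value in this attribute."""
    def __init__(self, training_values, start_time=0.0):
        self.seen = set(training_values)
        self.p = len(self.seen) / len(training_values)  # p = r / n
        self.last_novel = start_time    # assumption: clock starts at start_time

    def score(self, value, now):
        if value in self.seen:
            return 0.0                  # not novel: contributes nothing
        t = now - self.last_novel       # seconds since last novel value
        self.last_novel = now
        return t / self.p

# Per-instance score is the sum over its attributes:
# total = sum(s.score(v, now) for s, v in zip(scorers, instance))
```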
LEarning Rules for Anomaly Detection (LERAD)
• PHAD uses prior probabilities: P(Z)
• ALAD uses conditional probabilities: P(Z|A)
• It is more accurate to learn probabilities conditioned on multiple attributes: P(Z|A,B,C,…)
• But this leads to a combinatorial explosion
• LERAD is a fast algorithm based on sampling
Rules in LERAD
• A=a, B=b, C=c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3, …
• Unlike association rules, our rules allow a set of values in the consequent
• Unlike classification rules, our rules don't require a fixed attribute as the consequent
Semantics of a Rule
• A=a, B=b, C=c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD)
• r = size of Z (# of unique values in Z)
• n = # of tuples that satisfy the antecedent and have the Z attribute (support)
• P(Z ∉ {z1, z2, z3, …} | A=a, B=b, C=c) ≈ r/n
Overview of the Algorithm
• Randomly select pairs of tuples (packets, connections, …) from a sample of the training data
• Create candidate rules based on each pair
• Estimate the score of each candidate rule based on a sample of the training data
• Prune the candidate rules
• Update the consequent and calculate the score for each rule using the entire training set
Creating Candidate Rules
• Find the matching attributes; for example, given this randomly selected pair of tuples:
  <A=1, B=2, C=3, D=4> and <A=1, B=2, C=3, D=6>
• Attributes A, B, and C match
• Create these rules:
  – A=1, B=2 ⇒ C=?
  – B=2, C=3 ⇒ A=?
  – A=1, C=3 ⇒ B=?
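Candidate-rule creation from one pair of tuples can be sketched like this (a simplification that, as in the slide's example, uses all other matching attributes as the antecedent; function names are illustrative):

```python
def candidate_rules(t1, t2):
    """From a randomly selected pair of tuples, find the attributes
    on which they agree; each matching attribute in turn becomes the
    consequent, with the remaining matches as the antecedent."""
    matches = sorted(a for a in t1 if a in t2 and t1[a] == t2[a])
    rules = []
    for consequent in matches:
        antecedent = {a: t1[a] for a in matches if a != consequent}
        rules.append((antecedent, consequent))
    return rules

t1 = {"A": 1, "B": 2, "C": 3, "D": 4}
t2 = {"A": 1, "B": 2, "C": 3, "D": 6}
for ante, cons in candidate_rules(t1, t2):
    print(ante, "=>", cons, "= ?")
# e.g. {'B': 2, 'C': 3} => A = ?
```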
Estimating Rule Scores
• Randomly pick a sample from the training set to estimate the score (n/r) for each rule
• The consequent of each rule is now estimated:
  – A=1, B=2 ⇒ C ∈ {2, 3, 4}; n/r = 100/3
  – B=2, C=3 ⇒ A ∈ {1, 5}; n/r = 10/2
  – A=1, C=3 ⇒ B ∈ {1, 2, 3, …, 100}; n/r = 200/100
• The larger the score (n/r), the higher the confidence that the rule captures normal behavior
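Estimating n/r for a candidate rule on a sample can be sketched as follows (illustrative, not the authors' implementation):

```python
def estimate_score(rule, sample):
    """Estimate a candidate rule's score n/r on a data sample:
    n = number of tuples satisfying the antecedent (the support),
    r = number of unique consequent values among those tuples."""
    antecedent, consequent = rule
    matched = [t for t in sample
               if consequent in t
               and all(t.get(a) == v for a, v in antecedent.items())]
    n = len(matched)
    values = {t[consequent] for t in matched}   # estimated consequent set
    return (n / len(values) if values else 0.0), values

sample = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 4},
          {"A": 1, "B": 2, "C": 3}, {"A": 9, "B": 2, "C": 7}]
score, values = estimate_score(({"A": 1, "B": 2}, "C"), sample)
print(score, values)   # n/r = 3/2 = 1.5, C ∈ {3, 4}
```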
Pruning Candidate Rules
• To reduce the amount of time for learning from the entire training set and during detection:
  – Keep high-scoring rules: more confidence in the top rules
  – Redundancy check: some rules are not necessary
  – Coverage check: keep a minimum set of rules that describe the data
Redundancy Check
• Rule 1: A=1, B=2 ⇒ C ∈ {3, 4}
• Rule 2: A=1 ⇒ C ∈ {3, 4}
• Rule 2 is more general than Rule 1, which is redundant and can be removed
• Rule 3: B=2 ⇒ C ∈ {3, 4, 5, 6}
• Rule 2 and Rule 3 don't overlap
• Rule 4: * ⇒ C ∈ {3, 4, 5, 6}
• Rule 4 is more general than Rule 3, so Rule 3 is removed
Coverage Check
• A rule can cover multiple tuples, but a tuple is credited only to one rule (the highest-scoring rule that covers it)
• Rules are checked in descending order of score
• For each rule in the candidate rule set:
  – mark the tuples that are covered by the rule
• Rules that don't cover any tuples are removed
• Our coverage check subsumes the redundancy check
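The coverage check can be sketched as a greedy pass over rules sorted by score (a sketch; the rule representation is an assumption):

```python
def rule_covers(rule, t):
    """A rule (antecedent dict, consequent attr, value set) covers a
    tuple when the antecedent matches and the consequent value is in
    the rule's value set."""
    antecedent, consequent, values = rule
    return (all(t.get(a) == v for a, v in antecedent.items())
            and t.get(consequent) in values)

def coverage_prune(rules_with_scores, sample):
    """Walk rules in descending score order; each rule claims the
    still-unmarked tuples it covers, and a rule that claims nothing
    is dropped (this also removes redundant, more-specific rules)."""
    covered = [False] * len(sample)
    kept = []
    for score, rule in sorted(rules_with_scores, reverse=True,
                              key=lambda x: x[0]):
        claimed = False
        for i, t in enumerate(sample):
            if not covered[i] and rule_covers(rule, t):
                covered[i] = True
                claimed = True
        if claimed:
            kept.append((score, rule))
    return kept
```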
Final Training
• The selected rules are trained on the entire training set; the consequent and score are updated:
  – A=1, B=2 ⇒ C ∈ {2, 3, 4, 5, 6}; n/r = 100000/5
  – B=2, C=3 ⇒ A ∈ {1, 5}; n/r = 4000/2
• 90% of the data is used for training the rules
• 10% is used for validating the rules: rules that cause false alarms are removed (being conservative; the remaining rules are highly predictive)
Scoring during Detection
• Score for a matched rule that is violated:
  S = t × n/r
  where t is the duration since the last time the rule was violated (i.e., since an anomaly occurred with respect to the rule)
• Anomaly score for the tuple = Σi Si
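Detection-time scoring can be sketched per rule as follows (the class layout is illustrative; how t is initialized before the first violation is an assumption):

```python
class RuleDetector:
    """Sketch of LERAD detection: a rule that matches a tuple but is
    violated contributes S = t * n/r, where t is the time since this
    rule was last violated."""
    def __init__(self, antecedent, consequent, values, n, r, start=0.0):
        self.antecedent = antecedent
        self.consequent = consequent
        self.values = values
        self.nr = n / r                    # rule score from training
        self.last_violation = start        # assumption: clock starts at `start`

    def score(self, tup, now):
        if not all(tup.get(a) == v for a, v in self.antecedent.items()):
            return 0.0                     # antecedent not matched
        if tup.get(self.consequent) in self.values:
            return 0.0                     # rule satisfied
        s = (now - self.last_violation) * self.nr
        self.last_violation = now
        return s

# Tuple score is the sum over all rules:
# total = sum(rule.score(tup, now) for rule in rules)
```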
Attributes Used in LERAD-tcp
• TCP connections are reassembled (similar to ALAD)
• Last 2 bytes of the destination IP address
• 4 bytes of the source IP address
• Source and destination ports
• Duration (from the first packet to the last)
• Length, TCP flags
• First 8 strings in the payload (delimited by space/new line)
Attributes used in LERAD-all
• Attributes used in LERAD-tcp
• UDP and ICMP header fields
Experimental Data and Parameters
• DARPA 99 data set
• Training: Week 3; Testing: Weeks 4 & 5
• Training: 35K tuples (LERAD-tcp); 69K tuples (LERAD-all)
• Testing: 178K tuples (LERAD-tcp); 941K tuples (LERAD-all)
• 1,000 pairs of tuples were sampled to form candidate rules (more didn't help much)
• 100 tuples were sampled to estimate scores for candidate rules (more didn't help much)
Experimental Results
• Average of 5 runs
• 10 false alarms per day
• 201 attacks; 74 "hard-to-detect" attacks (Lippmann, 2000)
• LERAD-tcp: ~117 detections (58%); ~45 "hard-to-detect" (60%)
• LERAD-all: ~112 detections (56%); ~41 "hard-to-detect" (55%)
LERAD-all vs. LERAD-tcp
Detections (10 FA/Day):

Category   LERAD-all only   Both   LERAD-tcp only
PROBE             7           23          0
DoS              10           30          8
R2L               0           30          5
U2R               0           10         10
Total            17           93         23
Experimental Time Statistics
• Preprocessing: ~7.5 minutes (2.9GB, training set), ~20 minutes (4GB, test set)
• LERAD-tcp: ~6 seconds (4MB, training), ~17 seconds (17MB, testing)
• LERAD-all: ~12 seconds (8MB, training), ~95 seconds (91MB, testing)
• 50-75 learned final rules
Results from Mixed Data (RAID 03)
• DARPA 99 data set: attacks are real, background is simulated
• Compared with collected real data• Artifacts: smaller range of values, little “crud,”
values stop growing• Modified LERAD: 87 detections, 49 (56%)
legitimate• Mixed data: 30 detections, 25 (83%) legitimate
CLAD: Clustering for Anomaly Detection
(In Data Mining against Cyber Threats,
Kumar et al., 03)
Finding Outliers
• Cluster the data points
• Outliers: points in distant and sparse clusters
• Inter-cluster distance: average distance from the rest of the clusters
• Density: number of data points in a fixed-volume cluster
CLAD
• Simple, efficient clustering algorithm (for large amounts of data)
• Clusters have a fixed radius
• If a point is within the radius of an existing cluster:
  – Add the point to the cluster
• Else:
  – The point becomes the centroid of a new cluster
CLAD Issues
• Distance for discrete attributes:
  – Values that are more frequent are likely to be more normal and are considered "closer"
  – Distance is based on the difference in frequency of discrete values
• Frequencies follow power-law distributions, so take the logarithm
• Radius of clusters:
  – Select a small random sample
  – Calculate the distances of all pairs
  – Use the average of the smallest 1%
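The radius heuristic in the last bullet can be sketched as follows (the sample size and seed are illustrative parameters, not from the original):

```python
import random
from itertools import combinations

def estimate_radius(points, dist, sample_size=100, frac=0.01, seed=0):
    """Sketch of CLAD's radius selection: draw a small random sample,
    compute all pairwise distances, and average the smallest 1%."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    dists = sorted(dist(a, b) for a, b in combinations(sample, 2))
    k = max(1, int(len(dists) * frac))   # smallest 1% of pairs
    return sum(dists[:k]) / k
```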
Sparse and Dense Regions
• Outliers are in distant and sparse regions
• However, an attack might generate many connections, making its neighborhood not sparse
• So flag clusters that are (distant and sparse) or (distant and dense):
  – Distant: distance > avg(distance) + sd(distance)
  – Sparse: density < avg(density) − sd(density)
  – Dense: density > avg(density) + sd(density)
Experiments
• Weeks 1, 2, 4, and 5
• No explicit training/testing split; looking for outliers
• A model for each port
• Ports with less than 1% of the traffic are lumped into the "Others" model
• Anomaly scores are normalized in SDs; the "Combined" model simply merges the scores from different models
HTTP (Port 80)
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
0 1 2 3 4 5 6 7 8 9ln(Count)
Ln
(In
ter-
Dis
tan
ce
)
CD > 0.8
HTTP (Port 80)
[Scatter plot: Ln(Inter-Distance) vs. Ln(Count); clusters with CD < 0.2 marked]
SMTP (Port 25)
[Scatter plot: Ln(Inter-Distance) vs. Ln(Count); clusters with CD > 0.8 marked]
SMTP (Port 25)
[Scatter plot: Ln(Inter-Distance) vs. Ln(Count); clusters with CD < 0.2 marked]
Results
Attack Type   Attacks   Detections (10 FA/Day)
Probe            28      19 (70%)
DOS              45      25 (55%)
R2L              41      15 (37%)
U2R/Data         37      14 (38%)
Total           151      74 (49%)
LERAD vs. CLAD
LERAD                                     CLAD
Assumes all training data are normal      Training data can have unlabeled attacks
Off-line algorithm                        On-line algorithm
Concise and comprehensible models         Harder to explain alerts
Efficient detection                       Compares a large # of centroids
Ongoing Work
• On-line, noise-tolerant LERAD
• Applying LERAD to system calls, including arguments
• Tokenizing payload to create features
Data Mining for Computer Security Workshop at ICDM03
Melbourne, FL, Nov 19, 2003
www.cs.fit.edu/~pkc/dmsec03/
http://www.cs.fit.edu/~pkc/id/
Thank you
Questions?