DATA MINING
What To Cover
• Frequent Itemset Mining
• Association Rule Mining
• Clustering
• Classification
• Deviation (Outlier) Detection
Motivation for Outlier Analysis
• Fraud Detection
  • Credit card, telecommunications, criminal activity in e-commerce
• Customized Marketing
  • High/low income buying habits
• Medical Treatments
  • Unusual responses to various drugs
• Financial Applications
  • Stock tracking
What is an outlier?
• Global outlier: an observation inconsistent with the rest of the dataset
• Local outlier (a special kind of outlier): an observation inconsistent with its neighborhood
  • A local instability or discontinuity
• Figure: points O1 and O2 appear to be outliers relative to the rest of the data
Outlier Detection Approaches
• Objective:
  • Define what data can be considered inconsistent in a given data set
    • Statistical-Based Outlier Detection
    • Deviation-Based Outlier Detection
    • Distance-Based Outlier Detection
  • Find an efficient method to mine the outliers
Outlier Analysis - Outline
• Introduction / Motivation / Definition
• Statistical-based Detection
  • Distribution-based, depth-based
• Deviation-based Method
  • Sequential exception, OLAP data cube
• Distance-based Detection
  • Index-based, nested-loop, cell-based, local-outliers
Statistical-Based Outlier Detection (Distribution-based)
• Assumptions:
  • Knowledge of data (distribution, mean, variance)
• Working Hypothesis:
  • $H: o_i \in F$, where $i = 1, 2, \ldots, n$
• Discordancy Test:
  • Is $o_i$ in $F$ within standard deviation $= 15$?
• Alternative Hypothesis:
  • Inherent Distribution: $H: o_i \in G$, where $i = 1, 2, \ldots, n$
  • Mixture Distribution: $H: o_i \in (1 - \lambda)F + \lambda G$, where $i = 1, 2, \ldots, n$
  • Slippage Distribution: $H: o_i \in (1 - \lambda)F + \lambda F'$, where $i = 1, 2, \ldots, n$
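As a rough illustration of a discordancy test, the sketch below assumes the working distribution F is normal with a known mean and standard deviation, and flags any observation whose z-score exceeds a cutoff. The function name `discordant`, the cutoff of 3 standard deviations, and the sample values (including the mean of 100) are assumptions made for this example, not part of the slides.

```python
def discordant(observations, mean, std, z_threshold=3.0):
    """Return the observations that look inconsistent with N(mean, std).

    Assumption: F is a normal distribution with known mean and std,
    and "inconsistent" means |z-score| > z_threshold (hypothetical cutoff).
    """
    flagged = []
    for o in observations:
        z = abs(o - mean) / std  # distance from the mean in standard deviations
        if z > z_threshold:
            flagged.append(o)
    return flagged


# Hypothetical data: F assumed to be N(100, 15)
data = [98, 102, 95, 110, 160, 101]
print(discordant(data, mean=100, std=15))  # prints [160]
```

The value 160 lies 4 standard deviations from the assumed mean, so the working hypothesis is rejected for that observation only. A real test would pick the cutoff from the desired significance level.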
Statistical-Based Outlier Detection
• Strengths
  • Most outlier research has been done in this area, and many data distributions are known
• Weaknesses
  • Not good for multi-dimensional datasets
  • Assumes the distribution is known, which is not always the case
Distance-Based Outlier Detection
• Given two parameters
  • Radius r
  • Number of neighbors k
• Outlier:
  • Any point with fewer than k neighbors within its radius r
• Inlier:
  • Any point with k or more neighbors within its radius r
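The definition can be written more formally as follows. The notation ($D$ for the dataset, $N_r(p)$ for the neighbor set, $\mathrm{dist}$ for the distance function) is introduced here for illustration, and treating "within radius r" as $\le r$ rather than $< r$ is an assumption.

```latex
N_r(p) = \{\, q \in D \setminus \{p\} : \mathrm{dist}(p, q) \le r \,\}
\qquad
p \text{ is an outlier} \iff |N_r(p)| < k
\qquad
p \text{ is an inlier} \iff |N_r(p)| \ge k
```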
Algorithm: Nested Loop
• Steps
  • For each data point p
    • Scan all other points and count how many neighbors lie within distance r
• Enhancements:
  • Stop as soon as k or more neighbors are found ⇒ p is an inlier
  • Use an index (given p's location → find all points within distance r)
• (+) Easy to implement
• (-) Not efficient for large datasets
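The steps above, with the early-stop enhancement, can be sketched as below. This is a minimal illustration, assuming points are coordinate tuples, Euclidean distance, and "within distance r" meaning `<= r`. The function name `nested_loop_outliers` and the sample data are made up for the example, and the index-based enhancement is not shown.

```python
import math


def nested_loop_outliers(points, r, k):
    """Return the points that have fewer than k neighbors within distance r."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            if math.dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # early stop: p is already known to be an inlier
        if neighbors < k:
            outliers.append(p)
    return outliers


# Hypothetical data: four clustered points and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(nested_loop_outliers(points, r=1.5, k=2))  # prints [(10, 10)]
```

In the worst case every pair of points is compared, which is why the approach is inefficient for large datasets.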
Algorithm: Cell-Based
• Divide the space into a grid (cells)
• What if the cell size is smaller than r?
  • If a cell has more than k points ⇒ all of its points are inliers, without checking them individually (provided the cell is small enough that any two points in it are within distance r of each other)
• What if the cell size is larger than r?
  • Check each point in cell C against the other points in C
  • For boundary points, check the neighboring cells as well
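A minimal 2-D sketch of the small-cell case follows. It assumes a cell side of r/2 (so the cell diagonal is below r and any two points sharing a cell are neighbors), Euclidean distance, and "within distance r" meaning `<= r`. The function name `cell_based_outliers`, the choice of cell side, and the sample data are assumptions for illustration, not the slides' exact method.

```python
import math
from collections import defaultdict


def cell_based_outliers(points, r, k):
    """Return the 2-D points that have fewer than k neighbors within distance r."""
    side = r / 2  # assumed cell side; diagonal = r / sqrt(2) < r
    grid = defaultdict(list)  # cell coordinates -> indices of points in that cell
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / side), math.floor(y / side))].append(idx)

    outliers = []
    for (cx, cy), members in grid.items():
        if len(members) > k:
            # Each point has at least k neighbors inside its own cell:
            # all are inliers, no distance computations needed.
            continue
        # A neighbor within r = 2 * side is at most 2 cells away on each axis.
        candidates = [j
                      for dx in range(-2, 3)
                      for dy in range(-2, 3)
                      for j in grid.get((cx + dx, cy + dy), [])]
        for i in members:
            neighbors = sum(1 for j in candidates
                            if j != i and math.dist(points[i], points[j]) <= r)
            if neighbors < k:
                outliers.append(points[i])
    return outliers


# Hypothetical data: three points share one cell, one point is isolated
points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (5, 5)]
print(cell_based_outliers(points, r=1, k=2))  # prints [(5, 5)]
```

Dense cells are dismissed with a single count, and the remaining points are compared only against nearby cells rather than the whole dataset.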