DATA MINING - WPIweb.cs.wpi.edu/~cs561/s14/Lectures/W4/DataMining-3.pdf · 2014-02-13 · DATA MINING 1 . What To Cover ... Association Rule Mining • Clustering • Classification

DATA MINING


What To Cover

• Frequent Itemset Mining

• Association Rule Mining

• Clustering

• Classification

• Deviation (Outlier) Detection


Motivation for Outlier Analysis

• Fraud Detection
   • Credit card, telecommunications, criminal activity in e-commerce
• Customized Marketing
   • High/low income buying habits
• Medical Treatments
   • Unusual responses to various drugs
• Financial Applications
   • Stock tracking

What is an outlier?

• Global outlier
   • An observation inconsistent with the rest of the dataset
• Local outlier (a special kind of outlier)
   • An observation inconsistent with its neighborhood
   • A local instability or discontinuity

(Figure: points O1 and O2 appear to be outliers relative to the rest of the data.)

Outlier Detection Approaches

• Objective:
   • Define what data can be considered inconsistent in a given data set
      • Statistical-Based Outlier Detection
      • Deviation-Based Outlier Detection
      • Distance-Based Outlier Detection
   • Find an efficient method to mine the outliers

Outlier Analysis - Outline

• Introduction / Motivation / Definition
• Statistical-based Detection
   • Distribution-based, depth-based
• Deviation-based Method
   • Sequential exception, OLAP data cube
• Distance-based Detection
   • Index-based, nested-loop, cell-based, local-outliers

Statistical-Based Outlier Detection (Distribution-based)

• Assumptions:
   • Knowledge of data (distribution, mean, variance)
• Working Hypothesis:
   • $H: o_i \in F$, where $i = 1, 2, \ldots, n$
• Discordancy Test:
   • Is $o_i$ in $F$ within standard deviation = 15?
• Alternative Hypothesis:
   • Inherent Distribution: $\bar{H}: o_i \in G$, where $i = 1, 2, \ldots, n$
   • Mixture Distribution: $\bar{H}: o_i \in (1 - \lambda)F + \lambda G$, where $i = 1, 2, \ldots, n$
   • Slippage Distribution: $\bar{H}: o_i \in (1 - \lambda)F + \lambda F'$, where $i = 1, 2, \ldots, n$
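As a rough illustration of a discordancy test, the sketch below assumes the working distribution F is approximately normal and flags any observation that lies more than a chosen number of sample standard deviations from the sample mean. The function name `discordant_points`, the `threshold` parameter, and the sample readings are hypothetical and not taken from the slides.

```python
from statistics import mean, stdev

def discordant_points(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard
    deviations from the sample mean (assumes F is roughly normal)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 25.0]
print(discordant_points(readings, threshold=2.0))
```

Because the extreme value itself inflates the estimated standard deviation, a stricter threshold such as 3.0 would flag nothing in this small sample. This is one reason the approach depends on knowing the distribution and its parameters in advance.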


Statistical-Based Outlier Detection

• Strengths
   • Most outlier research has been done in this area, and many data distributions are known
• Weaknesses
   • Not good for multi-dimensional datasets
   • Assumes the distribution is known, which is not always the case


Distance-Based Outlier Detection

• Given two parameters
   • Radius r
   • Number of neighbors k
• Outlier:
   • Any point with fewer than k neighbors within its radius r
• Inlier:
   • Any point with k or more neighbors within its radius r


Algorithm: Nested Loop

• Steps
   • For each data point p
      • Scan all other points and count how many neighbors lie within distance r
• Enhancements:
   • Stop when you find k or more neighbors → p is an inlier
   • Use an index (given p's location → find all points within distance r)

(+) Easy to implement
(-) Not efficient for large datasets
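The nested-loop steps, including the early-stop enhancement, can be sketched as follows. This is a minimal illustration under stated assumptions: "within distance r" is read as Euclidean distance at most r, a point is not counted as its own neighbor, and the function name `nested_loop_outliers` and the sample points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def nested_loop_outliers(points, r, k):
    """Return the points that have fewer than k neighbors within distance r."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            if dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # early stop: p is already known to be an inlier
        if neighbors < k:
            outliers.append(p)
    return outliers

# Hypothetical data: a small cluster plus one isolated point.
pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.4, 0.4), (5.0, 5.0)]
print(nested_loop_outliers(pts, r=1.0, k=2))
```

The worst case still compares every pair of points, which is why the approach does not scale well to large datasets.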


Algorithm: Cell-Based

• Divide the space into a grid (cells)
• What if the cell size is smaller than r (so that any two points in the same cell are within distance r of each other)?
   • If a cell has more than k points → all of its points are inliers without checking them
• What if the cell size is larger than r?
   • Check each point in cell C against the other points in C
   • For boundary points, check the neighboring cells as well
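The small-cell case can be sketched for 2-D data as below. This is only an illustration under assumptions that are not in the slides: the cell side is set to r/√2 so that the cell diagonal equals r, which guarantees that two points sharing a cell are within r of each other. Points in sparse cells are then checked against all cells up to two steps away in each direction, since a neighbor within r can be at most that far. The name `cell_based_outliers` and the sample points are hypothetical.

```python
from collections import defaultdict
from math import dist, floor, sqrt

def cell_based_outliers(points, r, k):
    """Return the 2-D points that have fewer than k neighbors within distance r."""
    side = r / sqrt(2)  # assumption: cell diagonal equals r
    grid = defaultdict(list)  # cell coordinates -> indices of points in that cell
    for i, (x, y) in enumerate(points):
        grid[(floor(x / side), floor(y / side))].append(i)

    outliers = []
    for (cx, cy), members in grid.items():
        if len(members) > k:
            continue  # dense cell: every point has at least k neighbors in it
        for i in members:
            neighbors = 0
            # A neighbor within r lies at most 2 cells away along each axis.
            for dx in range(-2, 3):
                for dy in range(-2, 3):
                    for j in grid.get((cx + dx, cy + dy), ()):
                        if j != i and dist(points[i], points[j]) <= r:
                            neighbors += 1
            if neighbors < k:
                outliers.append(points[i])
    return outliers

# Hypothetical data: a small cluster plus one isolated point.
pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.4, 0.4), (5.0, 5.0)]
print(cell_based_outliers(pts, r=1.0, k=2))
```

Dense cells are skipped without any distance computation, which is where the savings over the nested loop come from. Only points in sparse cells pay for the neighborhood scan.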

