Ethem Alpaydın
Department of Computer Engineering
Boğaziçi University
[email protected]
Intelligent Data Mining
What is Data Mining ?
• Search for very strong patterns (correlations, dependencies) in big data that can generalise to accurate future decisions.
• Also known as knowledge discovery in databases (KDD) or business intelligence
Example Applications
• Association (Basket Analysis): “30% of customers who buy diapers also buy beer.”
• Classification: “Young women buy small inexpensive cars.” “Older wealthy men buy big cars.”
• Regression: credit scoring
Example Applications
• Sequential Patterns: “Customers who pay two or more of the first three installments late have a 60% probability of defaulting.”
• Similar Time Sequences: “The value of the stock of company X has been similar to that of company Y’s.”
Example Applications
• Exceptions (Deviation Detection): “Are any of my customers behaving differently than usual?”
• Text Mining (Web Mining): “Which documents on the internet are similar to this document?”
IDIS – US Forest Service
• Identifies forest stands (areas similar in age, structure and species composition)
• Predicts how different stands would react to fire and what preventive measures should be taken
GTE Labs
• KEFIR (Key Findings Reporter)
• Evaluates health-care utilization costs
• Isolates groups whose costs are likely to increase in the next year
• Finds medical conditions for which there is a known procedure that improves health and decreases costs
Lockheed
• RECON: stock portfolio selection
• Creates a portfolio of 150-200 securities from an analysis of a database of the performance of 1,500 securities over a seven-year period
VISA
• Credit card fraud detection
• CRIS: neural-network software that learns to recognize the spending patterns of card holders and scores transactions by risk. “If a card holder normally buys gas and groceries and the account suddenly shows a purchase of stereo equipment in Hong Kong, CRIS sends a notice to the bank, which in turn can contact the card holder.”
ISL Ltd (Clementine) - BBC
• Audience prediction
• Program schedulers must be able to predict the likely audience for a program and the optimum time to show it
• Type of program, time, competing programs and other events affect audience figures
Data Mining is NOT Magic!
Data mining draws on the concepts and methods of databases, statistics, and machine learning.
From the Warehouse to the Mine
Transactional databases → (extract, transform, cleanse data) → Data warehouse → (define goals, data transformations) → Standard form
How to mine?
• Verification: computer-assisted, user-directed, top-down. Tools: query and report, OLAP (Online Analytical Processing)
• Discovery: automated, data-driven, bottom-up
Steps: 1. Define Goal
• Associations between products?
• New market segments or potential customers?
• Buying patterns over time or product sales trends?
• Discriminating among classes of customers?
Steps: 2. Prepare Data
• Integrate, select and preprocess existing data (already done if there is a warehouse)
• Add any other data relevant to the objective that might supplement the existing data
Steps: 2. Prepare Data (Cont’d)
• Select the data: identify relevant variables
• Data cleaning: errors, inconsistencies, duplicates, missing data
• Data scrubbing: mappings, data conversions, new attributes
• Visual inspection: data distribution, structure, outliers, correlations between attributes
• Feature analysis: clustering, discretization
Steps: 3. Select Tool
• Identify the task class: clustering/segmentation, association, classification, pattern detection/prediction in time series
• Identify the solution class: explanation (decision trees, rules) vs. black box (neural network)
• Model assessment, validation and comparison: k-fold cross-validation, statistical tests
• Combination of models
Steps: 4. Interpretation
• Are the results (explanations/predictions) correct, significant?
• Consultation with a domain expert
Example
• Data as a table of attributes

Name  Income    Owns a house?  Marital status  Default
Ali   25,000 $  Yes            Married         No
Veli  18,000 $  No             Married         Yes

We would like to be able to explain the value of one attribute in terms of the values of other attributes that are relevant.
Modelling Data
• Attributes x are observable
• y = f(x), where f is unknown and probabilistic
(Diagram: x → f → y)

Building a Model for Data
(Diagram: x feeds both the true f and the model f*; the difference between their outputs is the error.)
Learning from Data
Given a sample X = {x^t, y^t}_t, we build f*(x^t), a predictor of f(x^t), that minimizes the difference between our prediction and the actual value:

E = Σ_t [y^t − f*(x^t)]²
Types of Applications
• Classification: y in {C1, C2, ..., CK}
• Regression: y in ℝ
• Time-series prediction: x temporally dependent
• Clustering: group x according to similarity
Example
(Figure: customers plotted by yearly income and savings, labelled OK or DEFAULT.)
Example Solution
(The same plot with a threshold θ1 on yearly income (x1) and a threshold θ2 on savings (x2) separating OK from DEFAULT.)

RULE: IF yearly-income > θ1 AND savings > θ2 THEN OK ELSE DEFAULT
Decision Trees
x1: yearly income, x2: savings; y = 0: DEFAULT, y = 1: OK

x1 > θ1?
├─ no  → y = 0
└─ yes → x2 > θ2?
         ├─ no  → y = 0
         └─ yes → y = 1
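The tree above is just nested if-tests. A minimal sketch in Python (the thresholds t1 and t2 stand in for θ1 and θ2, which the slides leave unspecified):

```python
def predict(x1, x2, t1=1.0, t2=2.0):
    """The decision tree as nested ifs; t1, t2 are illustrative thresholds."""
    if x1 > t1:            # enough yearly income?
        if x2 > t2:        # enough savings?
            return 1       # OK
        return 0           # DEFAULT
    return 0               # DEFAULT
```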
Clustering
(Figure: customers plotted by yearly income and savings; clustering finds groups Type 1, Type 2 and Type 3 within the OK/DEFAULT data.)
Time-Series Prediction
(Figure: monthly values from last January through December — the past and present — are used to predict next January, the unknown future marked “?”.)
• Discovery of frequent episodes
Methodology
Initial standard form → data reduction (value and feature reductions) → split into a train set and a test set → train alternative predictors (Predictor 1, ..., Predictor L) on the train set → test the trained predictors on the test data and choose the best → accept the best predictor if it is good enough.
Data Visualisation
• Plot data in fewer dimensions (typically 2) to allow visual analysis
• Visualisation of structure, groups and outliers
Data Visualisation
(Figure: customers plotted by yearly income and savings; a rule captures a region of the data, with exceptions falling outside it.)
Techniques for Training Predictors
• Parametric multivariate statistics
• Memory-based (case-based) models
• Decision trees
• Artificial neural networks
Classification
• x: d-dimensional vector of attributes
• C1, C2, ..., CK: K classes
• Reject or doubt as an additional option
• Compute P(Cj|x) from data and choose k such that P(Ck|x) = maxj P(Cj|x)
Bayes’ Rule

P(Cj|x) = p(x|Cj) P(Cj) / p(x)

p(x|Cj): likelihood that an object of class j has features x
P(Cj): prior probability of class j
p(x): probability of an object (of any class) having features x
P(Cj|x): posterior probability that an object with features x is of class j
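Bayes’ rule can be computed directly once the priors and likelihoods are known. A toy two-class sketch (the numbers are made up for illustration, not from the slides):

```python
def posterior(priors, likelihoods):
    """P(Cj|x) = p(x|Cj) P(Cj) / p(x), with p(x) = sum_j p(x|Cj) P(Cj)."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # p(x)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Made-up example: P(C1)=0.3, P(C2)=0.7; p(x|C1)=0.5, p(x|C2)=0.1
post = posterior([0.3, 0.7], [0.5, 0.1])
best = max(range(len(post)), key=lambda j: post[j])  # class with max posterior
```

Even though C2 has the larger prior, the larger likelihood p(x|C1) makes C1 the maximum-posterior choice here.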
Statistical Methods
• Parametric, e.g., a Gaussian model for the class densities p(x|Cj)

Univariate:
p(x|Cj) = 1 / (√(2π) σj) · exp(−(x − μj)² / (2σj²))

Multivariate (d dimensions):
p(x|Cj) = 1 / ((2π)^(d/2) |Σj|^(1/2)) · exp(−½ (x − μj)ᵀ Σj⁻¹ (x − μj))
Training a Classifier
• Given data {x^t}_t with class labels, estimate for each class Cj (nj = number of examples of class Cj, n = total):

P̂(Cj) = nj / n

Univariate: p(x|Cj) is N(μj, σj²)
μ̂j = (1/nj) Σ_{x^t ∈ Cj} x^t
σ̂j² = (1/nj) Σ_{x^t ∈ Cj} (x^t − μ̂j)²

Multivariate: p(x|Cj) is N_d(μj, Σj)
μ̂j = (1/nj) Σ_{x^t ∈ Cj} x^t
Σ̂j = (1/nj) Σ_{x^t ∈ Cj} (x^t − μ̂j)(x^t − μ̂j)ᵀ
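The univariate estimators above are a few lines of code. A sketch with illustrative data (the sample split and values are made up):

```python
def fit_univariate_gaussians(X, y):
    """Per-class estimates: prior nj/n, mean, and variance of 1-D data."""
    n = len(X)
    params = {}
    for c in set(y):
        xs = [x for x, lab in zip(X, y) if lab == c]
        mu = sum(xs) / len(xs)                            # mu-hat_j
        var = sum((x - mu) ** 2 for x in xs) / len(xs)    # sigma-hat^2_j
        params[c] = (len(xs) / n, mu, var)                # (P-hat(Cj), mu, var)
    return params

params = fit_univariate_gaussians([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
                                  [0, 0, 0, 1, 1, 1])
```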
Example: 1D Case
Example: Different Variances
Example: Many Classes
2D Case: Equal Spheric Classes
Shared Covariances
Different Covariances
Actions and Risks
αi: action i
λ(αi|Cj): loss of taking action αi when the true class is Cj

R(αi|x) = Σj λ(αi|Cj) P(Cj|x)

Choose αk such that R(αk|x) = mini R(αi|x)
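Minimum-risk action selection is a weighted sum per action. A sketch assuming a loss matrix loss[i][j] = λ(αi|Cj) and posteriors P(Cj|x) are given (the numbers below are illustrative):

```python
def best_action(loss, posteriors):
    """Expected risk R(a_i|x) = sum_j loss[i][j] * P(Cj|x); pick the minimiser."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    k = min(range(len(risks)), key=lambda i: risks[i])
    return k, risks
```

With loss [[0, 10], [1, 0]] and posteriors [0.8, 0.2], the second action wins: a small certain loss beats a small chance of a large one.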
Function Approximation (Scoring)
Regression
y^t = f(x^t) + ε, where ε is noise. In linear regression,

f(x^t | w, w0) = w x^t + w0

Find w, w0 that minimize

E(w, w0) = Σ_t [y^t − (w x^t + w0)]²

by solving ∂E/∂w = 0 and ∂E/∂w0 = 0.
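Setting the two derivatives to zero gives the familiar closed-form solution, sketched here for 1-D data:

```python
def fit_line(xs, ys):
    """Closed-form least squares for f(x) = w*x + w0."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # w = covariance(x, y) / variance(x); w0 from the means
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    w0 = my - w * mx
    return w, w0
```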
Linear Regression
Polynomial Regression
• E.g., quadratic:

f(x^t | w2, w1, w0) = w2 (x^t)² + w1 x^t + w0

E(w2, w1, w0) = Σ_t [y^t − (w2 (x^t)² + w1 x^t + w0)]²
Polynomial Regression
Multiple Linear Regression
• d inputs:

f(x1^t, ..., xd^t | w1, ..., wd, w0) = w1 x1^t + w2 x2^t + ... + wd xd^t + w0 = wᵀx^t

E(w1, ..., wd, w0) = Σ_t [y^t − f(x1^t, ..., xd^t | w1, ..., wd, w0)]²
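Minimising E leads to the normal equations (XᵀX)w = Xᵀy. A self-contained sketch with plain Gaussian elimination (a real implementation would use a linear-algebra library):

```python
def fit_multiple_linear(X, y):
    """Least squares for f(x) = w1 x1 + ... + wd xd + w0 via normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend x_0 = 1 for w_0
    d = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):               # back substitution
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, d))) / A[r][r]
    return w                                     # [w0, w1, ..., wd]
```

The quadratic regression of the previous slide is the same fit with inputs (x, x²) in place of (x1, x2).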
Feature Selection
• Subset selection: forward and backward methods
• Linear projection: Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA)
Sequential Feature Selection
Forward selection: start with single features (x1) (x2) (x3) (x4); keep the best, here x3; then try pairs containing it, (x1 x3) (x2 x3) (x3 x4); then triples, (x1 x2 x3) (x2 x3 x4); and so on, stopping when adding a feature no longer helps.
Backward selection: start with the full set (x1 x2 x3 x4); try dropping one feature at a time, (x1 x2 x3) (x1 x2 x4) (x1 x3 x4) (x2 x3 x4); then drop another, e.g. (x2 x4) (x1 x4) (x1 x2); and so on.
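Forward selection is a greedy loop around any subset-scoring function. A sketch where score(subset) is a user-supplied assumption (e.g. validation accuracy of a model trained on that subset):

```python
def forward_select(features, score, k):
    """Greedily add the feature that most improves score(subset), up to k
    features, stopping early when no addition helps."""
    chosen = []
    remaining = list(features)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        if score(chosen + [best]) <= score(chosen):
            break                      # adding any feature no longer helps
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Backward selection is the mirror image: start from the full set and greedily remove the feature whose removal hurts the score least.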
Principal Components Analysis (PCA)
(Figure: data in the original (x1, x2) axes with principal axes z1, z2 overlaid; after the whitening transform the projected data has unit variance along each axis.)
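The first principal axis z1 is the direction of largest variance: the leading eigenvector of the covariance matrix. A 2-D sketch using power iteration (real PCA uses a full eigendecomposition and works in any dimension):

```python
def first_principal_component(points, iters=200):
    """Leading eigenvector of the 2x2 covariance matrix via power iteration."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # entries of the covariance matrix
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])  # C @ v
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v
```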
Linear Discriminant Analysis (LDA)
(Figure: two classes in the (x1, x2) plane projected onto the direction z1 that best separates them.)
Memory-based Methods
• Case-based reasoning
• Nearest-neighbor algorithms
• Keep a list of known instances and interpolate the response from those
Nearest Neighbor
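A minimal k-nearest-neighbour classifier: store the training instances and answer a query by a majority vote among the k closest (the training points below are illustrative):

```python
def knn_predict(train, query, k=3):
    """k-NN vote; train is a list of ((x1, x2), label) pairs."""
    def dist2(p, q):                        # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(train, key=lambda item: dist2(item[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)  # majority label
```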
Local Regression
Mixture of Experts
Missing Data
• Ignore cases with missing data
• Mean imputation
• Imputation by regression
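Mean imputation is the simplest of the three options: replace each missing entry by the mean of the observed values of that attribute. A sketch using None to mark missing values:

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```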
Training Decision Trees
(The tree from before: if x1 > θ1 then test x2 > θ2, predicting y = 1 when both hold and y = 0 otherwise; equivalently, axis-aligned splits of the (x1, x2) plane at θ1 and θ2.)
Measuring Disorder
(Figure: candidate splits along x1 and x2 yield leaves with class counts such as 7–0, 1–9, 8–5 and 0–4; the purer the leaves, the lower the disorder.)
Entropy

e = −(n_left / n) log(n_left / n) − (n_right / n) log(n_right / n)
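The entropy of a split is a direct translation of the formula; base-2 logs are used here (the slide does not fix a base), and 0·log 0 is taken as 0 so pure splits score zero:

```python
import math

def split_entropy(n_left, n_right):
    """e = -(nl/n) log2(nl/n) - (nr/n) log2(nr/n); pure splits give 0."""
    n = n_left + n_right
    e = 0.0
    for part in (n_left, n_right):
        if part:                     # skip empty side: 0 * log(0) -> 0
            p = part / n
            e -= p * math.log2(p)
    return e
```

A 7–0 split is pure (e = 0), a 5–5 split is maximally disordered (e = 1 bit).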
Artificial Neural Networks
(Diagram: inputs x1, x2, ..., xd and bias x0 = +1, weights w0, w1, ..., wd, activation g, output y.)

y = g(w1 x1 + w2 x2 + ... + wd xd + w0) = g(wᵀx)

Regression: g is the identity. Classification: g is the sigmoid (0/1).
Training a Neural Network
• d inputs; training set X = {x^t, y^t}:

o = g( Σ_{i=0}^{d} wi xi ) = g(wᵀx)

E(w | X) = Σ_{t ∈ X} (y^t − o^t)² = Σ_{t ∈ X} [y^t − g(wᵀx^t)]²

Find w that minimizes E on X.
Nonlinear Optimization
Gradient descent: iterative learning, starting from a random w:

Δwi = −η ∂E/∂wi

where η is the learning factor.
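Applied to the linear-regression error from the earlier slides (g = identity), gradient descent becomes a simple update loop. A stochastic (per-example) sketch; starting from w = 0 rather than random for reproducibility:

```python
def gradient_descent(xs, ys, eta=0.01, epochs=1000):
    """Minimise E = sum_t (y_t - (w x_t + w0))^2 by per-example updates."""
    w, w0 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w * x + w0)   # y - o
            w += eta * err * x       # Delta w  = eta (y - o) x
            w0 += eta * err          # Delta w0 = eta (y - o)
    return w, w0

# data generated from y = 2x + 1
w, w0 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```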
Neural Networks for Classification
K outputs oj, j = 1, ..., K; each oj estimates P(Cj|x):

oj = sigmoid(wjᵀx) = 1 / (1 + exp(−wjᵀx))
Multiple Outputs
(Diagram: inputs x1, ..., xd and bias x0 = +1 connect to outputs o1, ..., oK through weights wji.)

oj^t = g( Σ_{i=0}^{d} wji xi^t ) = g(wjᵀx^t)
Iterative Training
Training set X = {x^t, y^t}:

E(W | X) = Σ_t Σ_j (yj^t − oj^t)²

Δwji = −η ∂E/∂wji ∝ Σ_t (yj^t − oj^t) g′(wjᵀx^t) xi^t

Linear output: Δwji = η Σ_t (yj^t − oj^t) xi^t

Nonlinear (sigmoid) output: Δwji = η Σ_t (yj^t − oj^t) oj^t (1 − oj^t) xi^t
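The sigmoid update rule for a single output unit, trained here on the AND function as an illustrative linearly separable dataset (the data, η and epoch count are assumptions for the example):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_sigmoid_unit(data, eta=0.5, epochs=5000):
    """Delta rule: w_i += eta (y - o) o (1 - o) x_i, with bias input x_0 = +1."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, y in data:
            xb = [1.0] + list(x)
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
            delta = eta * (y - o) * o * (1 - o)
            w = [wi + delta * xi for wi, xi in zip(w, xb)]
    return w

def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

# learn the (linearly separable) AND function
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_sigmoid_unit(AND)
```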
Nonlinear classification
Linearly separable vs. not linearly separable; the latter requires a nonlinear discriminant.
Multi-Layer Networks
(Diagram: inputs x1, ..., xd with bias x0 = +1 feed hidden units h1, ..., hH with bias h0 = +1, which feed outputs o1, ..., oK; first-layer weights wpi, second-layer weights tjp.)

hp^t = sigmoid( Σ_{i=0}^{d} wpi xi^t ), p = 1, ..., H

oj^t = g( Σ_{p=0}^{H} tjp hp^t )
Probabilistic Networks
Evaluating Learners
1. Given a model M, how can we assess its performance on real (future) data?
2. Given M1, M2, ..., ML which one is the best?
Cross-validation
Partition the data into k parts. In turn, hold out one part for testing and train on the remaining k − 1 parts. Repeat k times and average.
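The k-fold bookkeeping can be sketched as an index splitter; each fold serves once as the test set while the rest train:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds; return
    (test_indices, train_indices) pairs, one per fold."""
    folds = []
    start = 0
    for j in range(k):
        size = n // k + (1 if j < n % k else 0)   # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return [(f, [i for i in range(n) if i not in f]) for f in folds]
```

In practice the data is shuffled (or stratified by class) before splitting; that step is omitted here to keep the sketch deterministic.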
Combining Learners: Why?
(Diagram: from the initial standard form, split off a validation set and a train set; train Predictor 1, ..., Predictor L on the train set and choose the best on the validation set.)
Combining Learners: How?
(Diagram: the same setup, but instead of choosing a single best predictor, combine the outputs of Predictor 1, ..., Predictor L by voting.)
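For classification, the simplest combination rule is a majority vote over the L predictors' outputs for a given input:

```python
def vote(predictions):
    """Majority vote over the L predictors' outputs for one input."""
    return max(set(predictions), key=predictions.count)
```

For regression, the analogous combiner is the (possibly weighted) average of the L predictions.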
Conclusions:The Importance of Data
• Extract valuable information from large amounts of raw data
• A large amount of reliable data is a must. The quality of the solution depends highly on the quality of the data
• Data mining is not alchemy; we cannot turn stone into gold
Conclusions: The Importance of the Domain Expert
• Joint effort of human experts and computers
• Any information (symmetries, constraints, etc.) regarding the application should be used to help the learning system
• Results should be checked for consistency by domain experts
Conclusions: The Importance of Being Patient
• Data mining is not straightforward; repeated trials are needed before the system is fine-tuned.
• Mining may be lengthy and costly. Large expectations lead to large disappointments!
Once again: Important Requirements for Mining
• A large amount of high-quality data
• Devoted and knowledgeable experts on:
  1. The application domain
  2. Databases (data warehouse)
  3. Statistics and machine learning
• Time and patience
That’s all folks!