Overview of Methods
Data mining techniques
What techniques do, examples, advantages & disadvantages
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved
History
• Statistics
• AI:
– genetic algorithms, neural networks
  • analogies with biology
– memory-based reasoning
– link analysis from graph theory
Techniques
• Statistical
– Market-Basket Analysis - find groups of items
– Memory-Based Reasoning - case based
– Cluster Detection - undirected (quantitative MBA)
• Artificial Intelligence
– Link Analysis - MCI’s Friends & Family
– Decision Trees, Rule Induction - production rule
– Neural Networks - automatic pattern detection
– Genetic Algorithms - keep best parameters
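Market-basket analysis, the first technique listed, looks for groups of items that tend to occur together. A minimal sketch of the idea, assuming a hypothetical list of transactions (the baskets and item names are invented for illustration and are not from the slides), counts how often each pair of items shares a basket:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives every pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

support = pair_support(baskets)
```

Pairs with high support are candidates for association rules. Practical tools usually add further measures such as confidence and lift, which this sketch leaves out.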
Models
• Regression: Y = a + bX
• Classification: assign new record to class
• Predictive: assign value to new record
• Clustering: groups for data
• Time-series: assign future value
• Links: patterns in data
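The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below assumes a small hypothetical data set (invented for illustration) and uses the standard closed-form estimates for a and b. The function name fit_line is an assumption, not something the slides define.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(X, Y) / variance(X).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical points lying exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = fit_line(xs, ys)

# Predictive use: assign a value to a new record with X = 4.
predicted = a + b * 4
```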
Fitting
• Underfitting: not enough detail
– leave out important variables
• Overfitting: too much detail
– memorizes training set, but doesn’t help with new data
  • data set too small
  • redundancy in data
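The overfitting point can be illustrated with a deliberately extreme model that only memorizes its training set. This is a sketch under stated assumptions: the Memorizer class, the records, and the labels are all hypothetical and invented for illustration.

```python
from collections import Counter

class Memorizer:
    """Deliberately overfit model: a lookup table of the training set."""

    def fit(self, records, labels):
        self.table = dict(zip(records, labels))
        # Fallback for records never seen in training: the most common label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, record):
        return self.table.get(record, self.default)

def accuracy(model, records, labels):
    hits = sum(model.predict(r) == y for r, y in zip(records, labels))
    return hits / len(records)

# Hypothetical (age, income band) records, invented for illustration.
train_x = [(25, "low"), (40, "high"), (33, "low"), (58, "high"), (27, "low")]
train_y = ["late", "on-time", "late", "on-time", "late"]
test_x = [(29, "low"), (45, "high"), (61, "high"), (22, "low")]
test_y = ["late", "on-time", "on-time", "late"]

model = Memorizer().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)
```

The model is perfect on the training set by construction, yet none of the new records matches a memorized one, so every test prediction falls back to the default label and accuracy drops.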
Comparison of Features
Feature          Rules      Neural Net  Case Base   Genetic
Noisy data       Good       Very good   Good        Very good
Missing data     Good       Good        Very good   Good
Large sets       Very good  Poor        Good        Good
Different types  Good       Numerical   Very good   Transform
Accuracy         High       Very high   High        High
Explanation      Very good  Poor        Very good   Good
Integration      Good       Good        Good        Very good
Ease             Easy       Difficult   Easy        Difficult
Data Mining Functions
• Classification
– Identify categories in data
• Prediction
– Formula to predict future observations
• Association
– Rules using relationships among entities
• Detection
– Anomalies & irregularities (fraud detection)
Financial Applications
Technique    Application             Problem Type
Neural net   Forecast stock price    Prediction
NN, Rule     Forecast bankruptcy     Prediction
             Fraud detection         Detection
NN, Case     Forecast interest rate  Prediction
NN, visual   Late loan detection     Detection
Rule         Credit assessment       Prediction
             Risk classification     Classification
Rule, Case   Corporate bond rate     Prediction
Telecom Applications
Technique                   Application                Problem Type
Neural net, rule induction  Forecast network behavior  Prediction
Rule induction              Churn                      Classification
                            Fraud detection            Detection
Case based                  Call tracking              Classification
Marketing Applications
Technique                        Application            Problem Type
Rule induction                   Market segment         Classification
                                 Cross-selling          Association
Rule induction, visual           Lifestyle analysis     Classification
                                 Performance analysis   Association
Rule induction, genetic, visual  Reaction to promotion  Prediction
Case based                       Online sales support   Classification
Web Applications
Technique                      Application                        Problem Type
Rule induction, visualization  User browsing similarity analysis  Classification, Association
Rule-based heuristics          Web page content similarity        Association
Other Applications
Technique                Application            Problem Type
Neural net               Software cost          Detection
Neural net, rule induct  Litigation assessment  Prediction
Rule induct              Insurance fraud        Detection
                         Healthcare except.     Detection
Case based               Insurance claim        Prediction
                         Software quality       Classification
Genetic algor.           Budget spending        Classification
Data Sets
• Loan Applications
– classification
• Job Applications
– classification
• Insurance Fraud
– detection
• Expenditure Data
– prediction
Loan Data
• 650 observations
• OUTCOMES (binary):
– On-time: cost of error $300
– Late (default): cost of error $2,000
• Variables
– Age, Income, Assets, Debts, Want, Credit
  • Credit ordinal
– Transform: Assets, Debts, & Want → Risk
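The two costs of error make the loan problem cost-sensitive: a classifier should be judged by total misclassification cost, not just error count. The sketch below assumes that $300 is the cost of misclassifying a loan that would have been on time and $2,000 the cost of misclassifying one that turns out late. That reading, the function name, and the sample outcomes are assumptions for illustration.

```python
# Costs of error from the slide; which error each applies to is an assumption.
COST_ONTIME_ERROR = 300    # an on-time loan misclassified as late
COST_LATE_ERROR = 2000     # a late (default) loan misclassified as on-time

def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == p:
            continue  # correct predictions carry no error cost
        cost += COST_ONTIME_ERROR if a == "on-time" else COST_LATE_ERROR
    return cost

# Hypothetical outcomes for five loans, invented for illustration.
actual = ["on-time", "late", "on-time", "late", "on-time"]
predicted = ["on-time", "on-time", "late", "late", "on-time"]
cost = total_error_cost(actual, predicted)
```

Here the two errors are one of each kind, so the total is $300 + $2,000. Two models with the same error rate can differ widely on this measure.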
Job Application Data
• 500 observations
• OUTCOMES (ordinal):
– Unacceptable
– Minimal
– Acceptable
– Excellent
• Variables
– Age, State, Degree, Major, Experience
  • State nominal; degree & major ordinal
  • State is superfluous
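Because the outcome is ordinal and State is superfluous, one plausible preparation step is to map the outcomes to ranks and drop State. The sketch assumes the outcomes are listed from worst to best. The field names, the sample applicant, and the prepare helper are hypothetical.

```python
# Outcome categories in slide order, assumed to run from worst to best.
OUTCOME_RANK = {"Unacceptable": 0, "Minimal": 1, "Acceptable": 2, "Excellent": 3}

def prepare(record):
    """Drop the superfluous State field and attach a numeric outcome rank."""
    cleaned = {k: v for k, v in record.items() if k != "State"}
    cleaned["OutcomeRank"] = OUTCOME_RANK[record["Outcome"]]
    return cleaned

# Hypothetical applicant, invented for illustration.
applicant = {
    "Age": 28,
    "State": "NE",
    "Degree": "BS",
    "Major": "IS",
    "Experience": 3,
    "Outcome": "Acceptable",
}
row = prepare(applicant)
```

Degree and Major are also described as ordinal, but the slides do not give their orderings, so the sketch leaves them untouched.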
Insurance Claim Data
• 5000 observations
• OUTCOMES (binary):
– OK: cost of error $500
– Fraudulent: cost of error $2,500
• Variables
– Age, Gender, Claim, Tickets, Prior claims, Attorney
  • Gender & attorney nominal, tickets & prior claims categorical
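With unequal costs of error, the break-even probability for flagging a claim is no longer 0.5. The sketch assumes that $500 is the cost of wrongly flagging an OK claim and $2,500 the cost of passing a fraudulent one. That reading and both function names are assumptions. If p is the estimated probability of fraud, flagging has expected cost (1 - p) × 500 and passing has expected cost p × 2,500, so flagging is cheaper once p exceeds 500 / (500 + 2,500).

```python
from fractions import Fraction

COST_OK_ERROR = 500        # slide: cost of error for an OK claim
COST_FRAUD_ERROR = 2500    # slide: cost of error for a fraudulent claim

def fraud_threshold(cost_ok_error, cost_fraud_error):
    """Fraud probability at which flagging and passing cost the same.

    Flagging costs (1 - p) * cost_ok_error in expectation; passing costs
    p * cost_fraud_error. Setting the two equal and solving for p gives
    cost_ok_error / (cost_ok_error + cost_fraud_error).
    """
    return Fraction(cost_ok_error, cost_ok_error + cost_fraud_error)

def should_flag(p_fraud, threshold):
    """Flag the claim when its estimated fraud probability exceeds the threshold."""
    return p_fraud > threshold

threshold = fraud_threshold(COST_OK_ERROR, COST_FRAUD_ERROR)
```

Under these assumptions the threshold is 1/6, so a claim with a 20% estimated chance of fraud would be flagged even though it is more likely OK than fraudulent.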
Expenditure Data
• 10,000 observations
• OUTCOMES:
– Could predict response in a number of categories
– Others
• Variables:
– Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
– Churn, proportion of income spent on seven categories